Non-ASCII/UTF-8 characters lost in WARC-Target-URI during WAT/WET extraction

The WARC-Target-URI (from the WARC file) https://esfsport.ir/1173-دختران-والیبالیست-اصفهان-قهرمان-کشور-شدند.html loses all non-ASCII characters during WAT/WET extraction. Here is the corresponding record in the WAT file:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://esfsport.ir/1173------.html
...

...,"WARC-Target-URI":"https://esfsport.ir/1173------.html"}}}

These URLs result from redirects which are deliberately not normalized. To address the issue:

use URI.toASCIIString() when writing WARC files: URI.toString() returns the URI as a string without percent-encoding the non-ASCII characters
try to fix the WAT/WET extractor to cope with these URLs
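
A minimal sketch of the difference between the two java.net.URI methods mentioned in the first point. The class name UriEncodingDemo and the helper toAsciiUri are hypothetical and not part of ia-web-commons or the crawler; the URL is a shortened variant of the one above, with the Persian word written as Unicode escapes so the source file stays ASCII-only.

```java
import java.net.URI;

public class UriEncodingDemo {
    // Hypothetical helper: returns the URI with all non-ASCII characters
    // percent-encoded as UTF-8 bytes, so the result is pure US-ASCII.
    static String toAsciiUri(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        // Shortened variant of the URL from this issue (first Persian word only).
        String url = "https://esfsport.ir/1173-\u062f\u062e\u062a\u0631\u0627\u0646.html";
        URI uri = URI.create(url);

        // toString() keeps the non-ASCII characters as they are.
        System.out.println(uri.toString());

        // toASCIIString() percent-encodes them, which is the form that
        // should be safe to write into the WARC-Target-URI header.
        System.out.println(uri.toASCIIString());
    }
}
```

This only covers the writing side. Whether the WAT/WET extractor would then pass the percent-encoded URI through unchanged is an assumption that still needs to be checked.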

Quick estimate of the impact of this bug: < 0.05% of WAT/WET records

commoncrawl / ia-web-commons

Non-ASCII/UTF-8 characters lost in WARC-Target-URI during WAT/WET extraction #27