commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Non-ASCII/UTF-8 characters lost in WARC-Target-URI during WAT/WET extraction #27

Open sebastian-nagel opened 1 year ago

sebastian-nagel commented 1 year ago

The WARC-Target-URI (from the WARC file) https://esfsport.ir/1173-دختران-والیبالیست-اصفهان-قهرمان-کشور-شدند.html loses all non-ASCII characters during WAT/WET extraction. Here is the corresponding WAT record:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://esfsport.ir/1173------.html
...

...,"WARC-Target-URI":"https://esfsport.ir/1173------.html"}}}

These URLs result from redirects which are deliberately not normalized. To address the issue:

  1. use URI.toASCIIString() when writing WARC files: URI.toString() returns the URI as a string without percent-encoding the non-ASCII characters
  2. try to fix the WAT/WET extractor to cope with these URLs
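
A minimal sketch of the first point, using only java.net.URI from the JDK. The class name TargetUriEncoding and the helper toWarcTargetUri are hypothetical and not part of ia-web-commons or any WARC writer; the sample URL is a shortened form of the one reported above, written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.net.URI;

final class TargetUriEncoding {

    /**
     * Hypothetical helper: returns the URI as US-ASCII, with every
     * non-ASCII character percent-encoded as its UTF-8 byte sequence.
     */
    static String toWarcTargetUri(URI uri) {
        return uri.toASCIIString();
    }

    public static void main(String[] args) {
        // "\u062F\u062E\u062A\u0631\u0627\u0646" is the first Persian word
        // of the path in the reported URL.
        URI uri = URI.create(
            "https://esfsport.ir/1173-\u062F\u062E\u062A\u0631\u0627\u0646.html");

        // toString() keeps the non-ASCII characters as they are.
        System.out.println(uri.toString());

        // toASCIIString() percent-encodes them, so nothing can be dropped
        // later by an ASCII-only processing step.
        System.out.println(toWarcTargetUri(uri));
    }
}
```

Percent-encoding at write time keeps the WARC header pure ASCII, which sidesteps the problem for newly written files but does not help with WARC files that already contain unencoded URIs; those still need the extractor fix from the second point.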

Quick estimate of the impact of this bug: < 0.05% of WAT/WET records