Open sebastian-nagel opened 1 year ago
The WARC-Target-URI (from the WARC file) https://esfsport.ir/1173-دختران-والیبالیست-اصفهان-قهرمان-کشور-شدند.html loses all non-ASCII characters during WAT/WET extraction. Here is the corresponding record from the WAT file:
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://esfsport.ir/1173------.html
...
...,"WARC-Target-URI":"https://esfsport.ir/1173------.html"}}}
These URLs result from redirects which are deliberately not normalized. To address the issue:
Quick estimate of the impact of this bug: < 0.05% of WAT/WET records
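The symptom can be reproduced outside the extractor. The sketch below is only an illustration: it assumes (this is not confirmed in the report) that the loss comes from a conversion to ASCII that silently drops non-encodable characters. The helper names `strip_non_ascii` and `percent_encode_uri` are hypothetical and not part of any Common Crawl tool. The second helper shows one possible way to keep the URI lossless in an ASCII-only field, namely percent-encoding the UTF-8 bytes.

```python
from urllib.parse import quote, unquote

# URI taken from the report above.
TARGET_URI = (
    "https://esfsport.ir/1173-دختران-والیبالیست-اصفهان-قهرمان-کشور-شدند.html"
)


def strip_non_ascii(uri: str) -> str:
    """Mimic the observed symptom: drop every character outside ASCII.

    Assumption: the real extractor may lose the characters in a different way.
    """
    return uri.encode("ascii", errors="ignore").decode("ascii")


def percent_encode_uri(uri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 bytes.

    Reserved and unreserved URI characters (and existing '%' escapes)
    are listed as safe so they are left untouched.
    """
    return quote(uri, safe=":/?#[]@!$&'()*+,;=%-._~")


if __name__ == "__main__":
    print(strip_non_ascii(TARGET_URI))
    # -> https://esfsport.ir/1173------.html  (matches the WAT record)

    encoded = percent_encode_uri(TARGET_URI)
    print(encoded.isascii())                  # -> True
    print(unquote(encoded) == TARGET_URI)     # -> True
```

Percent-encoding changes the literal form of the URI, so whether it is acceptable depends on whether WAT/WET consumers expect the value to be byte-identical to the WARC-Target-URI in the WARC file. Writing the header as UTF-8 without any transformation would be the other option.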