Open Phil1108 opened 3 years ago
Thanks for flagging. It seems that CC 2020-34 added a new header: "WARC-Identified-Content-Language". Instead of using a WARC library, I rolled my own simplified parser in https://github.com/facebookresearch/cc_net/blob/master/cc_net/process_wet_file.py#L57, specialized to the CC archive. I'll need to introduce something more robust here (maybe just use the library, but I have to be careful with paragraph numbering, otherwise I might break the CC100 script).
@gwenzek this also happens in dumps 2020-24, 2020-29, 2020-40, 2020-45, 2020-50, and 2022-05. Any news here?
You can replace https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#L73-L79 with
    headers_map = {}
    for header in headers[1:]:
        if not header:
            continue
        key, value = header.split(": ", 1)
        headers_map[key] = value
    warc_type = headers_map["WARC-Type"]
    if warc_type != "conversion":
        return None
    url = headers_map["WARC-Target-URI"]
    date = headers_map["WARC-Date"]
    digest = headers_map["WARC-Block-Digest"]
    length = int(headers_map["Content-Length"])
so that a newly added header is handled correctly.
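For anyone who wants to try the idea outside of cc_net, here is a minimal standalone sketch of the same name-based lookup. The function name `parse_headers`, the returned dict keys, and the sample record are made up for illustration (including where the new header sits in the block); this is not the actual cc_net code.

```python
from typing import Dict, List, Optional


def parse_headers(headers: List[str]) -> Optional[Dict[str, object]]:
    """Look up WET record headers by name instead of by position.

    Hypothetical helper for illustration, not part of cc_net.
    """
    headers_map: Dict[str, str] = {}
    # headers[0] is the version line (e.g. "WARC/1.0"), so skip it.
    for header in headers[1:]:
        # Ignore blank lines and anything that is not "Key: value".
        if not header or ": " not in header:
            continue
        key, value = header.split(": ", 1)
        headers_map[key] = value

    # Only "conversion" records carry the extracted text.
    if headers_map.get("WARC-Type") != "conversion":
        return None
    return {
        "url": headers_map["WARC-Target-URI"],
        "date": headers_map["WARC-Date"],
        "digest": headers_map["WARC-Block-Digest"],
        "length": int(headers_map["Content-Length"]),
    }


# Made-up record; the extra header no longer shifts the other fields.
sample = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://example.com/",
    "WARC-Date: 2020-08-03T12:00:00Z",
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>",
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000002>",
    "WARC-Block-Digest: sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "WARC-Identified-Content-Language: rus",
    "Content-Type: text/plain",
    "Content-Length: 42",
]
parsed = parse_headers(sample)
```

Because every field is fetched by key, an unknown header such as "WARC-Identified-Content-Language" is simply stored and ignored. A record that lacks one of the required headers would still raise a `KeyError` in this sketch, so callers may want to catch that.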
When running the full pipeline on the newest dumps (e.g. 2020-34), there seems to be an issue with the header format.
It only seems to occur on texts in non-Latin alphabets. Because of this issue, the hashing pipeline cannot be run on some newer dumps. The last dump I could process successfully was 2020-10.
Are there any quick fixes available for this problem?