facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License

Error when Running 2020-34 dumps #16

Open Phil1108 opened 3 years ago

Phil1108 commented 3 years ago

When running the full pipeline with the newest dumps (e.g. 2020-34), there seems to be an issue with the header file format.

It only seems to occur on texts in non-Latin alphabets. Because of this issue, the hashing pipeline cannot be run on some of the newer dumps. The last dump I could process successfully was 2020-10.

Are there any quick-fixes available for this problem?

  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/home/ubuntu/cc_net/cc_net/process_wet_file.py", line 98, in group_by_docs
    parsed = parse_doc(headers, doc)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
  File "/home/ubuntu/cc_net/cc_net/process_wet_file.py", line 70, in parse_doc
    logger.warning("Can't parse header:", e, headers, doc)
TypeError: not all arguments converted during string formatting
Call stack:
Traceback (most recent call last):
  File "/home/ubuntu/cc_net/cc_net/process_wet_file.py", line 68, in parse_doc
    length = int(headers[8].split()[1])
ValueError: invalid literal for int() with base 10: 'text/plain'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/home/ubuntu/anaconda3/lib/python3.7/logging/__init__.py", line 369, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
Message: "Can't parse header:"
Arguments: (ValueError("invalid literal for int() with base 10: 'text/plain'"), ['WARC/1.0', 'WARC-Type: conversion', 'WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=321000&planId=1622&trimId=147575&rd=0', 'WARC-Date: 2020-08-04T02:58:40Z', 'WARC-Record-ID: <urn:uuid:b941a87a-cb63-49f2-8fcb-792b4e90e803>', 'WARC-Refers-To: <urn:uuid:3360da81-ad19-498f-b94a-5f5e52dc5ef4>', 'WARC-Block-Digest: sha1:N2PD5RJ7SNBYO4IV27IGIPF5LO63UZQK', 'WARC-Identified-Content-Language: zho,eng', 'Content-Type: text/plain', 'Content-Length: 4476', ''], ['【扬州|沃尔沃(进口) XC90 2020款 T5四驱智行豪华版 5座】_贷款买车_零零购车|搜狐汽车', '搜狐汽车零零购车', '扬州', '首页 > 沃尔沃 XC90', '沃尔沃(进口) XC90', '2020款 T5四驱智行豪华版 5座', '更换车款', '2.0T涡轮增压 254马力', '2020款 T5四驱智行豪华版 7座', '2.0T涡轮+机械增压 310马力', '2020款 改款 T6四驱智逸豪华版 7座', '2020款 改款 T6四驱智逸运动版 7座', '2020款 改款 T6四驱智雅豪华版 7座', '2020款 改款 T6四驱智雅运动版 7座', '2020款 改款 T6四驱智尊豪华版 7座', '2.0T涡轮+机械增压 320马力', '2020款 T6四驱智逸豪华版 7座', '2020款 T6四驱智逸运动版 7座', '2020款 T6四驱智雅豪华版 7座', '2020款 T6四驱智雅运动版 7座', '2020款 T6四驱智尊豪华版 7座', '扬州 4S店均价:63.39万', '平安车管家', '一成首付 含购置税,送一年保险', '所需材料:身份证,房产证,六个月流水,还款卡', '申请资质:信用记录良好', '总花费:----元', '月供:----元', '首付--%', '----元', '月供----X--期', '月利率----%', '尾款--%', '----元', '首付:', '10%', '期限:', '48期', '立即申请', '办理时需另付保证金----万元(车款报价 X --%)总花费中不包含税费、保险。车款报价随市场行情随时波动。以上价格仅供参考,以实际合同为准。', '方案详情', '[产品优势]', '门槛低:一成首付,含购置税,送一年保险', '月供无压力:低月供,轻松还款无压力', '省时省心:一站式服务,海量新车,无忧上牌', '灵活选择:可买可退', '[套餐介绍]', '常见问题:', '1、平安车管家售后怎么样?', '与正常购车一致,在品牌方授权4S店进行维修保养', '2、提车城市?:', '免运费仅限一下提车城市:济南,南京,苏州,武汉,长沙,成都,昆明。郑州,东莞,南宁,南通。具体提车城市需根据车型售卖情况和活动决定,详情请咨询客服。', '3、关于上牌:', '合同期内,汽车上平安租赁的牌照,按照合同约定支付尾款,车辆将过户给到您。', '4、所需资料:', '10万<贷款额<=30万:二证一卡(身份证+房产证或半年以上银行流水+还款卡)', '贷款额>30万:三证一卡(身份证+房产证+半年以上银行流水+还款卡)', '备注:', '首付、月供金额仅供参考,实际贷款金额将包含购置税、保险等费用,详情可咨询购车顾问,联系电话:021-20662667', '购车流程', '提交申请', '电话回访', '提交材料', '审核通过', '提车上户', '按期还款', '常见问题', 'Q页面中的价格是如何计算的?', 'A页面中计算的首付额、月供额等信息,是以您提车城市的4S店平均报价为准计算的。此价格随市场行情随时波动,仅供参考。了解更精确的价格,您可在页面中填写您的联系方式,我们的工作人员会与您沟通。', 'Q办理分期购车有什么要求?', 
'A办理分期购车要求您是18岁以上的中国公民,具有一定的还款能力,且需要提交相应的证明材料。不同金融方案要求的资质不同,您可以在方案详情中查看具体的要求,找到最适合自己的金融方案。', 'Q申请贷款时,材料审核需要多久?', 'A零零购车不同的金融方案由于所需材料的不同,材料审核的时间也不同。所需材料齐全后,服务商会立即提交审核,大部分方案在2-24小时即可出审核结果,并第一时间放款。', 'Q车辆的上牌如何办理?', 'A在您的贷款申请通过审批,提车时,车辆的交税、上牌等业务会有专业的客服人员为您统一办理,让您不再被复杂的手续所困扰。', '看过XC90的还看过', '沃尔沃XC60', '月供 9254元起', '奔驰GLC级', '月供 10067元起', '奥迪Q7', '月供 17590元起', '大众途锐', '月供 16060元起', '沃尔沃V90 Cross Country', '月供 11319元起', '丰田普拉多', '月供 12617元起', '完善您的信息,车贷申请极速审核', '×', '姓名:', '请填写您的真实姓名', '手机号:', '请填写常用手机号码', '提车地:', '扬州', '请选择提车城市', '我已阅读并同意 《搜狐汽车隐私政策》', '请同意《搜狐汽车隐私政策》', '提交申请', '×', '您的申请已成功提交,我们会尽快处理!', '关于我们', 'Copyright © Sohu.com Inc. All Rights Reserved. 搜狐公司 版权所有', '免责声明 | 搜狐不良信息举报邮箱:jubao@contact.sohu.com', '客服:业务咨询、投诉建议', '279530178', '战略合作、代理商加盟010-61134396', '周一至周五 9:30-18:30', '反馈 顶部'])
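
A side note on the second exception in the trace: the `TypeError` appears to come from the logging call itself rather than from the header parsing. `logger.warning("Can't parse header:", e, headers, doc)` passes extra arguments but the message has no `%s` placeholders, so the `msg % self.args` step fails. A minimal sketch of the failure and a possible fix follows; the logger name and the `make_logger` helper are made up for this illustration and are not part of cc_net.

```python
import io
import logging


def make_logger(name: str, stream: io.StringIO) -> logging.Logger:
    """Hypothetical helper: a logger that writes into an in-memory stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.WARNING)
    logger.addHandler(logging.StreamHandler(stream))
    return logger


err = ValueError("invalid literal for int() with base 10: 'text/plain'")
headers = ["WARC/1.0", "Content-Type: text/plain"]

# What logging does internally with the original call: msg % args.
# With no placeholders in the message, %-formatting rejects the extra args.
try:
    "Can't parse header:" % (err, headers)
    formatting_error = None
except TypeError as exc:
    formatting_error = str(exc)

# Possible fix: give every argument its own placeholder.
stream = io.StringIO()
logger = make_logger("wet_demo", stream)
logger.warning("Can't parse header: %s, headers=%s", err, headers)
logged = stream.getvalue()
```

With placeholders in place, the warning is emitted normally and the original `ValueError` stays visible in the log.
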
gwenzek commented 3 years ago

Thanks for flagging. It seems that CC 2020-34 added a new header: "WARC-Identified-Content-Language". Instead of using a WARC library, I rolled my own simplified parser in https://github.com/facebookresearch/cc_net/blob/master/cc_net/process_wet_file.py#L57, specialized for the CC archives. I'll need to introduce something more robust here (maybe just use the library, but I have to be careful with paragraph numbering, otherwise I might break the CC100 script).
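
To illustrate why the extra header breaks positional parsing, here is a small sketch using the header list from the log above. The "old" layout is an assumption: it is reconstructed by simply dropping the new header, not taken from an actual 2020-10 file.

```python
# Header list as printed in the warning above (2020-34 layout).
new_headers = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=321000&planId=1622&trimId=147575&rd=0",
    "WARC-Date: 2020-08-04T02:58:40Z",
    "WARC-Record-ID: <urn:uuid:b941a87a-cb63-49f2-8fcb-792b4e90e803>",
    "WARC-Refers-To: <urn:uuid:3360da81-ad19-498f-b94a-5f5e52dc5ef4>",
    "WARC-Block-Digest: sha1:N2PD5RJ7SNBYO4IV27IGIPF5LO63UZQK",
    "WARC-Identified-Content-Language: zho,eng",
    "Content-Type: text/plain",
    "Content-Length: 4476",
    "",
]

# Assumed older layout: the same record without the new header.
old_headers = [
    h for h in new_headers
    if not h.startswith("WARC-Identified-Content-Language")
]

# The parser reads the length from a fixed position (index 8).
old_field = old_headers[8].split()[1]  # value of Content-Length
new_field = new_headers[8].split()[1]  # value of Content-Type: everything shifted by one

try:
    int(new_field)
    shift_error = None
except ValueError as exc:
    shift_error = str(exc)
```

In the assumed old layout, index 8 is `Content-Length`; with the new header inserted before it, index 8 becomes `Content-Type`, which matches the `ValueError` in the trace.
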

acul3 commented 3 years ago

@gwenzek this also happens with dumps 2020-24, 2020-29, 2020-40, 2020-45, and 2020-50.

chirico85 commented 2 years ago

2022-05 as well. Any news here?

shmpanski commented 1 year ago

You can replace https://github.com/facebookresearch/cc_net/blob/main/cc_net/process_wet_file.py#L73-L79 with

# Look headers up by name instead of by position.
headers_map = {}

# headers[0] is the "WARC/1.0" version line, so skip it.
for header in headers[1:]:
    if not header:
        continue
    key, value = header.split(": ", 1)
    headers_map[key] = value

# Only "conversion" records carry extracted text.
warc_type = headers_map["WARC-Type"]
if warc_type != "conversion":
    return None
url = headers_map["WARC-Target-URI"]
date = headers_map["WARC-Date"]
digest = headers_map["WARC-Block-Digest"]
length = int(headers_map["Content-Length"])

so that the newly added header is handled correctly.
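
For anyone who wants to try the idea outside of cc_net, here is a self-contained sketch of the same name-based lookup. The function name `parse_wet_headers` and the returned dictionary keys are hypothetical and do not match cc_net's actual `parse_doc` signature; it uses `str.partition` so that a line without a colon is skipped instead of raising.

```python
from typing import Dict, List, Optional, Union


def parse_wet_headers(headers: List[str]) -> Optional[Dict[str, Union[str, int]]]:
    """Hypothetical helper: read WET record headers by name, not by position."""
    fields: Dict[str, str] = {}
    # Skip the "WARC/1.0" version line, then split each line at the first colon.
    for line in headers[1:]:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        fields[key.strip()] = value.strip()

    # Only "conversion" records carry extracted text.
    if fields.get("WARC-Type") != "conversion":
        return None

    return {
        "url": fields["WARC-Target-URI"],
        "date": fields["WARC-Date"],
        "digest": fields["WARC-Block-Digest"],
        "length": int(fields["Content-Length"]),
    }


# Headers copied from the log in the original report.
sample = [
    "WARC/1.0",
    "WARC-Type: conversion",
    "WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=321000&planId=1622&trimId=147575&rd=0",
    "WARC-Date: 2020-08-04T02:58:40Z",
    "WARC-Record-ID: <urn:uuid:b941a87a-cb63-49f2-8fcb-792b4e90e803>",
    "WARC-Refers-To: <urn:uuid:3360da81-ad19-498f-b94a-5f5e52dc5ef4>",
    "WARC-Block-Digest: sha1:N2PD5RJ7SNBYO4IV27IGIPF5LO63UZQK",
    "WARC-Identified-Content-Language: zho,eng",
    "Content-Type: text/plain",
    "Content-Length: 4476",
    "",
]
parsed = parse_wet_headers(sample)
skipped = parse_wet_headers(["WARC/1.0", "WARC-Type: warcinfo", ""])
```

Because every field is looked up by name, any further headers added in later dumps would be ignored rather than shifting the positions of the fields that are actually used.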