Numerous Errors - Githubissues

conceptofmind commented 1 year ago

Hello,

Thank you for all of your great work. I am trying to just download and process the English dumps from CommonCrawl up to 2023. I have been running into multiple errors.

It seems that the base URL for downloading from Common Crawl has changed to: https://data.commoncrawl.org/

Some of the header names were changed as well. This fixed those errors:

        headers_map = {}

        for header in headers[1:]:
            if not header:
                continue
            key, value = header.split(": ", 1)
            headers_map[key] = value

        warc_type = headers_map["WARC-Type"]
        if warc_type != "conversion":
            return None
        url = headers_map["WARC-Target-URI"]
        date = headers_map["WARC-Date"]
        digest = headers_map["WARC-Block-Digest"]
        length = int(headers_map["Content-Length"])
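
For anyone who wants to try this outside the pipeline, the same header handling can be sketched as a standalone function. This is only an illustration under assumptions: `parse_wet_headers` is a hypothetical name (not a function in cc_net), and it assumes `headers` is the list of header lines for a single record, with the `WARC/1.0` version line first.

```python
from typing import Dict, List, Optional, Tuple


def parse_wet_headers(headers: List[str]) -> Optional[Tuple[str, str, str, int]]:
    """Hypothetical helper: return (url, date, digest, length) for one record.

    Returns None for records that are not of type "conversion".
    """
    headers_map: Dict[str, str] = {}

    # Skip the first line (the "WARC/1.0" version line), as in the snippet above.
    for header in headers[1:]:
        # Ignore blank lines and anything that is not a "Key: value" pair.
        if ": " not in header:
            continue
        key, value = header.split(": ", 1)
        headers_map[key] = value

    # In WET files, the extracted page text lives in "conversion" records.
    if headers_map.get("WARC-Type") != "conversion":
        return None

    return (
        headers_map["WARC-Target-URI"],
        headers_map["WARC-Date"],
        headers_map["WARC-Block-Digest"],
        int(headers_map["Content-Length"]),
    )
```

Looking the fields up by name instead of by position means the code keeps working if the header order changes again.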

Finally, I am running into this other issue:

requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-14/segments/1679296943471.24/wet/CC-MAIN-20230320083513-20230320113513-00114.warc.wet.gz

I have not been able to resolve this error yet.

Any help would be greatly appreciated.

Thank you,

Enrico

nbqu commented 1 year ago

I have a similar problem. Maybe it is caused by sending too many requests: I got a 'slow down' message when I opened the link that raised the error in my browser.
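
If the 503 really is rate limiting, retrying with exponential backoff might help. Below is a minimal sketch, not a fix in cc_net itself. `download_with_retry` and all of its parameters are hypothetical names, and it uses the standard library's `urllib` rather than `requests` so that it stays self-contained. The set of retryable status codes is also an assumption.

```python
import random
import time
import urllib.error
import urllib.request

# Assumption: these statuses are treated as temporary and worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def download_with_retry(
    url,
    open_url=urllib.request.urlopen,  # injectable so it can be replaced in tests
    max_attempts=5,
    base_delay=1.0,
    sleep=time.sleep,
):
    """Hypothetical helper: fetch `url`, backing off on temporary HTTP errors."""
    for attempt in range(max_attempts):
        try:
            with open_url(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            last_attempt = attempt == max_attempts - 1
            # Re-raise permanent errors, and give up after the last attempt.
            if err.code not in RETRYABLE_STATUS or last_attempt:
                raise
            # Wait base_delay * 1, 2, 4, ... seconds, plus random jitter so
            # that parallel workers do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Lowering the number of parallel downloads is probably worth trying as well, since backoff only helps if the total request rate actually drops.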

JorgeGF24 commented 1 year ago

I am trying to download the dataset to reproduce the results from the Toolformer paper. I have been struggling with this dataset for a while. Did you manage to solve the issue and get the data? Maybe by manually downloading the data, and skipping that step of the pipeline? @conceptofmind I am actually using your Toolformer repo for my research, thanks for that :)

facebookresearch / cc_net

Numerous Errors #45