Open conceptofmind opened 1 year ago
I have a similar problem; it may be caused by sending too many requests. I got a 'slow down' message when I opened the link that raised the error in my browser.
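If the failures really are rate limiting, retrying with exponential backoff may help. Below is a minimal sketch, not the repo's own download code. It assumes the 'slow down' response arrives as HTTP 429 or 503, which is a guess on my part. `fetch_with_backoff` and its parameters are hypothetical names, introduced only for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_retries=5, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying with exponential backoff when throttled.

    Assumption: throttling shows up as HTTP 429 or 503. Any other HTTP
    error is re-raised immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on non-throttling errors or once retries are exhausted.
            if err.code not in (429, 503) or attempt == max_retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `opener` and `sleep` arguments exist only so the function can be exercised without network access; in normal use the defaults are enough, e.g. `fetch_with_backoff(some_url)`.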
I am trying to download the dataset to reproduce the results from the Toolformer paper. I have been struggling with this dataset for a while. Did you manage to solve the issue and get the data? Maybe by manually downloading the data, and skipping that step of the pipeline? @conceptofmind I am actually using your Toolformer repo for my research, thanks for that :)
Hello,
Thank you for all of your great work. I am trying to just download and process the English dumps from CommonCrawl up to 2023. I have been running into multiple errors.
It seems the base URL for downloading from CommonCrawl has changed to:
https://data.commoncrawl.org/
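For reference, here is a minimal sketch of building full download URLs from that base. It assumes the pipeline works with relative paths of the kind listed in the crawl's path files. `CC_BASE_URL` and `cc_url` are hypothetical names, and the path in the usage note is only illustrative.

```python
from urllib.parse import urljoin

# Base URL reported above; keeping it in one constant makes it easy to
# update if the host changes again.
CC_BASE_URL = "https://data.commoncrawl.org/"


def cc_url(relative_path):
    """Return the full download URL for a path relative to the base."""
    # Strip any leading slash so the path is always joined onto the base.
    return urljoin(CC_BASE_URL, relative_path.lstrip("/"))
```

For example, `cc_url("crawl-data/CC-MAIN-2023-06/wet.paths.gz")` joins the base and the relative path into a single URL.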
Some of the header names were changed as well. This fixed those errors:
Finally, I am running into this other issue:
I have not been able to resolve this error yet.
Any help would be greatly appreciated.
Thank you,
Enrico