facebookresearch / LASER

Language-Agnostic SEntence Representations

Can't download: 403 error on some CC segments. #279

Open enn-nafnlaus opened 6 months ago

enn-nafnlaus commented 6 months ago

2024-02-14 21:01 INFO 2048692:root - Downloaded https://dl.fbaipublicfiles.com/laser/CCMatrix/v1.0.0/2020-10_0278.tsv.gz [200] took 8s (5766.4kB/s)
2024-02-14 21:01 INFO 2048692:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-10/segments/1581875145708.59/wet/CC-MAIN-20200222150029-20200222180029-00542.warc.wet.gz
2024-02-14 21:01 INFO 2048693:root - Downloaded https://dl.fbaipublicfiles.com/laser/CCMatrix/v1.0.0/2018-05_0044.tsv.gz [200] took 9s (5267.6kB/s)
2024-02-14 21:01 INFO 2048693:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886639.11/wet/CC-MAIN-20180116184540-20180116204540-00601.warc.wet.gz
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 138, in dl_file
    raw_documents = get_documents(segment)
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 107, in get_documents
    return {d["digest"]: d["raw_content"] for d in CCSegmentsReader([segment])}
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 107, in <dictcomp>
    return {d["digest"]: d["raw_content"] for d in CCSegmentsReader([segment])}
  File "/data/LLM_training/translation/cc_net/cc_net/process_wet_file.py", line 199, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/data/LLM_training/translation/cc_net/cc_net/process_wet_file.py", line 192, in open_segment
    return jsonql.open_remote_file(url, cache=file)
  File "/data/LLM_training/translation/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
    raw_bytes = request_get_content(url)
  File "/data/LLM_training/translation/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
    raise e
  File "/data/LLM_training/translation/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
    r.raise_for_status()
  File "/home/user/.local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00001.warc.wet.gz
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 338, in <module>
    func_argparse.main(dl, finalize)
  File "/home/user/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 29, in main
    return make_main(*fns, module=module, description=description)(sys.argv[1:])
  File "/home/user/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/data/LLM_training/translation/cc_net/dl_cc_matrix.py", line 103, in dl
    pool.map(dlf, file_list)
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00001.warc.wet.gz

avidale commented 6 months ago

Hi! If some CommonCrawl files no longer exist, I am not sure it would be easy to find them. Have you considered downloading CCMatrix from another source, such as https://opus.nlpl.eu/CCMatrix/corpus/version/CCMatrix?
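
For anyone who still wants the segments that do download, note that in the traceback above a single HTTPError raised inside a worker propagates through pool.map and ends the whole run. A minimal sketch of one way around that is to catch the exception per item and report failures at the end. The names safe_call, split_outcomes and fake_download are hypothetical and not part of dl_cc_matrix.py; fake_download only stands in for the real per-file function, and whether the script can be adapted this way is an assumption, since its source is not shown here.

```python
from functools import partial
from typing import Any, Callable, List, Optional, Tuple

# (item, result or None, error text or None)
Outcome = Tuple[Any, Optional[Any], Optional[str]]


def safe_call(func: Callable[[Any], Any], item: Any) -> Outcome:
    """Run func(item) and capture any exception instead of raising it."""
    try:
        return item, func(item), None
    except Exception as exc:  # e.g. requests.exceptions.HTTPError on a 403
        # Keep the error as a string so the outcome stays easy to pickle.
        return item, None, f"{type(exc).__name__}: {exc}"


def split_outcomes(outcomes: List[Outcome]):
    """Separate successful items from failed ones."""
    done = [(item, result) for item, result, err in outcomes if err is None]
    failed = [(item, err) for item, _result, err in outcomes if err is not None]
    return done, failed


def fake_download(url: str) -> int:
    """Hypothetical stand-in for the real download function."""
    if url.endswith("00001.warc.wet.gz"):
        raise RuntimeError("403 Client Error: Forbidden")
    return len(url)


if __name__ == "__main__":
    urls = ["segments/a-00000.warc.wet.gz", "segments/a-00001.warc.wet.gz"]
    outcomes = list(map(partial(safe_call, fake_download), urls))
    done, failed = split_outcomes(outcomes)
    print(f"{len(done)} downloaded, {len(failed)} failed")
    for url, err in failed:
        print(f"skipped {url}: {err}")
```

The same wrapper should also work with a process pool, for example pool.map(partial(safe_call, some_function), file_list), as long as some_function is defined at module level so it can be pickled. The list of failed URLs can then be retried later or looked up elsewhere.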