allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

Error loading data in summarization.py #229

Closed · yjqiu closed this issue 2 years ago

yjqiu commented 2 years ago

I tried to run scripts/summarization.py, but it failed to load the data. The error is below; it looks like the md5sum does not match the expected value.

Traceback (most recent call last):
  File "scripts/summarization.py", line 354, in <module>
    main(args)
  File "scripts/summarization.py", line 306, in main
    model.hf_datasets = nlp.load_dataset('scientific_papers', 'arxiv')
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/load.py", line 549, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/builder.py", line 463, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/builder.py", line 522, in _download_and_prepare
    self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/utils/info_utils.py", line 38, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
nlp.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?id=1b3rmCSIoh6VhD4HKWjI4HOW-cSwcwbeC&export=download', 'https://drive.google.com/uc?id=1lvsqvsFi3W-pE1SqNZI0s8NR9rC1tsja&export=download']

I then tried to skip the verification step by passing ignore_verifications=True, and another error occurred.
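For reference, a minimal reproduction of the modified call (the same call that appears at line 306 of scripts/summarization.py in the traceback below):

import nlp

# Skip the checksum verification that failed above
dataset = nlp.load_dataset('scientific_papers', 'arxiv', ignore_verifications=True)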

Traceback (most recent call last):
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/builder.py", line 810, in _prepare_split
    for key, record in utils.tqdm(generator, unit=" examples", total=split_info.num_examples, leave=False):
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/datasets/scientific_papers/9e4f2cfe3d8494e9f34a84ce49c3214605b4b52a3d8eb199104430d04c52cc12/scientific_papers.py", line 108, in _generate_examples
    with open(path, encoding="utf-8") as f:
NotADirectoryError: [Errno 20] Not a directory: '/home/username/.cache/huggingface/datasets/downloads/c0deae7af7d9c87f25dfadf621f7126f708d7dcac6d353c7564883084a000076/arxiv-dataset/train.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/summarization.py", line 354, in <module>
    main(args)
  File "scripts/summarization.py", line 306, in main
    model.hf_datasets = nlp.load_dataset('scientific_papers', 'arxiv', ignore_verifications=True)
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/load.py", line 549, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/builder.py", line 463, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/envs/longformer/lib/python3.7/site-packages/nlp/builder.py", line 539, in _download_and_prepare
    raise OSError("Cannot find data file. " + (self.manual_download_instructions or ""))
OSError: Cannot find data file.
yjqiu commented 2 years ago

It looks like you need to use load_dataset from the datasets library instead of the deprecated nlp package to resolve the problem.
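A minimal sketch of the working call, assuming the newer datasets package is installed (pip install datasets):

# `nlp` was renamed to `datasets`; the newer package presumably ships updated
# download URLs and checksums for scientific_papers, so both the checksum
# verification and the subsequent file loading succeed.
from datasets import load_dataset

hf_datasets = load_dataset('scientific_papers', 'arxiv')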