abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License
1.05k stars 77 forks

DatasetGenerationError #65

Closed pppyb closed 2 months ago

pppyb commented 2 months ago

Hi,

Thank you for this great effort.

When attempting to reproduce the results, I ran the following command:

python src/run.py \
    src/configs/training/base_training_args.json \
    src/configs/data/gov_report.json \
    --output_dir output_train_bart_base_local/ \
    --learning_rate 1e-5 \
    --model_name_or_path facebook/bart-base \
    --eval_steps 1000 --save_steps 1000 \
    --per_device_eval_batch_size 1 --per_device_train_batch_size 2 \
    --extra_metrics bertscore \
    --unlimiformer_training \
    --max_source_length 16384 \
    --test_unlimiformer --eval_max_source_length 999999 --do_eval=True \
    > output/output$(date +%s).txt

I encountered the following error:

Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/home/yibop/miniconda3/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 1750, in _prepare_split_single
    for key, record in generator:
  File "/home/yibop/.cache/huggingface/modules/datasets_modules/datasets/tau--sled/b5fab54723c8a515071f8b983dcb93519ae71beced5ad96f722cd22d91047229/sled.py", line 609, in _generate_examples
    for key, row in gen:
  File "/home/yibop/.cache/huggingface/modules/datasets_modules/datasets/tau--sled/b5fab54723c8a515071f8b983dcb93519ae71beced5ad96f722cd22d91047229/sled.py", line 520, in _scrolls_gen
    with open(data_file, encoding="utf-8") as f:
  File "/home/yibop/miniconda3/envs/unlimiformer/lib/python3.10/site-packages/datasets/streaming.py", line 75, in wrapper
    return function(*args, download_config=download_config, **kwargs)
  File "/home/yibop/miniconda3/envs/unlimiformer/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 1222, in xopen
    return open(main_hop, mode, *args, **kwargs)
NotADirectoryError: [Errno 20] Not a directory: '/home/yibop/.cache/huggingface/datasets/downloads/73eca96a974f65c46cdf67acc0d23b976b9c57ce310d35ad7cfda8b6dc67001d/gov_report/train.jsonl'

Traceback (most recent call last):
  File "/home/yibop/yibop/unlimiformer/src/run.py", line 1180, in <module>
    main()
  File "/home/yibop/yibop/unlimiformer/src/run.py", line 437, in main
    seq2seq_dataset = _get_dataset(data_args, model_args, training_args)
  File "/home/yibop/yibop/unlimiformer/src/run.py", line 943, in _get_dataset
    seq2seq_dataset = load_dataset(
  File "/home/yibop/miniconda3/envs/unlimiformer/lib/python3.10/site-packages/datasets/load.py", line 2616, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/miniconda3/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 1029, in download_and_prepare
    self._download_and_prepare(
  File "/home/miniconda3/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 1791, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/miniconda3/envs/unlimiformer/lib/python3.10/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/miniconda3/envs/unlimiformer/lib/python3.10/siteables/datasets/builder.py", line 1629, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/miniconda3/envs/unlimiformer/lib python3.10/site-packages/datasets/builder.py", line 1786, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
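
The inner NotADirectoryError comes from the dataset preparation step itself rather than from run.py, so the failure can be reproduced in isolation. A minimal sketch, assuming the tau/sled loading script and the gov_report subset named in the cached module path above:

# Reproduce the dataset load outside of the training script.
from datasets import load_dataset

# "tau/sled" and "gov_report" are taken from the cached module path in the traceback.
dataset = load_dataset("tau/sled", "gov_report")
print(dataset)  # should list the splits once the load succeeds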

I tried the common solution suggested online, deleting the entire dataset cache directory

~/.cache/huggingface/datasets

but it still doesn't work. Any ideas on what could be causing this issue?
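
A more targeted alternative to wiping the whole cache would be to remove only the entry named in the traceback and force a fresh download. A hedged sketch, assuming the download hash from the error above (paths are machine-specific):

import shutil
from pathlib import Path
from datasets import load_dataset

# The cached download named in the traceback; the hash is machine/version specific.
bad_entry = (Path.home() / ".cache/huggingface/datasets/downloads"
             / "73eca96a974f65c46cdf67acc0d23b976b9c57ce310d35ad7cfda8b6dc67001d")

# The entry may be a stray file or a directory; remove whichever is present.
if bad_entry.is_dir():
    shutil.rmtree(bad_entry)
elif bad_entry.exists():
    bad_entry.unlink()

# Re-fetch the archive instead of reusing any cached copy.
load_dataset("tau/sled", "gov_report", download_mode="force_redownload")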

pppyb commented 2 months ago

I contacted the authors of the dataset, and this issue has been resolved.