[BUG] ValueError when we use only one dataset

liu-jc commented 1 month ago

Bug report checklist

[x] I provided code that demonstrates a minimal reproducible example.
[x] I confirmed bug exists on the latest mainline of Chronos via source install.

Describe the bug

When I put a single dataset in the config file like the following:

# List of training data files
training_data_paths:
- "/path/to/kernelsynth-data.arrow"
# Mixing probability of each dataset file
probability:
- 1.0

I get a ValueError:

  File "/export/home/anaconda/envs/chronos/lib/python3.11/site-packages/accelerate/data_loader.py", line 631, in _fetch_batches
    batches.append(next(iterator))
                   ^^^^^^^^^^^^^^
  File "/export/home/anaconda/envs/chronos/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/export/home/anaconda/envs/chronos/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1326, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/anaconda/envs/chronos/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/export/home/anaconda/envs/chronos/lib/python3.11/site-packages/torch/_utils.py", line 705, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/export/home/anaconda/envs/chronos/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/export/home/anaconda/envs/chronos/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/chronos-forecasting/scripts/training/train.py", line 243, in __iter__
    for element in self.base_dataset:
  File "/export/home/chronos-forecasting/scripts/training/train.py", line 493, in __iter__
    idx = np.random.choice(range(len(iterators)), p=probs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "numpy/random/mtrand.pyx", line 951, in numpy.random.mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken

This happens because probs is an empty list (probs: [], iterables: []). I am not sure why it would be empty. I think this might be a bug, but I am not sure whether anyone else has faced the same issue.
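For reference, the final frame of the traceback can be reproduced in isolation. This is only a minimal sketch: `pick_iterator_index` is a hypothetical helper that mirrors the `np.random.choice` call shown at train.py line 493, not code taken from the repository.

```python
import numpy as np


def pick_iterator_index(probs):
    # Mirrors the call from the traceback: choose an index into the list
    # of iterators, weighted by the mixing probabilities.
    return np.random.choice(range(len(probs)), p=probs)


# With one dataset and probability [1.0], the only valid index is returned.
single = pick_iterator_index([1.0])

# With an empty list there is nothing to choose from, so NumPy raises.
try:
    pick_iterator_index([])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(single, error_message)
```

So the config itself is valid; the failure only appears once the list reaching this call is empty.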

Expected behavior

I think training should run without errors when only a single dataset is configured.

To reproduce

Full config:

context_length: 512
prediction_length: 64
min_past: 60
max_steps: 200_000
save_steps: 100_000
log_steps: 500
per_device_train_batch_size: 128
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 20
shuffle_buffer_length: 100_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-tiny
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: chronos_output/output-tiny_only_synth/
tf32: true
torch_compile: true
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 11
max_missing_prop: 0.9
use_eos_token: true
training_data_paths:
- "synth-data/kernelsynth-data.arrow"
probability:
- 1.0

Environment description

Operating system:
Python version: Python 3.11.5
PyTorch version: 2.3.1+cu121
HuggingFace transformers version: 4.41.2
HuggingFace accelerate version: 0.30.1

lostella commented 1 month ago

@liu-jc thanks for opening this, I could reproduce it. Looks like the culprit is

dataloader_num_workers: 11

If I set instead

dataloader_num_workers: 1

then everything runs fine. Maybe one check we should do internally is that the number of worker processes set does not exceed the number of datasets provided.
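A rough sketch of what such a check could look like, under the assumption that dataset files are split across DataLoader workers round-robin, so that any worker beyond the number of files ends up with an empty list. Both `shard_for_worker` and `cap_num_workers` are hypothetical names for illustration, not functions from train.py, and this is not necessarily how the actual fix is implemented.

```python
import warnings


def shard_for_worker(datasets, worker_id, num_workers):
    # Assumed round-robin split: worker i gets datasets i, i+n, i+2n, ...
    return datasets[worker_id::num_workers]


def cap_num_workers(requested_workers, num_datasets):
    # Hypothetical guard: never use more workers than there are datasets,
    # so that no worker is left with an empty shard.
    if requested_workers > num_datasets:
        warnings.warn(
            f"dataloader_num_workers={requested_workers} exceeds the number "
            f"of datasets ({num_datasets}); using {num_datasets} instead."
        )
        return num_datasets
    return requested_workers


datasets = ["synth-data/kernelsynth-data.arrow"]

# With 11 workers and one file, only worker 0 receives any data.
shards = [shard_for_worker(datasets, w, 11) for w in range(11)]
empty_workers = [w for w, shard in enumerate(shards) if not shard]

# After capping, every remaining worker has something to iterate over.
safe_workers = cap_num_workers(11, len(datasets))
print(empty_workers, safe_workers)
```

Under this assumption, workers 1 through 10 would hold no datasets, which would be consistent with the error being raised in "DataLoader worker process 1" in the traceback above.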

liu-jc commented 1 month ago

Hi @lostella,

Thanks for the quick reply! I also just found that it works fine with more datasets. I suspected it might be a num_workers problem, but I hadn't tried reducing it. Thanks for confirming the solution.

lostella commented 1 month ago

@liu-jc could you confirm that #157 is the required fix? Thanks!

liu-jc commented 1 month ago

Hi @lostella,

I tested it. It works with your fix :)

amazon-science / chronos-forecasting

[BUG] ValueError when we use only one dataset #154