amazon-science / chronos-forecasting

Chronos: Pretrained (Language) Models for Probabilistic Time Series Forecasting
https://arxiv.org/abs/2403.07815
Apache License 2.0

[BUG] EOF error in pickle when reading arrow file #155

Open AvisP opened 3 months ago

AvisP commented 3 months ago

Bug report checklist

Describe the bug An error occurs when executing the training script on a dataset generated using the process mentioned here. The data files used can be downloaded from here. This issue is similar to https://github.com/amazon-science/chronos-forecasting/issues/149. The error message is shown below:

D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
  warnings.warn(
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Using SEED: 3565056063
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Logging dir: output\run-7
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Loading and filtering 2 datasets for training: ['D://Chronos-Finetune//noise-data.arrow', 'D://Chronos-Finetune//kernelsynth-data.arrow']
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Mixing probabilities: [0.9, 0.1]
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Initializing model
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Using random initialization
max_steps is given, it will override any value given in num_train_epochs
2024-07-22 23:38:16,324 - D:\Chronos-Finetune\train.py - INFO - Training
  0%|          | 0/200000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "D:\Chronos-Finetune\train.py", line 692, in <module>
    app()
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 326, in __call__
    raise e
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 661, in main
    return _main(
           ^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 193, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 692, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer_config\decorators.py", line 92, in wrapped
    return cmd(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\train.py", line 679, in main
    trainer.train()
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 1932, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 2230, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\accelerate\data_loader.py", line 671, in __iter__
    main_iterator = super().__iter__()
                    ^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in __init__
    w.start()
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "<stringsource>", line 2, in pyarrow.lib._RecordBatchFileReader.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
  0%|          | 0/200000 [00:00<?, ?it/s]

(Chronos_venv) D:\Chronos-Finetune>D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
  warnings.warn(
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input

Expected behavior Training/fine-tuning should proceed smoothly.

To reproduce

  1. Download the data files from here
  2. Change the path location in config script for the data files
  3. Run the training script using python train.py --config chronos-t5-small.yaml

Environment description

  - Operating system: Windows 11
  - CUDA version: 12.4
  - NVCC version: cuda_12.3.r12.3/compiler.33567101_0
  - PyTorch version: 2.3.1+cu121
  - HuggingFace transformers version: 4.42.4
  - HuggingFace accelerate version: 0.32.1

lostella commented 3 months ago

This seems relevant, see also the first answer here.

TLDR: we probably need to add freeze_support() after if __name__ == "__main__": in the training script
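A minimal sketch of the proposed guard, following the pattern from the Python multiprocessing docs. Here `main()` is a stand-in for the Typer `app()` call in `train.py`, not the actual code:

```python
# Sketch (not the actual train.py): on Windows, multiprocessing uses the
# "spawn" start method, so a script that starts worker processes should guard
# its entry point. freeze_support() is a no-op outside frozen Windows
# executables, but the Python docs recommend calling it first under the guard.
from multiprocessing import freeze_support

def main():
    # stand-in for the Typer `app()` entry point in train.py
    return "training entry point"

if __name__ == "__main__":
    freeze_support()
    main()
```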

lostella commented 3 months ago

@AvisP could you check if the fix proposed in #156 makes it work for you?

AvisP commented 3 months ago

@lostella I added freeze_support() after this line, but it is not working. In the example from the Python website that you shared, there is a call to Process, which I don't see happening in the training code; maybe freeze_support() needs to be inserted before that?

I also tried installing WSL on Windows and running the script from there, but unfortunately that is not working properly either.

abdulfatir commented 3 months ago

@AvisP can you share the exact config.yaml that you're using?

AvisP commented 3 months ago

Sure, here it is. I also tried with two datasets, setting the probabilities to 0.9 and 0.1.

training_data_paths:
# - "D://Chronos-Finetune//noise-data.arrow"
- "D://Chronos-Finetune//kernelsynth-data.arrow"
probability:
- 1.0
# - 0.1
context_length: 512
prediction_length: 64
min_past: 60
max_steps: 200_000
save_steps: 100_000
log_steps: 500
per_device_train_batch_size: 32
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 20
shuffle_buffer_length: 100_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/
tf32: true
torch_compile: true
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true

abdulfatir commented 3 months ago

The config looks okay to me. Could you make the following changes and try again?

  - Use just the one KernelSynth dataset, as you already have.
  - Set dataloader_num_workers: 0 in the config.

AvisP commented 3 months ago

It is running now after making these two changes. Does setting dataloader_num_workers to 0 slow down the data loading process? I will try out the evaluation script next. Thanks for your time!

abdulfatir commented 3 months ago

@AvisP This looks like a Windows multiprocessing issue. Setting dataloader_num_workers=0 may lead to some loss in training speed.
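The underlying failure mode can be illustrated with a minimal sketch (plain Python, not Chronos code): with num_workers > 0 under the "spawn" start method, each DataLoader worker receives a pickled copy of the dataset, and objects wrapping OS-level handles (such as pyarrow's `RecordBatchFileReader`, or an ordinary file object as used here) cannot be pickled:

```python
# Illustration of the pickling failure behind the traceback above: the parent
# process raises TypeError while serializing the dataset for a spawned worker,
# and the half-started child then dies with "EOFError: Ran out of input".
import pickle
import tempfile

def is_picklable(obj) -> bool:
    """Return True if `obj` survives pickle.dumps, False on TypeError."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

with tempfile.TemporaryFile() as fh:
    handle_ok = is_picklable(fh)   # False: open file handles can't be pickled
data_ok = is_picklable([1, 2, 3])  # True: plain data pickles fine

print(handle_ok, data_ok)
```

With num_workers=0 no worker process is spawned, so nothing is pickled and the error disappears, at the cost of loading data on the main process.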

RemiKalbe commented 5 days ago

I'm having this issue on macOS; setting dataloader_num_workers=0 does "fix" it.

The only difference is that it crashes at:

TypeError: no default __reduce__ due to non-trivial __cinit__

abdulfatir commented 4 days ago

@RemiKalbe did you convert the dataset into the GluonTS-style Arrow format correctly, as described in the README?