AvisP opened this issue 3 months ago
@AvisP could you check if the fix proposed in #156 makes it work for you?
@lostella I added freeze_support() after this line, but it is not working. In the example from the Python website that you shared, there is a call to Process, which I don't see happening in the training code; maybe it needs to be inserted before that?
I also tried to get WSL on Windows and run the training from there, but unfortunately it is not working properly either.
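For concreteness, here is a minimal sketch of where such a guard would normally go (main() is only a placeholder here, not the actual Chronos entry point). The Process call does not have to appear in the training code itself, because the DataLoader spawns its worker processes internally once dataloader_num_workers > 0; the guard just has to wrap whatever starts training.

# Minimal sketch, assuming main() stands in for the script's training entry point.
# On Windows, worker processes are started by re-importing this module, so
# everything that kicks off training must sit behind the __main__ guard;
# freeze_support() is only strictly needed for frozen executables but is harmless.
from multiprocessing import freeze_support


def main():
    ...  # build the dataset, model, and trainer, then run training


if __name__ == "__main__":
    freeze_support()
    main()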
@AvisP can you share the exact config.yaml that you're using?
Sure, here it is. I also tried with two datasets, setting probability to 0.9, 0.1.
training_data_paths:
# - "D://Chronos-Finetune//noise-data.arrow"
- "D://Chronos-Finetune//kernelsynth-data.arrow"
probability:
- 1.0
# - 0.1
context_length: 512
prediction_length: 64
min_past: 60
max_steps: 200_000
save_steps: 100_000
log_steps: 500
per_device_train_batch_size: 32
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 20
shuffle_buffer_length: 100_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/
tf32: true
torch_compile: true
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true
Config looks okay to me. Could you make the following two changes and try again?

torch_compile: false
dataloader_num_workers: 0

Let's use just the one KernelSynth dataset, like you have.
It is running now after making these two changes. Does setting dataloader_num_workers to 0 cause any slowdown of the data loading process? I will try out the evaluation script next. Thanks for your time!
@AvisP This looks like a multiprocessing-on-Windows issue. Setting dataloader_num_workers=0 may lead to some loss in training speed.
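Roughly speaking (toy sketch below, not Chronos code): with num_workers=0 batches are prepared in the same process that runs the forward/backward pass, so loading and training cannot overlap; with num_workers>0 separate worker processes prefetch batches, which is usually faster but requires everything the dataset holds to be picklable on platforms that spawn workers (Windows, macOS).

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset, only to illustrate the num_workers trade-off.
dataset = TensorDataset(torch.randn(1024, 16))

# num_workers=0: batches are built in the main process, in between training steps.
loader_single = DataLoader(dataset, batch_size=32, num_workers=0)

# num_workers=2: two worker processes prefetch batches in the background.
loader_prefetch = DataLoader(dataset, batch_size=32, num_workers=2)

for (batch,) in loader_single:
    pass  # a training step would run here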
I'm having this issue on macOS; setting dataloader_num_workers=0 does "fix" it.
The only difference is that it crashes at:
TypeError: no default __reduce__ due to non-trivial __cinit__
@RemiKalbe did you convert the dataset into GluonTS-style arrow format correctly, as described in the README?
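For reference, the conversion is roughly along these lines (a sketch assuming GluonTS is installed; convert_to_arrow here is illustrative, not necessarily the exact helper from the repo): each series becomes a record with a start timestamp and a target array, and the records are written out with GluonTS's ArrowWriter.

from pathlib import Path
from typing import List, Union

import numpy as np
from gluonts.dataset.arrow import ArrowWriter


def convert_to_arrow(path: Union[str, Path], time_series: List[np.ndarray]):
    # One record per series: a start timestamp plus the raw target values.
    start = np.datetime64("2000-01-01 00:00", "s")
    dataset = [{"start": start, "target": ts} for ts in time_series]
    # GluonTS writes the records into an arrow file that the training script can read.
    ArrowWriter(compression="lz4").write_to_file(dataset, path=Path(path))


# Example: 20 random series of length 1024.
convert_to_arrow("./example-data.arrow", [np.random.randn(1024) for _ in range(20)])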
Bug report checklist
Describe the bug
An error happens when executing the training script on a dataset generated using the process mentioned here. Data files used can be downloaded from here. The issue is similar to https://github.com/amazon-science/chronos-forecasting/issues/149. The error message is shown below.
Expected behavior
Training/fine-tuning should proceed smoothly.
To reproduce
python train.py --config chronos-t5-small.yaml
Environment description
Operating system: Windows 11
CUDA version: 12.4
NVCC version: cuda_12.3.r12.3/compiler.33567101_0
PyTorch version: 2.3.1+cu121
HuggingFace transformers version: 4.42.4
HuggingFace accelerate version: 0.32.1