allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

OLMoThreadError #591

Closed: lecifire closed this issue 3 months ago

lecifire commented 4 months ago

🐛 Describe the bug

I tried training OLMo-1B with the train script on Azure ML, on an NC96ads cluster of A100s (2 nodes, 4 GPUs each). Training fails with the error below after about 6 steps, and I would appreciate help debugging it. I chose not to use FlashAttention and did not modify the batch size or micro-batch size in the config YAML file.

Traceback (most recent call last):
  File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 300, in <module>
    main(cfg)
  File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 272, in main
    trainer.fit()
  File "/workspace/OLMo/olmo/train.py", line 1053, in fit
    for batch in self.train_loader:
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/workspace/OLMo/olmo/data/iterable_dataset.py", line 177, in <genexpr>
    return (x for x in roundrobin(*thread_generators))
  File "/workspace/OLMo/olmo/util.py", line 695, in roundrobin
    yield next()
  File "/workspace/OLMo/olmo/util.py", line 679, in threaded_generator
    raise OLMoThreadError(f"generator thread {thread_name} failed") from x
olmo.exceptions.OLMoThreadError: generator thread data thread 3 failed
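
For context on where this error comes from: the data pipeline runs per-source generators in worker threads, and when one of those threads dies, its exception is re-raised on the main thread as OLMoThreadError. Below is a minimal sketch of that queue-backed pattern (illustrative only, not the repository's actual olmo/util.py code; the GeneratorThreadError name and queue size are made up for the example):

import queue
import threading
from typing import Generator, Iterable


class GeneratorThreadError(Exception):
    """Illustrative stand-in for olmo.exceptions.OLMoThreadError."""


def threaded_generator(g: Iterable, maxsize: int = 16, thread_name: str = "data thread") -> Generator:
    # A worker thread fills a bounded queue; the consumer drains it on the main thread.
    q: queue.Queue = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def fill_queue() -> None:
        try:
            for item in g:
                q.put(item)
        except Exception as e:
            # Any failure inside the worker (e.g. a slow or failed data read)
            # is handed to the consumer instead of being swallowed.
            q.put(e)
        finally:
            q.put(sentinel)

    threading.Thread(target=fill_queue, name=thread_name, daemon=True).start()

    for item in iter(q.get, sentinel):
        if isinstance(item, Exception):
            # This is the point that surfaces as "generator thread ... failed".
            raise GeneratorThreadError(f"generator thread {thread_name} failed") from item
        yield item

So the OLMoThreadError itself is only a wrapper; the interesting failure is whatever exception occurred inside the worker thread that was reading the data.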

Versions

Using the provided docker base image.

ys-2020 commented 4 months ago

I also ran into the same issue. I trained OLMo-1B with the provided config files. The batch-size settings are:

global_train_batch_size: 2048
device_train_microbatch_size: 8

I used 8 NVIDIA A100 GPUs on a single node. My FlashAttention version is flash_attn-2.5.9.post1+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.
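
As a rough sanity check on what those settings imply per optimizer step (assuming the usual relationship global batch = micro-batch * world size * gradient-accumulation steps, which I believe is how the trainer derives it; the variable names below just mirror the config keys):

# Back-of-the-envelope check of the settings above
# (assumes global batch = micro-batch * world size * gradient-accumulation steps).
global_train_batch_size = 2048
device_train_microbatch_size = 8
world_size = 8  # one node with 8 A100s

grad_accum_steps = global_train_batch_size // (device_train_microbatch_size * world_size)
print(grad_accum_steps)  # 32 micro-batches accumulated per optimizer step

So each optimizer step streams 2048 sequences through the dataloader, which makes the data path fairly sensitive to storage throughput.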

The bug seems to appear randomly. I ran the training 3 times with the command SCRATCH_DIR=<my-specific-path> torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml, and the error appeared at the 9th, 3rd, and 5th step, respectively.

I am wondering if anyone could help. Thanks.

lecifire commented 4 months ago

Hi, I managed to resolve the issue. It turned out that the dataloader could not fetch the dataset tokens fast enough during training, which caused the thread handler to fail with the error above. We migrated our dataset to the same Azure region as the cluster and also upgraded the storage account to premium blob for lower latency, and that seems to have resolved the problem.
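
For anyone who hits this later, a rough way to check whether storage latency is the culprit is to time raw reads from one of the token files the dataloader memory-maps. The sketch below is illustrative only: the path is hypothetical, and the uint16 dtype and 2048-token chunk length are assumptions about how the token files and sequence length are typically configured.

import time

import numpy as np

data_path = "/path/to/your/tokens/part-000.npy"  # hypothetical path; use a file from your config's data paths
chunk_len = 2048   # assumed tokens per training sequence
num_chunks = 5000  # arbitrary sample size

arr = np.memmap(data_path, dtype=np.uint16, mode="r")  # dtype is an assumption; adjust to match your data

start = time.perf_counter()
for i in range(num_chunks):
    offset = (i * chunk_len) % (len(arr) - chunk_len)
    _ = int(arr[offset : offset + chunk_len].sum())  # force the pages to actually be read
elapsed = time.perf_counter() - start
print(f"read {num_chunks} chunks of {chunk_len} tokens in {elapsed:.2f}s")

If reads like this are very slow from the training nodes, that is consistent with the storage-latency explanation above.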