Closed: lecifire closed this issue 3 months ago
I also met the same issue. I trained OLMo-1B with the provided config files. The batch size is:
global_train_batch_size: 2048
device_train_microbatch_size: 8
I used 8 NVIDIA A100 GPUs within one node. My flash attention version is flash_attn-2.5.9.post1+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64
It seems that the bug appears randomly. I tried training 3 times with the command SCRATCH_DIR=<my-specific-path> torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml, and the error appeared at the 9th, 3rd, and 5th step, respectively.
I am wondering if anyone could help. Thanks.
Hi, I managed to resolve the issue. It appears the dataloader could not fetch dataset tokens fast enough during training, which caused an error in the thread handler. We migrated our dataset to the same region as the cluster and upgraded it to premium blob storage for lower latency, and that seems to have resolved our problem.
🐛 Describe the bug
I tried training OLMo-1B using the train script on Azure ML, on an NC96ads cluster of A100s consisting of 2 nodes with 4 GPUs each. I encountered the following error after about 6 steps and would need help debugging it. I chose not to use flash attention and did not modify the batch size or micro-batch size from the config YAML file.
Traceback (most recent call last):
File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 300, in <module>
main(cfg)
File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 272, in main
trainer.fit()
File "/workspace/OLMo/olmo/train.py", line 1053, in fit
for batch in self.train_loader:
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/OLMo/olmo/data/iterable_dataset.py", line 177, in <genexpr>
return (x for x in roundrobin(*thread_generators))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/OLMo/olmo/util.py", line 695, in roundrobin
yield next()
^^^^^^
File "/workspace/OLMo/olmo/util.py", line 679, in threaded_generator
raise OLMoThreadError(f"generator thread {thread_name} failed") from x
olmo.exceptions.OLMoThreadError: generator thread data thread 3 failed
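For context on the traceback: the "generator thread ... failed" line comes from a pattern of wrapping data readers in background threads and interleaving them round-robin. The following is a simplified sketch of that pattern (not OLMo's actual implementation; GeneratorThreadError and the function bodies here are illustrative). It shows how an exception in a worker thread, such as a slow or failed storage read, surfaces as exactly this error on the training-loop side:

```python
# Sketch of the threaded-generator + roundrobin pattern from the traceback.
# A background thread fills a bounded queue; any exception it hits is
# re-raised on the consumer side as a wrapper error.
import threading
import queue
from itertools import cycle, islice


class GeneratorThreadError(Exception):
    """Illustrative stand-in for olmo.exceptions.OLMoThreadError."""


def threaded_generator(source, maxsize=8, thread_name="data thread"):
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def worker():
        try:
            for item in source:
                q.put(item)
        except Exception as exc:
            # Forward the worker's failure to the consumer.
            q.put(exc)
            return
        q.put(sentinel)

    threading.Thread(target=worker, name=thread_name, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        if isinstance(item, Exception):
            # Mirrors the "generator thread ... failed" error path.
            raise GeneratorThreadError(
                f"generator thread {thread_name} failed"
            ) from item
        yield item


def roundrobin(*iterables):
    # Standard itertools recipe: interleave items from each iterable.
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))
```

Because the worker's exception is re-raised from the consumer's next() call, a transient I/O failure in any one data thread aborts the whole batch iterator, which is consistent with the fix above: moving the dataset to lower-latency storage removes the read failures that were killing the threads.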
Versions
Using the docker.base image.