allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

OLMoThreadError #552

Open juripapay opened 5 months ago

juripapay commented 5 months ago

❓ The question

Please advise where this error might come from:

[2024-04-18 19:06:17] INFO [olmo.train:816, rank=0] [step=75/739328] train/CrossEntropyLoss=7.417 train/Perplexity=1,664 throughput/total_tokens=314,572,800 throughput/device/tokens_per_second=9,407 throughput/device/batches_per_second=0.0022
[2024-04-18 19:10:41] CRITICAL [olmo.util:158, rank=0] Uncaught OLMoThreadError: generator thread data thread 3 failed
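
For context, the OLMoThreadError itself is only a wrapper: olmo/util.py drains each data generator on a background thread and, if that thread dies, re-raises a generic "generator thread ... failed" error in the main process. Below is a rough sketch of that pattern, not the actual OLMo implementation; the names and queue size are illustrative only.

import queue
import threading
from typing import Generator, TypeVar

T = TypeVar("T")

class OLMoThreadError(Exception):
    pass

def threaded_generator(g: Generator[T, None, None], maxsize: int = 16,
                       thread_name: str = "data thread") -> Generator[T, None, None]:
    # Drain `g` on a background thread, handing items to the consumer via a queue.
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def fill_queue() -> None:
        try:
            for item in g:
                q.put(item)
        except Exception as e:
            # The *real* failure (I/O error, bad shard, worker crash, ...) lands here.
            q.put(e)
        finally:
            q.put(sentinel)

    threading.Thread(target=fill_queue, name=thread_name, daemon=True).start()

    for x in iter(q.get, sentinel):
        if isinstance(x, Exception):
            # This is the message users see; the original cause is chained via `from x`.
            raise OLMoThreadError(f"generator thread {thread_name} failed") from x
        yield x

The practical implication is that the line shown in the log is not the root cause; the exception chained from the worker thread is.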

prakamya-mishra commented 5 months ago

@juripapay, can you give more details on the model size, batch size, GPU type (AMD/NVIDIA), and whether you used flash attention? I would like to know in which setting you are getting a throughput of 9k tokens/GPU/sec.

dumitrac commented 5 months ago

@juripapay - is there a traceback logged after the last line you pasted? I would expect it to log the traceback info, based on this.
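To add to that: because the wrapper error is raised with "from x", Python chains the worker thread's original exception as __cause__, so a full logged traceback should show the real failure above a "The above exception was the direct cause of the following exception" separator. A generic illustration of that behavior follows; the FileNotFoundError is just a stand-in, not the actual cause in this issue.

import traceback

def worker():
    # Stand-in for whatever actually failed inside the data thread.
    raise FileNotFoundError("hypothetical unreadable token shard")

try:
    try:
        worker()
    except Exception as x:
        raise RuntimeError("generator thread data thread 3 failed") from x
except RuntimeError:
    # Prints both tracebacks, joined by
    # "The above exception was the direct cause of the following exception:"
    traceback.print_exc()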

lecifire commented 4 months ago

Hi, I encountered the same problem and would need some assistance on how to resolve it.

I tried training the OLMo-1B model and didn't change much in the config YAML:

global_train_batch_size: 2048
device_train_microbatch_size: 8

My GPUs were A100s, on 2 nodes with 4 GPUs each (Azure NC96ads cluster), and I didn't use flash attention.

Traceback (most recent call last):
  File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 300, in <module>
    main(cfg)
  File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 272, in main
    trainer.fit()
  File "/workspace/OLMo/olmo/train.py", line 1053, in fit
    for batch in self.train_loader:
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/OLMo/olmo/data/iterable_dataset.py", line 177, in <genexpr>
    return (x for x in roundrobin(*thread_generators))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/OLMo/olmo/util.py", line 695, in roundrobin
    yield next()
          ^^^^^^
  File "/workspace/OLMo/olmo/util.py", line 679, in threaded_generator
    raise OLMoThreadError(f"generator thread {thread_name} failed") from x
olmo.exceptions.OLMoThreadError: generator thread data thread 3 failed
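
Since the wrapper hides the worker's failure, one way to narrow this down is to read the token shards directly in a single process and see which one fails. The sketch below assumes the shards are flat memmap-able token files listed under data.paths in the YAML and stored as uint16 tokens; both the config layout and the dtype are assumptions, so adjust them to match your setup (remote/blob paths would need to be mounted or downloaded first).

import sys

import numpy as np
import yaml

# Hypothetical checker: confirm every data shard in the config can actually be read.
config_path = sys.argv[1] if len(sys.argv) > 1 else "configs/official/OLMo-1B.yaml"
with open(config_path) as f:
    cfg = yaml.safe_load(f)

paths = cfg["data"]["paths"]   # assumed config layout
dtype = np.uint16              # assumed token dtype

for p in paths:
    try:
        arr = np.memmap(p, dtype=dtype, mode="r")
        _ = arr[:16].copy()    # force a real read from disk / mounted storage
        print(f"OK   {p} ({arr.size:,} tokens)")
    except Exception as e:     # this is the kind of error the data thread swallows
        print(f"FAIL {p}: {type(e).__name__}: {e}")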

ys-2020 commented 4 months ago

I also ran into the same issue. I trained OLMo-1B with the provided config files. The batch size is:

global_train_batch_size: 2048
device_train_microbatch_size: 8

I used 8 NVIDIA A100 GPUs within one node. My flash attention version is flash_attn-2.5.9.post1+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64
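
As a side note on these settings, the short calculation below (illustrative arithmetic only, not OLMo code) shows that with a global batch of 2048 sequences, a per-device microbatch of 8, and 8 GPUs, each device accumulates 32 microbatches per optimizer step, so a single step is quite heavy.

# Illustrative arithmetic: microbatches accumulated per optimizer step.
global_train_batch_size = 2048
device_train_microbatch_size = 8
world_size = 8  # 8 x A100 (one node here, or 2 nodes x 4 GPUs in the Azure setup above)

per_device_batch = global_train_batch_size // world_size                # 256 sequences per GPU
grad_accum_steps = per_device_batch // device_train_microbatch_size     # 32 microbatches per step
print(per_device_batch, grad_accum_steps)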

It seems that the bug appears randomly. I tried the training 3 times with the command SCRATCH_DIR=<my-specific-path> torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml, and the error appeared at the 9th, 3rd, and 5th step, respectively.

I am wondering if anyone could give some advice. Thanks.