I used the default config `configs/official/OLMo-1B.yaml` with only the wandb section removed, and trained on 8×A800 GPUs by running `torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml`.
The first few steps train without errors, but after roughly 10 minutes the run crashes with the error below.
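For reference, the exact launch command (nothing else was changed besides deleting the wandb block from the config):

```bash
# single node, 8 GPUs; stock OLMo-1B recipe with the wandb section removed
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml
```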
[2024-08-18 12:26:53] INFO [olmo.train:966, rank=0] [step=1/739328,epoch=0]
optim/total_grad_norm=9.355
train/CrossEntropyLoss=11.35
train/Perplexity=84,618
throughput/total_tokens=4,194,304
throughput/total_training_Gflops=6,215,264
throughput/total_training_log_Gflops=15.64
System/Peak GPU Memory (MB)=42,083
[2024-08-18 12:28:06] INFO [olmo.train:966, rank=0] [step=2/739328,epoch=0]
optim/total_grad_norm=59.20
train/CrossEntropyLoss=10.57
train/Perplexity=38,880
throughput/total_tokens=8,388,608
throughput/total_training_Gflops=12,430,528
throughput/total_training_log_Gflops=16.34
throughput/device/tokens_per_second=20,592
throughput/device/batches_per_second=0.0393
System/Peak GPU Memory (MB)=43,259
[2024-08-18 12:29:21] INFO [olmo.train:966, rank=0] [step=3/739328,epoch=0]
optim/total_grad_norm=28.37
train/CrossEntropyLoss=10.70
train/Perplexity=44,302
throughput/total_tokens=12,582,912
throughput/total_training_Gflops=18,645,793
throughput/total_training_log_Gflops=16.74
throughput/device/tokens_per_second=10,506
throughput/device/batches_per_second=0.0200
[2024-08-18 12:30:37] INFO [olmo.train:966, rank=0] [step=4/739328,epoch=0]
optim/total_grad_norm=9.530
train/CrossEntropyLoss=11.06
train/Perplexity=63,518
throughput/total_tokens=16,777,216
throughput/total_training_Gflops=24,861,057
throughput/total_training_log_Gflops=17.03
throughput/device/tokens_per_second=8,949
throughput/device/batches_per_second=0.0171
[2024-08-18 12:31:54] INFO [olmo.train:966, rank=0] [step=5/739328,epoch=0]
optim/total_grad_norm=17.78
train/CrossEntropyLoss=10.69
train/Perplexity=44,126
throughput/total_tokens=20,971,520
throughput/total_training_Gflops=31,076,321
throughput/total_training_log_Gflops=17.25
throughput/device/tokens_per_second=8,271
throughput/device/batches_per_second=0.0158
[2024-08-18 12:33:13] INFO [olmo.train:966, rank=0] [step=6/739328,epoch=0]
optim/total_grad_norm=7.477
train/CrossEntropyLoss=10.23
train/Perplexity=27,666
throughput/total_tokens=25,165,824
throughput/total_training_Gflops=37,291,586
throughput/total_training_log_Gflops=17.43
throughput/device/tokens_per_second=7,890
throughput/device/batches_per_second=0.0151
[2024-08-18 12:33:54] CRITICAL [olmo.util:163, rank=5] Uncaught OLMoThreadError: generator thread data thread 0 failed
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/olmo/util.py:709 in fill_queue │
│ │
│ 706 │ │
│ 707 │ def fill_queue(): │
│ 708 │ │ try: │
│ ❱ 709 │ │ │ for value in g: │
│ 710 │ │ │ │ q.put(value) │
│ 711 │ │ except Exception as e: │
│ 712 │ │ │ q.put(e) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:174 in <genexpr> │
│ │
│ 171 │ │ │ │
│ 172 │ │ │ thread_generators = [] │
│ 173 │ │ │ for i in range(num_threads): │
│ ❱ 174 │ │ │ │ generator = (self._get_dataset_item(int(idx)) for idx in indices[i::num_th │
│ 175 │ │ │ │ thread_generators.append( │
│ 176 │ │ │ │ │ threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│ 177 │ │ │ │ ) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:184 in _get_dataset_item │
│ │
│ 181 │ │ │ return (self._get_dataset_item(int(idx)) for idx in indices) │
│ 182 │ │
│ 183 │ def _get_dataset_item(self, idx: int) -> Dict[str, Any]: │
│ ❱ 184 │ │ item = self.dataset[idx] │
│ 185 │ │ if isinstance(item, dict): │
│ 186 │ │ │ return dict(**item, index=idx) │
│ 187 │ │ else: │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:196 in __getitem__ │
│ │
│ 193 │ │ │ raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}") │
│ 194 │ │ │
│ 195 │ │ # Read the data from file. │
│ ❱ 196 │ │ input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_ │
│ 197 │ │ out: Dict[str, Any] = {"input_ids": input_ids} │
│ 198 │ │ if self.instance_filter_config is not None: │
│ 199 │ │ │ out["instance_mask"] = self._validate_instance(input_ids) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:162 in _read_chunk_from_memmap │
│ │
│ 159 │ │ item_size = dtype(0).itemsize │
│ 160 │ │ bytes_start = index * item_size * self._chunk_size │
│ 161 │ │ num_bytes = item_size * self._chunk_size │
│ ❱ 162 │ │ buffer = get_bytes_range(path, bytes_start, num_bytes) │
│ 163 │ │ array = np.frombuffer(buffer, dtype=dtype) │
│ 164 │ │ if dtype == np.bool_: │
│ 165 │ │ │ return torch.tensor(array) │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:375 in get_bytes_range │
│ │
│ 372 │ │ │ │ parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│ 373 │ │ │ ) │
│ 374 │ │ elif parsed.scheme in ("http", "https"): │
│ ❱ 375 │ │ │ return _http_get_bytes_range( │
│ 376 │ │ │ │ parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│ 377 │ │ │ ) │
│ 378 │ │ elif parsed.scheme == "file": │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:649 in _http_get_bytes_range │
│ │
│ 646 │ ) │
│ 647 │ result = response.content │
│ 648 │ assert ( │
│ ❱ 649 │ │ len(result) == num_bytes │
│ 650 │ ), f"expected {num_bytes} bytes, got {len(result)}" # Some web servers silently ignor │
│ 651 │ return result │
│ 652 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: expected 4096 bytes, got 175
The above exception was the direct cause of the following exception:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/scripts/train.py:345 in <module> │
│ │
│ 342 │ │ raise OLMoCliError(f"Usage: {sys.argv[0]} [CONFIG_PATH] [OPTIONS]") │
│ 343 │ │
│ 344 │ cfg = TrainConfig.load(yaml_path, [clean_opt(s) for s in args_list]) │
│ ❱ 345 │ main(cfg) │
│ 346 │
│ │
│ /data/aaabbb/projects/OLMo/scripts/train.py:317 in main │
│ │
│ 314 │ │ │
│ 315 │ │ if not cfg.dry_run: │
│ 316 │ │ │ log.info("Starting training...") │
│ ❱ 317 │ │ │ trainer.fit() │
│ 318 │ │ │ log.info("Training complete") │
│ 319 │ │ else: │
│ 320 │ │ │ log.info("Dry run complete") │
│ │
│ /data/aaabbb/projects/OLMo/olmo/train.py:1181 in fit │
│ │
│ 1178 │ │ │
│ 1179 │ │ with torch_profiler as p: │
│ 1180 │ │ │ for epoch in range(self.epoch or 0, self.max_epochs): │
│ ❱ 1181 │ │ │ │ for batch in self.train_loader: │
│ 1182 │ │ │ │ │ # Bookkeeping. │
│ 1183 │ │ │ │ │ # NOTE: To track the global batch size / number of tokens per batch w │
│ 1184 │ │ │ │ │ # batches see the same number of tokens, which should be the case for │
│ │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│ │
│ 628 │ │ │ if self._sampler_iter is None: │
│ 629 │ │ │ │ # TODO(https://github.com/pytorch/pytorch/issues/76750) │
│ 630 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 631 │ │ │ data = self._next_data() │
│ 632 │ │ │ self._num_yielded += 1 │
│ 633 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 634 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│ │
│ 672 │ │
│ 673 │ def _next_data(self): │
│ 674 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 675 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIteration │
│ 676 │ │ if self._pin_memory: │
│ 677 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memory_device) │
│ 678 │ │ return data │
│ │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/_utils/f │
│ │
│ 29 │ │ │ data = [] │
│ 30 │ │ │ for _ in possibly_batched_index: │
│ 31 │ │ │ │ try: │
│ ❱ 32 │ │ │ │ │ data.append(next(self.dataset_iter)) │
│ 33 │ │ │ │ except StopIteration: │
│ 34 │ │ │ │ │ self.ended = True │
│ 35 │ │ │ │ │ break │
│ │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:179 in <genexpr> │
│ │
│ 176 │ │ │ │ │ threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│ 177 │ │ │ │ ) │
│ 178 │ │ │ │
│ ❱ 179 │ │ │ return (x for x in roundrobin(*thread_generators)) │
│ 180 │ │ else: │
│ 181 │ │ │ return (self._get_dataset_item(int(idx)) for idx in indices) │
│ 182 │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:738 in roundrobin │
│ │
│ 735 │ while num_active: │
│ 736 │ │ try: │
│ 737 │ │ │ for next in nexts: │
│ ❱ 738 │ │ │ │ yield next() │
│ 739 │ │ except StopIteration: │
│ 740 │ │ │ # Remove the iterator we just exhausted from the cycle. │
│ 741 │ │ │ num_active -= 1 │
│ │
│ /data/aaabbb/projects/OLMo/olmo/util.py:722 in threaded_generator │
│ │
│ 719 │ │
│ 720 │ for x in iter(q.get, sentinel): │
│ 721 │ │ if isinstance(x, Exception): │
│ ❱ 722 │ │ │ raise OLMoThreadError(f"generator thread {thread_name} failed") from x │
│ 723 │ │ else: │
│ 724 │ │ │ yield x │
│ 725 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OLMoThreadError: generator thread data thread 0 failed
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227462 closing signal SIGTERM
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227463 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227464 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227465 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227466 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227468 closing signal SIGTERM
W0818 12:33:57.744000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227469 closing signal SIGTERM
/home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
E0818 12:33:59.086000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 5 (pid: 1227467) of binary: /home/cde/anaconda3/envs/env_olmo_py311/bin/python
Traceback (most recent call last):
File "/home/cde/anaconda3/envs/env_olmo_py311/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-18_12:33:57
host : ubuntu
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 1227467)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
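From the traceback, the failure happens in `_http_get_bytes_range` (olmo/util.py): the data paths in the official config are http(s) URLs, so each chunk of the memmap dataset is fetched with an HTTP Range request, and here one of those requests returned 175 bytes instead of the expected 4,096. Below is a minimal sketch of the kind of request that fails, not the exact OLMo implementation; the URL is a placeholder:

```python
# Minimal sketch of the failing operation (not the exact OLMo code).
# Each training sample is read as a fixed-size byte range from a remote
# .npy shard via an HTTP Range request.
import requests

def http_get_bytes_range(url: str, bytes_start: int, num_bytes: int) -> bytes:
    response = requests.get(
        url,
        headers={"Range": f"bytes={bytes_start}-{bytes_start + num_bytes - 1}"},
    )
    response.raise_for_status()
    result = response.content
    # This is the check that fails in olmo/util.py: some servers silently
    # ignore Range headers or return a truncated/error body, so the payload
    # length no longer matches the requested chunk size.
    assert len(result) == num_bytes, f"expected {num_bytes} bytes, got {len(result)}"
    return result

# Hypothetical usage (placeholder URL):
# chunk = http_get_bytes_range("https://example.com/part-000-00000.npy", 0, 4096)
```

In other words, the remote data host returned a truncated or error response for one range request after a few minutes of streaming, and the assertion in `_http_get_bytes_range` surfaced as the `OLMoThreadError` above.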