allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

OLMoThreadError: generator thread data thread 0 failed #706

Open ybdesire opened 1 month ago

ybdesire commented 1 month ago

❓ The question

I use the default config configs/official/OLMo-1B.yaml with only the wandb section removed, and train the model on a single node with 8x A800 GPUs via torchrun.
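
For completeness, the exact reproduction steps are (the config is the stock official one; the only change is that its wandb block was deleted by hand):

# single node, 8x A800
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml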

Training starts without any errors, but after a few steps (roughly 10 minutes in) it fails with the error below.

[2024-08-18 12:26:53] INFO     [olmo.train:966, rank=0] [step=1/739328,epoch=0]
    optim/total_grad_norm=9.355
    train/CrossEntropyLoss=11.35
    train/Perplexity=84,618
    throughput/total_tokens=4,194,304
    throughput/total_training_Gflops=6,215,264
    throughput/total_training_log_Gflops=15.64
    System/Peak GPU Memory (MB)=42,083
[2024-08-18 12:28:06] INFO     [olmo.train:966, rank=0] [step=2/739328,epoch=0]
    optim/total_grad_norm=59.20
    train/CrossEntropyLoss=10.57
    train/Perplexity=38,880
    throughput/total_tokens=8,388,608
    throughput/total_training_Gflops=12,430,528
    throughput/total_training_log_Gflops=16.34
    throughput/device/tokens_per_second=20,592
    throughput/device/batches_per_second=0.0393
    System/Peak GPU Memory (MB)=43,259
[2024-08-18 12:29:21] INFO     [olmo.train:966, rank=0] [step=3/739328,epoch=0]
    optim/total_grad_norm=28.37
    train/CrossEntropyLoss=10.70
    train/Perplexity=44,302
    throughput/total_tokens=12,582,912
    throughput/total_training_Gflops=18,645,793
    throughput/total_training_log_Gflops=16.74
    throughput/device/tokens_per_second=10,506
    throughput/device/batches_per_second=0.0200
[2024-08-18 12:30:37] INFO     [olmo.train:966, rank=0] [step=4/739328,epoch=0]
    optim/total_grad_norm=9.530
    train/CrossEntropyLoss=11.06
    train/Perplexity=63,518
    throughput/total_tokens=16,777,216
    throughput/total_training_Gflops=24,861,057
    throughput/total_training_log_Gflops=17.03
    throughput/device/tokens_per_second=8,949
    throughput/device/batches_per_second=0.0171
[2024-08-18 12:31:54] INFO     [olmo.train:966, rank=0] [step=5/739328,epoch=0]
    optim/total_grad_norm=17.78
    train/CrossEntropyLoss=10.69
    train/Perplexity=44,126
    throughput/total_tokens=20,971,520
    throughput/total_training_Gflops=31,076,321
    throughput/total_training_log_Gflops=17.25
    throughput/device/tokens_per_second=8,271
    throughput/device/batches_per_second=0.0158
[2024-08-18 12:33:13] INFO     [olmo.train:966, rank=0] [step=6/739328,epoch=0]
    optim/total_grad_norm=7.477
    train/CrossEntropyLoss=10.23
    train/Perplexity=27,666
    throughput/total_tokens=25,165,824
    throughput/total_training_Gflops=37,291,586
    throughput/total_training_log_Gflops=17.43
    throughput/device/tokens_per_second=7,890
    throughput/device/batches_per_second=0.0151
[2024-08-18 12:33:54] CRITICAL [olmo.util:163, rank=5] Uncaught OLMoThreadError: generator thread data thread 0 failed
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/olmo/util.py:709 in fill_queue                                        │
│                                                                                                  │
│   706 │                                                                                          │
│   707 │   def fill_queue():                                                                      │
│   708 │   │   try:                                                                               │
│ ❱ 709 │   │   │   for value in g:                                                                │
│   710 │   │   │   │   q.put(value)                                                               │
│   711 │   │   except Exception as e:                                                             │
│   712 │   │   │   q.put(e)                                                                       │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:174 in <genexpr>                        │
│                                                                                                  │
│   171 │   │   │                                                                                  │
│   172 │   │   │   thread_generators = []                                                         │
│   173 │   │   │   for i in range(num_threads):                                                   │
│ ❱ 174 │   │   │   │   generator = (self._get_dataset_item(int(idx)) for idx in indices[i::num_th │
│   175 │   │   │   │   thread_generators.append(                                                  │
│   176 │   │   │   │   │   threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│   177 │   │   │   │   )                                                                          │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:184 in _get_dataset_item                │
│                                                                                                  │
│   181 │   │   │   return (self._get_dataset_item(int(idx)) for idx in indices)                   │
│   182 │                                                                                          │
│   183 │   def _get_dataset_item(self, idx: int) -> Dict[str, Any]:                               │
│ ❱ 184 │   │   item = self.dataset[idx]                                                           │
│   185 │   │   if isinstance(item, dict):                                                         │
│   186 │   │   │   return dict(**item, index=idx)                                                 │
│   187 │   │   else:                                                                              │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:196 in __getitem__                        │
│                                                                                                  │
│   193 │   │   │   raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}")  │
│   194 │   │                                                                                      │
│   195 │   │   # Read the data from file.                                                         │
│ ❱ 196 │   │   input_ids = self._read_chunk_from_memmap(self._memmap_paths[memmap_index], memmap_ │
│   197 │   │   out: Dict[str, Any] = {"input_ids": input_ids}                                     │
│   198 │   │   if self.instance_filter_config is not None:                                        │
│   199 │   │   │   out["instance_mask"] = self._validate_instance(input_ids)                      │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/memmap_dataset.py:162 in _read_chunk_from_memmap            │
│                                                                                                  │
│   159 │   │   item_size = dtype(0).itemsize                                                      │
│   160 │   │   bytes_start = index * item_size * self._chunk_size                                 │
│   161 │   │   num_bytes = item_size * self._chunk_size                                           │
│ ❱ 162 │   │   buffer = get_bytes_range(path, bytes_start, num_bytes)                             │
│   163 │   │   array = np.frombuffer(buffer, dtype=dtype)                                         │
│   164 │   │   if dtype == np.bool_:                                                              │
│   165 │   │   │   return torch.tensor(array)                                                     │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:375 in get_bytes_range                                   │
│                                                                                                  │
│   372 │   │   │   │   parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│   373 │   │   │   )                                                                              │
│   374 │   │   elif parsed.scheme in ("http", "https"):                                           │
│ ❱ 375 │   │   │   return _http_get_bytes_range(                                                  │
│   376 │   │   │   │   parsed.scheme, parsed.netloc, parsed.path.strip("/"), bytes_start, num_byt │
│   377 │   │   │   )                                                                              │
│   378 │   │   elif parsed.scheme == "file":                                                      │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:649 in _http_get_bytes_range                             │
│                                                                                                  │
│   646 │   )                                                                                      │
│   647 │   result = response.content                                                              │
│   648 │   assert (                                                                               │
│ ❱ 649 │   │   len(result) == num_bytes                                                           │
│   650 │   ), f"expected {num_bytes} bytes, got {len(result)}"  # Some web servers silently ignor │
│   651 │   return result                                                                          │
│   652                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: expected 4096 bytes, got 175

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/aaabbb/projects/OLMo/scripts/train.py:345 in <module>                                      │
│                                                                                                  │
│   342 │   │   raise OLMoCliError(f"Usage: {sys.argv[0]} [CONFIG_PATH] [OPTIONS]")                │
│   343 │                                                                                          │
│   344 │   cfg = TrainConfig.load(yaml_path, [clean_opt(s) for s in args_list])                   │
│ ❱ 345 │   main(cfg)                                                                              │
│   346                                                                                            │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/scripts/train.py:317 in main                                          │
│                                                                                                  │
│   314 │   │                                                                                      │
│   315 │   │   if not cfg.dry_run:                                                                │
│   316 │   │   │   log.info("Starting training...")                                               │
│ ❱ 317 │   │   │   trainer.fit()                                                                  │
│   318 │   │   │   log.info("Training complete")                                                  │
│   319 │   │   else:                                                                              │
│   320 │   │   │   log.info("Dry run complete")                                                   │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/train.py:1181 in fit                                             │
│                                                                                                  │
│   1178 │   │                                                                                     │
│   1179 │   │   with torch_profiler as p:                                                         │
│   1180 │   │   │   for epoch in range(self.epoch or 0, self.max_epochs):                         │
│ ❱ 1181 │   │   │   │   for batch in self.train_loader:                                           │
│   1182 │   │   │   │   │   # Bookkeeping.                                                        │
│   1183 │   │   │   │   │   # NOTE: To track the global batch size / number of tokens per batch w │
│   1184 │   │   │   │   │   # batches see the same number of tokens, which should be the case for │
│                                                                                                  │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│                                                                                                  │
│    628 │   │   │   if self._sampler_iter is None:                                                │
│    629 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    630 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  631 │   │   │   data = self._next_data()                                                      │
│    632 │   │   │   self._num_yielded += 1                                                        │
│    633 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    634 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/dataload │
│                                                                                                  │
│    672 │                                                                                         │
│    673 │   def _next_data(self):                                                                 │
│    674 │   │   index = self._next_index()  # may raise StopIteration                             │
│ ❱  675 │   │   data = self._dataset_fetcher.fetch(index)  # may raise StopIteration              │
│    676 │   │   if self._pin_memory:                                                              │
│    677 │   │   │   data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)            │
│    678 │   │   return data                                                                       │
│                                                                                                  │
│ /home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/utils/data/_utils/f │
│                                                                                                  │
│   29 │   │   │   data = []                                                                       │
│   30 │   │   │   for _ in possibly_batched_index:                                                │
│   31 │   │   │   │   try:                                                                        │
│ ❱ 32 │   │   │   │   │   data.append(next(self.dataset_iter))                                    │
│   33 │   │   │   │   except StopIteration:                                                       │
│   34 │   │   │   │   │   self.ended = True                                                       │
│   35 │   │   │   │   │   break                                                                   │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/data/iterable_dataset.py:179 in <genexpr>                        │
│                                                                                                  │
│   176 │   │   │   │   │   threaded_generator(generator, maxsize=queue_size, thread_name=f"data t │
│   177 │   │   │   │   )                                                                          │
│   178 │   │   │                                                                                  │
│ ❱ 179 │   │   │   return (x for x in roundrobin(*thread_generators))                             │
│   180 │   │   else:                                                                              │
│   181 │   │   │   return (self._get_dataset_item(int(idx)) for idx in indices)                   │
│   182                                                                                            │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:738 in roundrobin                                        │
│                                                                                                  │
│   735 │   while num_active:                                                                      │
│   736 │   │   try:                                                                               │
│   737 │   │   │   for next in nexts:                                                             │
│ ❱ 738 │   │   │   │   yield next()                                                               │
│   739 │   │   except StopIteration:                                                              │
│   740 │   │   │   # Remove the iterator we just exhausted from the cycle.                        │
│   741 │   │   │   num_active -= 1                                                                │
│                                                                                                  │
│ /data/aaabbb/projects/OLMo/olmo/util.py:722 in threaded_generator                                │
│                                                                                                  │
│   719 │                                                                                          │
│   720 │   for x in iter(q.get, sentinel):                                                        │
│   721 │   │   if isinstance(x, Exception):                                                       │
│ ❱ 722 │   │   │   raise OLMoThreadError(f"generator thread {thread_name} failed") from x         │
│   723 │   │   else:                                                                              │
│   724 │   │   │   yield x                                                                        │
│   725                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OLMoThreadError: generator thread data thread 0 failed
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227462 closing signal SIGTERM
W0818 12:33:57.741000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227463 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227464 closing signal SIGTERM
W0818 12:33:57.742000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227465 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227466 closing signal SIGTERM
W0818 12:33:57.743000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227468 closing signal SIGTERM
W0818 12:33:57.744000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1227469 closing signal SIGTERM
/home/tqgpt/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0818 12:33:59.086000 139764668102464 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 5 (pid: 1227467) of binary: /home/cde/anaconda3/envs/env_olmo_py311/bin/python
Traceback (most recent call last):
  File "/home/cde/anaconda3/envs/env_olmo_py311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cde/anaconda3/envs/env_olmo_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-18_12:33:57
  host      : ubuntu
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1227467)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(env_olmo_py311) cde@ubuntu:/data/aaabbb/projects/OLMo$
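
If I'm reading the traceback correctly, the assertion that actually fails is in _http_get_bytes_range in olmo/util.py: at least one of the data.paths entries is an http/https URL, and the server answered a byte-Range request with 175 bytes instead of the 4096 that were requested (the code comment notes that some web servers silently ignore Range requests). A minimal standalone check of that fetch, outside of training, might look like the sketch below. This is only an illustration, not the exact call the trainer makes: the URL is a placeholder for one of the http(s) entries under data.paths in OLMo-1B.yaml, and bytes_start/num_bytes are example values matching the failing request.

import requests

# placeholder URL: substitute one of the http(s) entries under `data.paths` in OLMo-1B.yaml
url = "https://example.com/preprocessed/part-000-00000.npy"
bytes_start = 0        # example offset
num_bytes = 4096       # the amount the failing request asked for

response = requests.get(
    url,
    # Range is inclusive, so this asks for exactly num_bytes bytes
    headers={"Range": f"bytes={bytes_start}-{bytes_start + num_bytes - 1}"},
    timeout=30,
)
print(response.status_code)    # 206 = Range honored, 200 = Range ignored (whole file returned)
print(len(response.content))   # the trainer asserts this equals num_bytes

Running this in a loop would show whether the server (or an intermediate proxy) intermittently returns short responses, which would explain why training runs fine for a few steps and then dies once one request comes back truncated. Downloading the data files and pointing data.paths at local copies (e.g. file:// paths, which get_bytes_range also handles per the traceback) should sidestep HTTP entirely, though I haven't verified that here.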
NonvolatileMemory commented 1 day ago

Same issue here.