Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

openweb_trainer.py crashes after 6k iters #19631

Open salykova opened 9 months ago

salykova commented 9 months ago

Hi all, when running openwebtext_trainer.py with the default settings, the program crashes after 6k steps with the following message:

(.venv) slkv@slkv-pc:~/Projects/project_brain$ python ./src/pretrain/openwebtext_trainer.py
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Seed set to 1337
{'model_name': 'pythia-70m', 'name': 'openwebtext', 'save_interval': 1000, 'eval_interval': 1000, 'eval_iters': 100, 'log_interval': 1, 'learning_rate': 0.0006, 'batch_size': 125, 'micro_batch_size': 5, 'gradient_accumulation_steps': 25, 'max_iters': 600000, 'weight_decay': 0.1, 'beta1': 0.9, 'beta2': 0.95, 'decay_lr': True, 'warmup_iters': 2000, 'lr_decay_iters': 600000, 'min_lr': 6e-05}
Loading model with {'name': 'pythia-70m', 'hf_config': {'org': 'EleutherAI', 'name': 'pythia-70m'}, 'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 128, 'padded_vocab_size': 50304, 'n_layer': 6, 'n_head': 8, 'n_embd': 512, 'rotary_percentage': 0.25, 'parallel_residual': True, 'bias': True, 'lm_head_bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'gelu_approximate': 'none', 'intermediate_size': 2048, 'rope_condense_ratio': 1, 'rope_base': 10000, 'head_size': 64, 'rope_n_elem': 16}
Time to instantiate model: 0.00 seconds.
/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py:186: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.
/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /home/slkv/Projects/project_brain/out/openwebtext exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type | Params
--------------------------------
0 | module | GPT  | 70.4 M
--------------------------------
70.4 M    Trainable params
0         Non-trainable params
70.4 M    Total params
281.706   Total estimated model params size (MB)
Epoch 0: |          | 6000/? [09:40<00:00, 10.34it/s, v_num=0, train_loss=4.810, val_loss=5.050]
Traceback (most recent call last):
  File "/home/slkv/Projects/project_brain/./src/pretrain/openwebtext_trainer.py", line 234, in <module>
    CLI(main)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, cfg_init)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/jsonargparse/_cli.py", line 181, in _run_component
    return component(**cfg)
  File "/home/slkv/Projects/project_brain/./src/pretrain/openwebtext_trainer.py", line 191, in main
    trainer.fit(model, train_dataloader, val_dataloader, ckpt_path="last")
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 988, in _run
    results = self._run_stage()
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
    self.fit_loop.run()
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 204, in run
    self.advance()
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 360, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 139, in run
    self.on_advance_end(data_fetcher)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 287, in on_advance_end
    self.val_loop.run()
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 410, in _evaluation_step
    call._call_callback_hooks(trainer, hook_name, output, *hook_kwargs.values())
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/throughput_monitor.py", line 193, in on_validation_batch_end
    self._update(trainer, pl_module, batch, iter_num)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/throughput_monitor.py", line 146, in _update
    throughput.update(
  File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/fabric/utilities/throughput.py", line 143, in update
    raise ValueError(f"Expected lengths ({lengths}) to be greater or equal than samples ({samples})")
ValueError: Expected lengths (2048) to be greater or equal than samples (2505)
Epoch 0: |          | 6000/? [09:40<00:00, 10.33it/s, v_num=0, train_loss=4.810, val_loss=5.050]             

The OpenWebText data was generated using prepare_openwebtext.py. Do you know what might be causing the error? pretrain/openwebtext.py, however, works fine.
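From the traceback, the check that fires is in Throughput.update, which expects the running token count (lengths) to be at least the running sample count (samples), presumably because every sample should contribute at least one token. Below is a minimal sketch that reproduces the same error outside the trainer; the keyword arguments mirror the call shown in the traceback, and relying on the default constructor arguments is an assumption on my part:

# Minimal sketch, not the trainer script: trigger the same ValueError directly,
# assuming Throughput.update takes cumulative counters as keyword arguments
# (time, batches, samples, lengths) as they appear in the traceback.
from lightning.fabric.utilities.throughput import Throughput

throughput = Throughput()

# lengths (2048) < samples (2505) breaks the "at least one token per sample"
# invariant, so this call should raise:
# ValueError: Expected lengths (2048) to be greater or equal than samples (2505)
throughput.update(time=1.0, batches=1, samples=2505, lengths=2048)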

cc @carmocca

carmocca commented 9 months ago

I'll take a look. Thanks for the report!

fsnaix commented 7 months ago

Has this been fixed in any way? I'm also encountering this bug. I've tried adding padding/truncation for my context length, but the furthest it gets is around 8000 iterations.
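For reference, the invariant above seems to imply that the length_fn passed to ThroughputMonitor has to return the total number of tokens in the batch (batch size times sequence length), not just the sequence length. My monitor setup is roughly the sketch below; the batch layout, a (batch_size, block_size) tensor of token ids, is an assumption:

# Sketch only: configure ThroughputMonitor so that the accumulated lengths
# stay >= the accumulated samples, assuming each batch is a
# (batch_size, block_size) LongTensor of token ids.
from lightning.pytorch.callbacks import ThroughputMonitor

throughput_monitor = ThroughputMonitor(
    batch_size_fn=lambda batch: batch.size(0),              # samples in this batch
    length_fn=lambda batch: batch.size(0) * batch.size(1),  # tokens in this batch
)
# trainer = Trainer(callbacks=[throughput_monitor], ...)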

carmocca commented 5 months ago

Transferring this to lightning since this file no longer exists, but there's still an underlying bug.

tjb-tech commented 4 months ago


May I ask whether you have addressed it? Or is there any quick fix for pretraining on OpenWebText?