Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Cannot copy out of meta tensor; no data! #1378

Closed: Gooooooogo closed this issue 1 week ago

Gooooooogo commented 2 weeks ago

When I run litgpt finetune lora --data Alpaca, I get the following error:

{'checkpoint_dir': PosixPath('checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0'),
 'data': Alpaca(mask_prompt=False, val_split_fraction=0.03865, prompt_style=<litgpt.prompts.Alpaca object at 0x7f1976ff0d00>, ignore_index=-100, seed=42, num_workers=4, download_dir=PosixPath('data/alpaca')),
 'devices': 3,
 'eval': EvalArgs(interval=100, max_new_tokens=100, max_iters=100, initial_validation=False),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': None,
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=1000, log_interval=1, global_batch_size=16, micro_batch_size=1, lr_warmup_steps=100, lr_warmup_fraction=None, epochs=1, max_tokens=None, max_steps=None, max_seq_length=None, tie_embeddings=None, learning_rate=0.0003, weight_decay=0.02, beta1=0.9, beta2=0.95, max_norm=None, min_lr=6e-05)}
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/3
[The same configuration is printed again by ranks 1 and 2; omitted here.]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/3
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/3
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 3 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Seed set to 1337
[rank: 2] Seed set to 1337
[rank: 1] Seed set to 1337
Number of trainable parameters: 1,126,400
Number of non-trainable parameters: 1,100,048,384
The longest sequence length in the train data is 1305, the model's maximum sequence length is 1305 and context length is 2048
Validating ...
Traceback (most recent call last):
  File "/home/jwan3704/litgpt-venv/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/__main__.py", line 143, in main
    fn(**kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 144, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 197, in main
    fit(
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 259, in fit
    validate(fabric, model, val_dataloader, dataclasses.replace(eval, max_iters=2))  # sanity check
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/finetune/lora.py", line 354, in validate
    logits = model(input_ids)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
...
    lora = self.zero_pad(after_B) * self.scaling  # (64, 64, 256) after zero_pad (64, 64, 384)
  File "/share/home/jwan3704/litgpt-venv/lib/python3.9/site-packages/litgpt/lora.py", line 345, in zero_pad
    self._lora_ind_cache[result.device] = lora_ind = self._lora_ind.to(result.device)
NotImplementedError: Cannot copy out of meta tensor; no data!
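
For context, the error itself comes from PyTorch's meta device: a meta tensor holds only shape and dtype metadata, with no backing storage, so copying it to a real device is impossible. A minimal repro, independent of litgpt, looks like this:

```python
import torch

# A meta tensor records shape/dtype only; there is no storage behind it.
t = torch.empty(3, device="meta")

# Any attempt to copy it to a real device fails with the same error as above:
# NotImplementedError: Cannot copy out of meta tensor; no data!
t.to("cpu")
```

In the traceback above, self._lora_ind was apparently still on the meta device when validate() tried to move it to the compute device, which suggests that buffer was never materialized during distributed/FSDP initialization.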
rasbt commented 2 weeks ago

Haven't had a chance to test or try it yet, but this looks familiar @robieta re #1374:

self._lora_ind_cache[result.device] = lora_ind = self._lora_ind.to(result.device)

NotImplementedError: Cannot copy out of meta tensor; no data!

It may or may not be related, but I'm curious: when you implemented #1374, did you test it on multi-GPU?

carmocca commented 1 week ago

Should be fixed by #770
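
For anyone hitting this before the fix lands: the general-purpose way to materialize a module whose parameters or buffers were created on the meta device is nn.Module.to_empty(), which allocates uninitialized storage on the target device before real weights are loaded. This is only an illustration of the pattern, not necessarily what the linked fix does:

```python
import torch
import torch.nn as nn

# Illustration only (not litgpt's actual fix): modules created under the meta
# device defer all storage allocation.
with torch.device("meta"):
    layer = nn.Linear(4, 4)

# layer.weight.to("cpu") would raise "Cannot copy out of meta tensor; no data!".
# to_empty() instead allocates uninitialized storage on the target device;
# real values must then be loaded afterwards, e.g. from a checkpoint.
layer = layer.to_empty(device="cpu")
```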