Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

QLoRA / bnb.nf4 quantization causes issues in recent PyTorch Lightning/Fabric versions #1604

Closed: rasbt closed this 1 month ago

rasbt commented 1 month ago

Bug description

Either I'm doing something dumb or QLoRA seems to be broken. I tried it with different models:

LoRA (fine)

gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/lora.yaml       
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.03847,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7fae9a9a2140>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.1,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/lora-gemma-2b'),
 'precision': 'bf16-true',
 'quantize': None,
 'seed': 1337,
 'train': TrainArgs(save_interval=800,
                    log_interval=1,
                    global_batch_size=6,
                    micro_batch_size=2,
                    lr_warmup_steps=200,
                    lr_warmup_fraction=None,
                    epochs=2,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=512,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 11,870,208
Number of non-trainable parameters: 3,030,460,416
The longest sequence length in the train data is 512, the model's maximum sequence length is 512 and context length is 4096
Verifying settings ...
Missing logger folder: /teamspace/studios/this_studio/out/finetune/lora-gemma-2b/logs/csv
Epoch 1 | iter 1 step 0 | loss train: 115.482, val: n/a | iter time: 753.85 ms
Epoch 1 | iter 2 step 0 | loss train: 106.427, val: n/a | iter time: 381.31 ms
Epoch 1 | iter 3 step 1 | loss train: 101.139, val: n/a | iter time: 351.09 ms (step)
Epoch 1 | iter 4 step 1 | loss train: 95.109, val: n/a | iter time: 167.29 ms
Epoch 1 | iter 5 step 1 | loss train: 98.440, val: n/a | iter time: 121.49 ms
Epoch 1 | iter 6 step 2 | loss train: 104.927, val: n/a | iter time: 182.25 ms (step)

QLoRA from config file (not fine)

gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/qlora.yaml 
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': Alpaca2k(mask_prompt=False,
                  val_split_fraction=0.03847,
                  prompt_style=<litgpt.prompts.Alpaca object at 0x7f4ae444efb0>,
                  ignore_index=-100,
                  seed=42,
                  num_workers=4,
                  download_dir=PosixPath('data/alpaca2k')),
 'devices': 1,
 'eval': EvalArgs(interval=25,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.1,
 'lora_head': True,
 'lora_key': True,
 'lora_mlp': True,
 'lora_projection': True,
 'lora_query': True,
 'lora_r': 16,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': {'class_path': 'torch.optim.AdamW',
               'init_args': {'betas': [0.9, 0.95],
                             'lr': 0.0002,
                             'weight_decay': 0.0}},
 'out_dir': PosixPath('out/finetune/qlora-gemma-2b'),
 'precision': 'bf16-true',
 'quantize': 'bnb.nf4',
 'seed': 1337,
 'train': TrainArgs(save_interval=800,
                    log_interval=1,
                    global_batch_size=6,
                    micro_batch_size=2,
                    lr_warmup_steps=200,
                    lr_warmup_fraction=None,
                    epochs=2,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=512,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 23,740,416
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
    load_checkpoint(fabric, model, checkpoint_path, strict=False)
  File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
    model.load_state_dict(state_dict, strict=strict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
    return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
    hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
    return self.hook(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
    quantize_fn(weight)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
    if weight.data.dtype == torch.uint8:
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'

QLoRA without config file

gemma_2 ~/litgpt litgpt finetune_lora checkpoints/google/gemma-2b  --devices 1 --quantize bnb.nf4 --precision bf16-true
{'access_token': None,
 'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
 'data': None,
 'devices': 1,
 'eval': EvalArgs(interval=100,
                  max_new_tokens=100,
                  max_iters=100,
                  initial_validation=False,
                  final_validation=True),
 'logger_name': 'csv',
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_head': False,
 'lora_key': False,
 'lora_mlp': False,
 'lora_projection': False,
 'lora_query': True,
 'lora_r': 8,
 'lora_value': True,
 'num_nodes': 1,
 'optimizer': 'AdamW',
 'out_dir': PosixPath('out/finetune/lora'),
 'precision': 'bf16-true',
 'quantize': 'bnb.nf4',
 'seed': 1337,
 'train': TrainArgs(save_interval=1000,
                    log_interval=1,
                    global_batch_size=16,
                    micro_batch_size=1,
                    lr_warmup_steps=100,
                    lr_warmup_fraction=None,
                    epochs=5,
                    max_tokens=None,
                    max_steps=None,
                    max_seq_length=None,
                    tie_embeddings=None,
                    max_norm=None,
                    min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 921,600
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
    load_checkpoint(fabric, model, checkpoint_path, strict=False)
  File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
    model.load_state_dict(state_dict, strict=strict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
    return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
    hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
    return self.hook(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
    quantize_fn(weight)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
    if weight.data.dtype == torch.uint8:
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'
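
My reading of the trace (unverified beyond the lines above): Fabric's lazy checkpoint loading hands load_state_dict a dict of _NotYetLoadedTensor placeholders that only materialize into real tensors on certain accesses, while the bitsandbytes plugin's quantize-on-load hook touches weight.data before any materialization has happened. A minimal standalone sketch of that failure mode (the Sketch names are stand-ins I made up, not Lightning APIs):

import torch


class NotYetLoadedTensorSketch:
    # Stand-in for Fabric's lazy-load placeholder: it defers reading the
    # checkpoint and only forwards a whitelisted set of attributes.
    def __getattr__(self, name):
        # 'data' is not among the forwarded attributes, so the lookup
        # fails here, exactly as in lightning/fabric/utilities/load.py.
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")


def quantize_on_load_sketch(weight):
    # Mirrors the failing check in the bitsandbytes plugin: touching
    # weight.data on the placeholder triggers __getattr__ above before
    # the dtype comparison ever runs.
    if weight.data.dtype == torch.uint8:
        return weight
    return weight


lazy_weight = NotYetLoadedTensorSketch()
quantize_on_load_sketch(lazy_weight)  # AttributeError: ... no attribute 'data'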

What operating system are you using?

Unknown

LitGPT Version

litgpt 0.4.5 (Gemma 2 branch)
rasbt commented 1 month ago

Not related to the Gemma 2 branch; it also occurs on main.

rasbt commented 1 month ago

It doesn't seem to be related to the bitsandbytes or Lightning Fabric versions (the issue also occurs with bnb 0.41.3 and lightning 0.2.2). Maybe something in LitGPT has changed.

Andrei-Aksionov commented 1 month ago

It's not only QLoRA. I tried to simply generate/chat in a new Studio with a fresh venv, code from master, and the pythia-1b model. The same error occurs whenever quantization is applied.
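
For reference, the minimal repro along those lines should be roughly the following (flags from memory, so double-check against litgpt --help):

litgpt generate checkpoints/EleutherAI/pythia-1b --quantize bnb.nf4 --precision bf16-true
litgpt chat checkpoints/EleutherAI/pythia-1b --quantize bnb.nf4 --precision bf16-true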

rasbt commented 1 month ago

I'm not sure what's changed that could be causing this; we have bitsandbytes and lightning/fabric pinned.

Andrei-Aksionov commented 1 month ago

It's caused by PyTorch-Lightning. Try:

pip install lightning==2.3.0.dev20240428 

which is the version that the repo used before.

Andrei-Aksionov commented 1 month ago

This kind of issue needs to be caught by tests.

rasbt commented 1 month ago

Ohhh, so basically #1579. We can revert to an older version, but the question is whether there's something that needs to be updated in PyTorch-Lightning (in case this was an accidental change) or in LitGPT (so that we can support newer PTL versions moving forward). I'd appreciate your thoughts here @awaelchli.

rasbt commented 1 month ago

Added a quick PR (#1605) to add a test and revert the lightning version until we have more time to investigate.
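
For the record, here's the rough shape such a regression test can take. This is just a sketch, not the actual test in #1605: it exercises Fabric's bnb plugin with a toy module instead of a LitGPT model, and _lazy_load is Fabric's private lazy loader, so treat that import as an implementation detail:

import pytest
import torch
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision
from lightning.fabric.utilities.load import _lazy_load


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)


@pytest.mark.skipif(not torch.cuda.is_available(), reason="bitsandbytes requires CUDA")
def test_quantized_load_from_lazy_checkpoint(tmp_path):
    # Save an ordinary full-precision checkpoint.
    ckpt_path = tmp_path / "ckpt.pt"
    torch.save(TinyModel().state_dict(), ckpt_path)

    # Set up a model under the bnb.nf4 precision plugin, then feed it a
    # lazily loaded state dict -- the combination that raised
    # "'_NotYetLoadedTensor' object has no attribute 'data'".
    fabric = Fabric(accelerator="cuda", devices=1,
                    plugins=BitsandbytesPrecision(mode="nf4", dtype=torch.bfloat16))
    fabric.launch()
    with fabric.init_module(empty_init=True):
        model = TinyModel()
    model = fabric.setup(model)

    state_dict = _lazy_load(ckpt_path)  # values are _NotYetLoadedTensor placeholders
    model.load_state_dict(state_dict, strict=True)  # should not raise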

awaelchli commented 1 month ago

It's not really fixed. Downgrading the version avoids the problem, but isn't it conceivable that at some point LitGPT will want to support newer versions of Lightning? What happens then?

I think in such situations we should at least open a ticket on the library in question (Lightning in this case). Also, the stack trace hints at bitsandbytes being involved, so we'd need to collect the bnb version used as well. These are all essential steps that would help us resolve such issues efficiently.

rasbt commented 1 month ago

Yes, I just realized this too and reopened the issue a few seconds before you posted. Let me prepare an issue for the PyTorch Lightning issue tracker.

rasbt commented 1 month ago

See issue: https://github.com/Lightning-AI/pytorch-lightning/issues/20119

awaelchli commented 1 month ago

With the fix in https://github.com/Lightning-AI/pytorch-lightning/pull/20121, you can try updating the lightning package to the nightly produced next Sunday, or wait until the next regular release is out.
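
Assuming the dev builds keep being published to PyPI (like the 2.3.0.dev pin mentioned above), picking up the latest pre-release should be roughly:

pip install -U --pre lightning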

rasbt commented 1 month ago

Sounds great, thanks. I'll set a reminder to test this on Sunday/Monday!