bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!
Apache License 2.0

Finetune.py OOM when saving checkpoint if trained on 24GB 3090 #15

Open binaryninja opened 1 year ago

binaryninja commented 1 year ago

I am attempting to finetune the model using the command provided in the README. I am getting CUDA OutOfMemoryError:

OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 23.69 GiB total capacity; 21.01 GiB already allocated; 77.06 MiB free; 22.23 GiB reserved in total by PyTorch) If reserved memory is >> 
allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The hardware is a 24GB 3090. It goes out of memory when saving a checkpoint; otherwise the training runs fine.

To reproduce the error quickly I add --save_freq 2 to trigger it early on, e.g.: python3 finetune/finetune-split.py --model_path="bigcode/starcoder" --dataset_name="ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10000 --streaming --seq_length 256 --save_freq 2 --max_steps 1000 --batch_size 1 --input_column_name="question" --output_column_name="response"

I've reduced sequence length here but have tried other context lengths as well.

If I leave save_freq at its default, I get a full training run until the final stage, and then it crashes.

Here is an example wandb training run: Example

CMDLINE: (star2) gpu@gpu:~/code/starcoder$ python3 finetune/finetune.py --model_path="bigcode/starcoder" --dataset_name="ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10000 --streaming --seq_length 256 --save_freq 2 --max_steps 1000 --batch_size 1 --input_column_name="question" --output_column_name="response"

bin /home/gpu/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/home/gpu/.local/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/gpu/miniconda3/envs/star2/lib/libcudart.so'), PosixPath('/home/gpu/miniconda3/envs/star2/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /home/gpu/miniconda3/envs/star2/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /home/gpu/.local/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
Loading the dataset in streaming mode
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [00:27<00:00, 14.72it/s]
The character to token ratio of the dataset is: 3.46
Loading the model
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:30<00:00, 4.34s/it]
trainable params: 35553280 || all params: 15553009664 || trainable%: 0.22859421274773536
Starting main loop
Training...
/home/gpu/.local/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
wandb: Currently logged in as: richarjb. Use wandb login --relogin to force relogin
wandb: wandb version 0.15.2 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.14.2
wandb: Run data is saved locally in /home/gpu/code/starcoder/wandb/run-20230507_134929-5jvhgep5
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run StarCoder-finetuned

/home/gpu/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
/home/gpu/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/home/gpu/code/starcoder/finetune/finetune.py", line 314, in <module>
    main(args)
  File "/home/gpu/code/starcoder/finetune/finetune.py", line 303, in main
    run_training(args, train_dataset, eval_dataset)
  File "/home/gpu/code/starcoder/finetune/finetune.py", line 293, in run_training
    trainer.train()
  File "/home/gpu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/gpu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/gpu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/gpu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/home/gpu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2830, in save_model
    self._save(output_dir)
  File "/home/gpu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2873, in _save
    state_dict = self.model.state_dict()
  File "/home/gpu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/gpu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/gpu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  [Previous line repeated 4 more times]
  File "/home/gpu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/gpu/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
    self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
  File "/home/gpu/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 100, in undo_layout
    return outputs.reshape(rows, cols).contiguous()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 23.69 GiB total capacity; 21.01 GiB already allocated; 77.06 MiB free; 22.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

tarunKoyalwar commented 1 year ago

@binaryninja, not sure about fine-tuning, but I was facing the same error when loading the model. I think you need to change the batch size; for loading the model I used accelerate (https://huggingface.co/docs/accelerate/usage_guides/big_modeling)

binaryninja commented 1 year ago

I think you need to change batch size

I'm currently using --batch_size 1

tarunKoyalwar commented 1 year ago

@binaryninja, tbh the documentation here is very bad; I had a tough time just loading the model. I think you have to explore PEFT for the settings etc., since:

To fine-tune cheaply and efficiently, we use Hugging Face 🤗's PEFT

P.S.: I'm trying to fine-tune too; will let you know if anything works.

tarunKoyalwar commented 1 year ago

@binaryninja, loading the model using .from_pretrained does not work for me and seems to be the cause of the above.

I had to load it with accelerate and a custom device_map:

from accelerate import init_empty_weights
from accelerate import load_checkpoint_and_dispatch
from accelerate import infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM
from peft import prepare_model_for_int8_training

def run_training(args, train_data, val_data):
    print("Loading the model")
    config = AutoConfig.from_pretrained("bigcode/starcoderbase")
    print(config)
    # build the model on the meta device so no memory is allocated for weights yet
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)
    model.tie_weights()
    print("loading and dispatching")
    # cap GPU 0 at 13GiB and spill the remaining weights to CPU RAM
    my_device_map = infer_auto_device_map(model, max_memory={0: "13GiB", "cpu": "70GiB"})
    model = load_checkpoint_and_dispatch(
        model,
        "/home/ubuntu/.cache/huggingface/hub/models--bigcode--starcoderbase/snapshots/2417d4a7324a43db14b2a7729d17311d35dbde6e",
        device_map=my_device_map,
        no_split_module_classes=["GPTJBlock"],
    )
    # disable caching mechanism when using gradient checkpointing
    # model = AutoModelForCausalLM.from_pretrained(
    #     args.model_path,
    #     use_auth_token=True,
    #     use_cache=not args.no_gradient_checkpointing,
    #     load_in_8bit=True,
    #     device_map={"": Accelerator().process_index},
    # )
    print("done loading")
    model = prepare_model_for_int8_training(model)
 ....redacted....

The trick seems to be the custom device map:

my_device_map = infer_auto_device_map(model, max_memory={0: "13GiB", "cpu": "70GiB"})

I only had 1 GPU, so the key is 0; you need to change that according to your platform.
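For example, a minimal sketch of what the budget might look like on a different setup, reusing the meta-device model built above (the second GPU and the sizes below are hypothetical; set them to whatever your hardware actually has):

from accelerate import infer_auto_device_map

# hypothetical budgets: two GPUs capped below their physical memory to leave
# headroom for activations, plus CPU RAM for whatever does not fit on the GPUs
my_device_map = infer_auto_device_map(
    model,
    max_memory={0: "13GiB", 1: "13GiB", "cpu": "70GiB"},
)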

SivilTaram commented 1 year ago

@binaryninja For the default fine-tuning script, I think the memory required is around 26 GB, which exceeds the 24 GB in your configuration. If you would like to fine-tune it on your machine, integrating DeepSpeed is probably a must. I'm exploring it and will share feedback if I can get training to work with less than 24 GB of memory.
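For anyone who wants to experiment with that route, here is a rough sketch of how DeepSpeed could be hooked into the existing Trainer through TrainingArguments; the ZeRO stage, offload settings, and output_dir below are assumptions, not a verified recipe for fitting in 24 GB:

from transformers import TrainingArguments

# hypothetical ZeRO stage 2 config with optimizer-state offload to CPU, so the
# optimizer no longer competes with the model weights for GPU memory
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "bf16": {"enabled": True},
}

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    deepspeed=ds_config,  # accepts a dict or a path to a DeepSpeed JSON file
)

Depending on the transformers version, you may also need to start the script with the deepspeed launcher instead of plain python.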

rtk-jeremy-richards commented 1 year ago

There appears to be a related issue with bitsandbytes.

I'll downgrade to 0.37.2 and report back.

IeatToilets commented 1 year ago

Maybe you're running out of memory while using PyTorch? This error can occur if your GPU is already occupied by other processes, or if the reserved memory is much larger than the allocated memory. Either way, you can try setting max_split_size_mb to avoid fragmentation.
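If anyone wants to try that, a minimal sketch: the setting has to be in the environment before the first CUDA allocation, and the 128 MiB value below is only an example, not a tuned number:

import os

# must be set before PyTorch makes its first CUDA allocation,
# ideally before torch is imported at all
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

The same thing can be done by exporting the variable in the shell before launching finetune.py.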

fh1999 commented 1 year ago

How many 3090 cards do you use to fine-tune the model?

yakirba commented 1 year ago

Also fails with an Nvidia 4090 (24 GB, but faster).

yfeng-ic commented 1 year ago

Found the reason: using bitsandbytes==0.37.2 works.
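For anyone else landing here, the pin is just the line below (based only on the reports in this thread, not independently verified); it matches the traceback above, where the newer bitsandbytes calls undo_layout on the GPU while building the state_dict:

pip install bitsandbytes==0.37.2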

Maomaoxion commented 1 year ago

How long does it take to fine-tune? I'm stuck...