mickeysun0104 opened this issue 2 months ago
Same issue here on a single Titan V GPU (12GB). With the Hugging Face Trainer I can comfortably fit a batch of 4, but with Lightning I get OOM even with a single sample. Really not sure what the difference is...
@yznlp Did you also use DeepSpeed with PEFT via the Hugging Face functions and try to train in the Lightning training framework?
@mickeysun0104 sorry, I should have specified. I'm using the 7B LLaVA model with PEFT via the Hugging Face functions in the Lightning training framework, with precision set to 16-mixed. I haven't tried DeepSpeed but will have a look, thanks :)
I had the same issue with PyTorch Lightning 2.4.0. After several trials, the DeepSpeed strategy worked when I downgraded PyTorch Lightning to 2.1.0. It also worked in 2.0.5 but failed in 2.2.0.
@yznlp I'm not sure it's a problem with the DeepSpeed integration, since I also observed the same behavior (4 processes with 2 GPUs) with the DDP strategy. Thanks to @ChiShiang for testing and finding the temporary workaround for this issue. I won't close the issue now because I believe there's still a root cause that needs to be figured out. I'm fairly new to Lightning, so I don't think I can find the key difference between Lightning 2.1.0 and Lightning >= 2.2.0. Kindly tagging @awaelchli for a deeper dive into the issue. (I'm not sure who the main author is now.)
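As a quick way to compare the environments discussed above (Lightning 2.1.0 vs >= 2.2.0), a short version dump like the sketch below could be attached to each report. This helper is hypothetical and not part of the thread; it only prints versions of packages that are already in use here.

```python
# Print the package versions that seem relevant to the regression discussed above.
import deepspeed
import lightning
import peft
import torch
import transformers

for mod in (lightning, torch, transformers, peft, deepspeed):
    print(f"{mod.__name__}=={mod.__version__}")
```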
Bug description
I was able to fine-tune an 8B LLM using the Hugging Face training framework with PEFT + DeepSpeed stage 2 under fp16 precision (mixed-precision training). Recently I wanted to move my codebase to Lightning following our team's decision. However, I could not get the code to work due to an OOM issue, even though the settings on both sides are nearly identical. Here's the code: lightning-deepspeed.zip
lightning module
```python
import lightning as L
import torch
import os
from pathlib import Path
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, PeftModel
from lightning.pytorch.callbacks import Callback
from typing import Optional

LORA_CONFIG = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=['q_proj', 'k_proj', 'v_proj'],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    use_dora=False,
)


class BoringModule(L.LightningModule):
    def __init__(self,
                 model_name: str,
                 precision=torch.float16,
                 peft_cfg: LoraConfig = None,
                 token: str = None,
                 is_deepspeed_enabled: bool = True,
                 ):
        super().__init__()
        self.model_name = model_name
        self.precision = precision
        self.token = token
        self.peft_cfg = peft_cfg
        self.model = None
        self.deepspeed = is_deepspeed_enabled

    def configure_model(self):
        if self.model is not None:
            return
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            torch_dtype=torch.float16,
            device_map={"": torch.cuda.current_device()},
            trust_remote_code=True,
            token=self.token,
        )
        self.model.gradient_checkpointing_enable()
        self.model = get_peft_model(self.model, self.peft_cfg)

    def configure_optimizers(self):
        if self.deepspeed:
            from deepspeed.ops.adam import FusedAdam
            optimizer = FusedAdam(self.model.parameters(), lr=2e-4)
        else:
            optimizer = torch.optim.AdamW(self.model.parameters(), lr=2e-4)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
        return [optimizer], [scheduler]

    def forward(self, input_ids, attention_mask, label):
        return self.model(input_ids=input_ids,
                          attention_mask=attention_mask,
                          labels=label,
                          use_cache=False)

    def training_step(self, batch, batch_idx):
        output = self.forward(batch["input_ids"], batch["attention_mask"], batch["labels"])
        loss = output.loss
        self.log_dict({"train_loss": loss}, on_step=True, sync_dist=True)
        return loss

    def validation_step(self, batch, batch_idx):
        output = self.forward(batch["input_ids"], batch["attention_mask"], batch["labels"])
        loss = output.loss
        self.log_dict({"val_loss": loss}, on_step=True, sync_dist=True)
        return loss


class PeftCheckpoint(Callback):
    def __init__(self, dirpath: Optional[str] = None):
        super().__init__()
        self.dirpath = dirpath
        self.ckpt_dir = None
        self.current_ckpt = {}

    def on_validation_start(self, trainer: L.Trainer, pl_module: L.LightningModule) -> None:
        current_step = trainer.global_step
        if current_step != 0:
            if not trainer.default_root_dir and not self.dirpath:
                output_dir = os.getcwd()
            elif not self.dirpath or not trainer.default_root_dir:
                output_dir = self.dirpath if self.dirpath else trainer.default_root_dir
            else:
                raise ValueError("Got an output path from both the trainer and the callback, please provide the path from only one of them")
            self.ckpt_dir = os.path.join(output_dir, f"checkpoint-{current_step}")
            if not os.path.exists(self.ckpt_dir):
                Path(self.ckpt_dir).mkdir(parents=True, exist_ok=True)
            self.current_ckpt["dir"] = self.ckpt_dir

    def on_validation_end(self, trainer: L.Trainer, pl_module: L.LightningModule) -> None:
        if isinstance(pl_module.model, PeftModel) and self.ckpt_dir:
            pl_module.model.save_pretrained(self.ckpt_dir)
```
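For anyone hitting the same OOM: below is a minimal sketch of an alternative `configure_model`, assuming (not confirmed) that part of the extra memory comes from every rank materializing the full fp16 checkpoint pinned to its own device via `device_map`. The `low_cpu_mem_usage=True` flag is an assumption, and the `use_reentrant=False` kwarg simply mirrors the Hugging Face pipeline further below; treat this as a sketch, not a verified fix.

```python
# Hypothetical variant of configure_model (not the reporter's code): load the
# checkpoint without pinning it to the current device and let the strategy
# handle placement afterwards.
def configure_model(self):
    if self.model is not None:
        return
    self.model = AutoModelForCausalLM.from_pretrained(
        self.model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,  # assumption: avoid building a second full copy in memory
        trust_remote_code=True,
        token=self.token,
    )
    # mirrors the non-reentrant checkpointing used in the Hugging Face pipeline below
    self.model.gradient_checkpointing_enable(
        gradient_checkpointing_kwargs={"use_reentrant": False}
    )
    self.model = get_peft_model(self.model, self.peft_cfg)
```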
lightning training pipeline

```python
import lightning as L
from transformers import DataCollatorForSeq2Seq, AutoTokenizer
from pl_modules import BoringModule, LORA_CONFIG, PeftCheckpoint
from datasets import load_dataset
from torch.utils.data import DataLoader
from lightning.pytorch.strategies import DeepSpeedStrategy


def main():
    model_name = "meta-llama/Meta-Llama-3-8B"
    token = None

    # load data and keep necessary columns
    data = load_dataset("json",
                        data_files={"train": "./train_data.json",
                                    "val": "./val_data.json"},
                        split=["train[:100]", "val[:100]"])
    train_data, val_data = data[0], data[1]

    # init pl module
    peft_llm = BoringModule(model_name=model_name,
                            is_deepspeed_enabled=True,
                            peft_cfg=LORA_CONFIG,
                            token=token,
                            )
    tokenizer = AutoTokenizer.from_pretrained(model_name,
                                              token=token,
                                              padding_side="left",
                                              max_length=8192)

    # put them in the dataloaders
    data_collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)
    train_dataloader = DataLoader(train_data, batch_size=2, collate_fn=data_collator, num_workers=8)
    val_dataloader = DataLoader(val_data, batch_size=2, collate_fn=data_collator, num_workers=8)

    # init trainer and set the args
    peft_ckpt = PeftCheckpoint()
    trainer = L.Trainer(default_root_dir="./codetest",
                        accelerator="cuda",
                        callbacks=[peft_ckpt],
                        log_every_n_steps=5,
                        val_check_interval=5,
                        devices=2,
                        max_epochs=1,
                        precision="16-mixed",
                        num_sanity_val_steps=0,
                        enable_checkpointing=True,
                        strategy=DeepSpeedStrategy(config="./ds_config.json")
                        )
    trainer.fit(model=peft_llm, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)


if __name__ == "__main__":
    main()
```
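The actual `./ds_config.json` is only included in the attached zip, not in the issue text. For readers following along, here is a hypothetical stand-in showing the kind of ZeRO stage 2 + fp16 config the strategy expects; `DeepSpeedStrategy` also accepts a plain dict instead of a file path.

```python
# Hypothetical stand-in for ./ds_config.json (the real file is not shown here):
# a minimal ZeRO stage 2 + fp16 config, passed as a dict instead of a path.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "train_micro_batch_size_per_gpu": 2,  # matches batch_size=2 in the DataLoaders above
    "gradient_accumulation_steps": 1,
}
strategy = DeepSpeedStrategy(config=ds_config)
```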
huggingface training pipeline

```python
import torch
from transformers import AutoModelForCausalLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments, AutoTokenizer, HfArgumentParser
from peft import get_peft_model
from pl_modules import LORA_CONFIG
from datasets import load_dataset

MODEL = "meta-llama/Meta-Llama-3-8B"
TOKEN = None


def main():
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

    # load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(MODEL,
                                                 token=TOKEN,
                                                 torch_dtype=torch.float16,
                                                 trust_remote_code=True,
                                                 device_map={"": torch.cuda.current_device()})
    if training_args.gradient_checkpointing:
        training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}
        model.config.use_cache = False
    peft_model = get_peft_model(model, LORA_CONFIG)
    tokenizer = AutoTokenizer.from_pretrained(MODEL, token=TOKEN, max_length=8192, padding_side="left")

    # load data
    data = load_dataset("json",
                        data_files={"train": "./train_data.json",
                                    "val": "./val_data.json"},
                        split=["train[:100]", "val[:100]"])
    train_data, val_data = data[0], data[1]
    data_collator = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)

    # init trainer
    trainer = Trainer(model=peft_model,
                      args=training_args,
                      train_dataset=train_data,
                      eval_dataset=val_data,
                      data_collator=data_collator,
                      compute_metrics=None
                      )
    trainer.train()


if __name__ == "__main__":
    main()
```
command

- lightning

```bash
python pipeline.py > codetest.log 2>&1
```

- huggingface

```bash
deepspeed --num_gpus=2 hf-pipeline.py --output_dir ./hf_codetest --num_train_epochs 1 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --label_names labels --learning_rate 2e-4 --optim adamw_torch --lr_scheduler_type constant_with_warmup --fp16 True --evaluation_strategy steps --logging_steps 10 --save_steps 10 --eval_steps 10 --gradient_checkpointing True --gradient_accumulation_steps 1 --report_to none --deepspeed ./ds_config_hf.json > hf_codetest.log 2>&1
```

* If the code has trouble saving checkpoints, modify trainer.py L2401 to `logs["grad_norm"] = grad_norm.item()`; see this [issue](https://github.com/huggingface/transformers/issues/29207).

I've seen some issues discussing the use of Hugging Face models in the Lightning framework, and I also tried some of the suggestions; however, none of them worked :(
#17878 -> conflict about the device setting
#17043 -> loading the model properly in the configure_model hook should be alright
and some issues about using ZeRO stage 3 with Hugging Face pretrained models. I'm not listing all of them here since I'm trying to use ZeRO stage 2, which should be less complicated.
The weird part I observe during Lightning training is shown in the logs below: the code starts training with 4 processes even though I have only two GPUs. When I use the Hugging Face Trainer, it starts training with only 2 processes, which makes sense, and the GPU utilization is balanced.
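To pin down that 4-processes-on-2-GPUs observation, a small diagnostic hook could be added to the LightningModule. This is a hypothetical helper, not part of the original `BoringModule`; it only prints information that Lightning and PyTorch already expose.

```python
import os
import torch

# Hypothetical diagnostic hook (not in the original code): log which process/rank
# owns which device and how much memory is already allocated before the first step.
def on_train_start(self):
    allocated_gib = torch.cuda.memory_allocated(self.device) / 2**30
    print(
        f"pid={os.getpid()} "
        f"local_rank={os.environ.get('LOCAL_RANK')} "
        f"global_rank={self.trainer.global_rank} "
        f"world_size={self.trainer.world_size} "
        f"device={self.device} "
        f"allocated={allocated_gib:.1f} GiB"
    )
```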
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Error messages and logs
lightning log
```text
/home/ubuntu/lightning-llm/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1150: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-09-24 18:24:45,506] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/2
/home/ubuntu/lightning-llm/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1150: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-09-24 18:24:52,443] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Enabling DeepSpeed FP16. Model parameters and inputs will be cast to `float16`.
/home/ubuntu/lightning-llm/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1150: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
/home/ubuntu/lightning-llm/.venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:1150: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
current process device: 0
current process: 136236
current process: 0
current process: 0
Loading checkpoint shards:   0%|          | 0/4 [00:00, ?it/s]
current process device: 1
current process: 136372
current process: 1
current process: 1
Loading checkpoint shards:   0%|          | 0/4 [00:00, ?it/s]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:01<00:04, 1.57s/it]
Loading checkpoint shards:  25%|██▌       | 1/4 [00:01<00:05, 1.78s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:03<00:03, 1.56s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:03<00:03, 1.83s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:04<00:01, 1.59s/it]
Loading checkpoint shards:  75%|███████▌  | 3/4 [00:05<00:01, 1.81s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00, 1.26s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00, 1.38s/it]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Using /home/ubuntu/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00, 1.32s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00, 1.50s/it]
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py312_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Current model device: cuda:0
Current max memory: {0: '13522MB', 1: '13288MB'}
ninja: no work to do.
Loading extension module fused_adam...
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Using /home/ubuntu/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py312_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Current model device: cuda:1
Current max memory: {0: '14330MB', 1: '13216MB'}
ninja: no work to do.
Loading extension module fused_adam...

  | Name  | Type      | Params | Mode
--------------------------------------------
0 | model | PeftModel | 8.1 B  | train
--------------------------------------------
37.7 M    Trainable params
8.0 B     Non-trainable params
8.1 B     Total params
32,272.040 Total estimated model params size (MB)
866       Modules in train mode
454       Modules in eval mode
Time to load fused_adam op: 0.11289787292480469 seconds
Time to load fused_adam op: 0.11739063262939453 seconds
Training: |          | 0/? [00:00, ?it/s]
Training:   0%|          | 0/25 [00:00, ?it/s]
Epoch 0:   0%|          | 0/25 [00:00, ?it/s]
Current model device: cuda:0
Current gpu usage: 16412909056
==================================================Current model dtype: {torch.float16}==================================================
Currently using cache: False
Traceback (most recent call last):
  File "/home/ubuntu/lightning-llm/pipline.py", line 88, in
```

huggingface trainer log
```text
[2024-09-24 18:45:17,208] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-24 18:45:20,256] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-24 18:45:20,256] [INFO] [runner.py:585:main] cmd = /home/ubuntu/lightning-llm/.venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None hf-pipeline.py --output_dir ./hf_codetest --num_train_epochs 1 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --label_names labels --learning_rate 2e-4 --optim adamw_torch --lr_scheduler_type constant_with_warmup --fp16 True --evaluation_strategy steps --logging_steps 10 --save_steps 10 --eval_steps 10 --gradient_checkpointing True --gradient_accumulation_steps 1 --report_to none --deepspeed /home/ubuntu/lightning-llm/ds_config_hf.json
[2024-09-24 18:45:21,510] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-24 18:45:24,502] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-24 18:45:24,502] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-24 18:45:24,502] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(
```

Environment
Current environment
```
#- PyTorch Lightning Version: 2.4.0
#- PyTorch Version: 2.2.1
#- Python version: 3.12.3
#- OS (e.g., Linux): Ubuntu 24.04
#- CUDA/cuDNN version: 12.0
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source): pip
```

More info
Hardware information: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] ×2