MikaSie closed this issue 5 months ago
I fixed the issue! There were some things I did wrong:
Traceback (most recent call last):
File "/workspace/Thesis/training.py", line 705, in <module>
trainer.train()
File "/workspace/Thesis/venv/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 440, in train
output = super().train(*args, **kwargs)
File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/workspace/Thesis/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3250, in training_step
self.accelerator.backward(loss)
File "/workspace/Thesis/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2134, in backward
loss.backward(**kwargs)
File "/workspace/Thesis/venv/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/workspace/Thesis/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/workspace/Thesis/venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
So be sure not to use get_peft_model to wrap your model if you also pass a peft_config to the SFTTrainer!
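To illustrate (the model id, dataset, and LoRA values below are placeholders rather than my exact script): pass the LoRA config to SFTTrainer via peft_config and let it wrap the model once, instead of calling get_peft_model yourself first.

```python
# Minimal sketch (model id, dataset, and LoRA values are placeholders):
# pass peft_config to SFTTrainer and do NOT wrap the model with
# get_peft_model() yourself, otherwise the model gets wrapped twice.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder dataset

peft_config = LoraConfig(
    r=16,                                  # assumed LoRA values
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wrong: model = get_peft_model(model, peft_config)  # double wrapping
trainer = SFTTrainer(
    model=model,                 # plain base model, not a PeftModel
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # assumed column name
    peft_config=peft_config,     # SFTTrainer applies the adapters once, here
)
trainer.train()
```

With the model wrapped twice, the loss presumably ends up detached from the trainable adapter parameters, which is what produces the grad_fn error above.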
Looking back, it makes sense that my script worked with:
python3 training.py
When it was run this way, we were performing DP instead of FSDP, which also explains why training would take around 200 hours.
With my current setup I'm able to reduce training to 60-ish hours. With a per_device_train_batch_size of 1 and gradient_accumulation_steps of 4, the memory of my GPUs is almost maxed out. I think this is due to the long sequence length that is used.
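For context, here is a rough sketch of the training arguments in question; only per_device_train_batch_size and gradient_accumulation_steps are my actual values, the rest are illustrative:

```python
# Sketch of the training arguments described above; only
# per_device_train_batch_size and gradient_accumulation_steps are taken
# from my setup, everything else is an assumed/illustrative value.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama3-8b-summarization",   # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,          # effective batch size of 4 per device
    gradient_checkpointing=True,            # trades compute for memory on long sequences
    gradient_checkpointing_kwargs={"use_reentrant": False},
    bf16=True,
    logging_steps=10,
)
```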
If anyone has any recommendations on how to speed up the remainder of the training process, feel free to let me know!
Hardware: CPU: Xeon® E5-2630 v2, with only 16GB of RAM as this is what the vast.ai instance provides. GPU: 4x A40 --> 180GB in total
OS: Linux
Python: 3.10
CUDA: 12.2
Packages:
Issue
Introduction
Hi! I'm trying to fine-tune Llama3-8B on a summarization dataset of about 1500 instances. The dataset contains long documents, often over 8K tokens. I want to use FSDP + QLoRA to fine-tune Llama3-8B. When following this guide I was very hopeful this would be possible on my setup, as I'm fine-tuning the 8B version instead of the 70B version.
I'm following these two guides as inspiration: the bitsandbytes guide and Phil Schmid's guide.
Phil Schmid's guide mentions the following expected memory usage:
Full fine-tuning with FSDP needs ~16x 80GB GPUs
FSDP + LoRA needs ~8x 80GB GPUs
FSDP + Q-LoRA needs ~2x 40GB GPUs
FSDP + Q-LoRA + CPU offloading needs 4x 24GB GPUs, with 22 GB/GPU and 127 GB CPU RAM, at a sequence length of 3072 and a batch size of 1
Note: to NOT use CPU offloading, you need to change the value of fsdp and remove offload. This only works on >40GB GPUs since it requires more memory.
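Based on those guides, this is roughly the 4-bit quantization setup I understand is needed for FSDP + QLoRA (a hedged sketch; the values are illustrative, not a verbatim copy of my script):

```python
# Sketch of a QLoRA quantization config for FSDP, following the linked
# guides; values are illustrative, not necessarily the exact script.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # storage dtype so FSDP can shard the 4-bit weights
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # SDPA instead of Flash Attention 2
)
```

As far as I understand, the bnb_4bit_quant_storage dtype is the part the bitsandbytes guide highlights as necessary for FSDP to shard the quantized weights.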
Accelerate config setup:
Code
Start training
Errors:
First I followed the guides exactly and set fsdp_cpu_ram_efficient_loading to true. But when I did this, sometimes the OS would give a SIGKILL(9) error and stop the process. This makes sense, as Phil Schmid also recommends pretty hefty CPU memory: 127 GB of CPU RAM with a sequence length of 3072 and a batch size of 1.
But oddly enough, I can currently run the script with fsdp_cpu_ram_efficient_loading set to either true or false and not receive the SIGKILL(9) error. However, in both situations I do get the following OOM error:
As you can see, it seems the model runs out of memory during the backward pass. I find this pretty odd, as I should (probably) have enough GPU memory to accommodate the 8B FSDP + QLoRA setup.
Possible limitations
CPU has too little RAM. Offloading isn't possible because we only have 16GB of CPU RAM. But following Phil Schmid's guide, not offloading to the CPU should still suffice, as we use 4 A40s. This is even more odd considering I'm using the 8B version instead of the 70B versions used in both guides.
Not using Flash Attention 2 could also be an issue, but as seen in Phil Schmid's guide, SDPA can also be used.
Sequence length is too long, causing OOM. I tried setting the max_sequence_length to 512, but this didn't have any impact on the OOM issue.
Caveat
When I first dove into the rabbit hole of FSDP and QLoRA, I started out simple and just used the following code:
I launched the code with:
This didn't result in an OOM error and I was able to train for 100 steps. However, this took quite long and would become too expensive for me, as the training would probably last over 200 hours. I could see that GPU memory was utilized pretty well and all GPUs were filled up to 40GB or so. Because this took so long, I wanted to use QLoRA. But I couldn't just use QLoRA and device_map='auto' together. That's why I resorted to FSDP in combination with QLoRA.
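For illustration, that simple setup looked roughly like this (a hedged sketch, not the literal script; model id and dtype are assumptions):

```python
# Hedged sketch of the earlier, simple setup (launched with plain
# `python3 training.py`): no FSDP, the model is just spread across the
# visible GPUs with device_map="auto".
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map="auto",           # accelerate places layers across the 4 A40s
    torch_dtype=torch.bfloat16,
)
```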
I don't really know why using QLoRA in combination with FSDP would then result in the OOM again, which makes me even more confused.
If you have any ideas, please let me know as I'm getting a bit frustrated after being stuck on this for a few days!