Closed · k21993 closed this issue 12 months ago
I have the same issue on a 48GB GPU; following to see what the solution is.
tl;dr: you can force the strategy to be DeepSpeed and it should run. The default DeepSpeed config is stage 2, which is effective even on a single GPU.
@griff4692 Thanks for the pointer! I hardcoded the strategy as `strategy = DeepSpeedStrategy(config=ds_config)` here, and it runs. However, there are two issues that I see:
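For anyone else wanting to do the same, here is a minimal sketch of the hardcoding described above. The variable names mirror the ones used in lora.py and the config values are illustrative, not the exact upstream file:

```python
# The DeepSpeed config used when hardcoding the strategy, as described above.
# Variable names mirror lora.py; the values here are illustrative.
micro_batch_size = 4
gradient_accumulation_iters = 16  # batch_size // micro_batch_size in the script

ds_config = {
    "train_micro_batch_size_per_gpu": micro_batch_size,
    "gradient_accumulation_steps": gradient_accumulation_iters,
    "zero_optimization": {"stage": 2},  # ZeRO stage 2: shard optimizer state + gradients
}

# Then, where the script builds Fabric (requires `pip install lightning deepspeed`):
#   from lightning.fabric.strategies import DeepSpeedStrategy
#   strategy = DeepSpeedStrategy(config=ds_config)
```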
Do you know why this is the case?
The devices constant in Lora.py is set to 1. You could try changing it and see what happens
Aah I didn't realize it was hardcoded there, thanks!
> The devices constant in Lora.py is set to 1. You could try changing it and see what happens
@awaelchli @lantiga maybe show a warning if more devices are available?
@k21993 LoRA with Falcon 7B should work on a single GPU with ~16 GB. If not, you can change `micro_batch_size = 4` to `micro_batch_size = 1` (it only affects the runtime) or try reducing the LoRA rank.
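Why shrinking the micro-batch size only affects runtime: the scripts compensate with gradient accumulation, so the effective batch size is unchanged. A sketch of the mechanism, using names that mirror lora.py (not the exact upstream code):

```python
# Relationship between batch_size and micro_batch_size in these fine-tuning
# scripts (names mirror lora.py; a sketch of the mechanism, not the file itself).
batch_size = 128          # effective batch size seen by the optimizer
micro_batch_size = 1      # what actually fits on the GPU per forward/backward

# Gradients are accumulated over this many micro-batches before each optimizer
# step, so lowering micro_batch_size trades runtime for memory without
# changing the effective batch size.
gradient_accumulation_iters = batch_size // micro_batch_size
print(gradient_accumulation_iters)  # 128
```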
What else did you change? Even if I change `micro_batch_size = 4` to `micro_batch_size = 1`, LoRA with Falcon 7B does not work on a single GPU with 24 GB.
That's weird. Here are the complete settings I used: https://github.com/rasbt/LLM-finetuning-scripts/blob/main/lit-benchmarks/falcon-7b/finetune/lora.py, run via
`python finetune/lora.py --checkpoint_dir checkpoints/tiiuae/falcon-7b/`
The peak memory use was 16.97 GB according to
print(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB", file=sys.stderr)
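For reference, that figure comes from `torch.cuda.max_memory_reserved()`, which returns a byte count. A tiny hypothetical helper (`format_mem` is not in the repo) that produces the same string:

```python
# Hypothetical helper mirroring the print above: converts the byte count
# returned by torch.cuda.max_memory_reserved() into the "16.97 GB" style
# figure quoted throughout this thread.
def format_mem(num_bytes: int) -> str:
    return f"Memory used: {num_bytes / 1e9:.02f} GB"

# Usage (requires a CUDA build of PyTorch):
#   import sys, torch
#   print(format_mem(torch.cuda.max_memory_reserved()), file=sys.stderr)
```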
@rasbt @aniketmaurya
I tried running with 8 A100 (80GB) GPUs with the settings:
batch_size = 64
micro_batch_size = 4
lora_r = 8
devices=8
It runs for ~15k iterations and eventually fails with:
OutOfMemoryError: CUDA out of memory. Tried to allocate 632.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 452.44 MiB is free. Process 147633 has 32.01 GiB memory in
use. Including non-PyTorch memory, this process has 46.70 GiB memory in use. Of the allocated memory 42.63 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but
unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
If I set `devices=1` and run `python finetune/lora.py --checkpoint_dir checkpoints/tiiuae/falcon-7b/`, it fails on startup itself.
If I set `devices=1` and hardcode `strategy=deepspeed`, it still uses a lot of memory:
Regarding the 1-GPU setting you have above, you mention `micro_batch_size = 4`. So if you set this to `micro_batch_size = 1`, then theoretically it should work: 67,775 MiB / 4 = 16,943 MiB.
Regarding multi-GPU training, it is currently set to DeepSpeed stage 2, which is not very memory-efficient (it optimizes for speed). If you set it to DeepSpeed stage 3, it is more memory-efficient, but there is currently a bug with stage 3 & multi-GPU (#161). The 1-GPU case should definitely work, though.
I have a fix in #171 that will reduce the memory requirements for fine-tuning and training
@carmocca Based on the PR, it seems like this is a fix for the `adapter` method but not `lora`. Can you outline the basic steps to make these changes for `lora`?
@k21993 The fix above also applies to `lora`.
Hey @carmocca I tried your fix and the memory requirement seems to be the same while the iteration time decreases from ~10s to ~7s.
Here's my config:
max_seq_len = 2048
micro_batch_size = 2
batch_size = 64
lora_r = 64
lora_alpha = 128
devices = 1
ds_config = {
"train_micro_batch_size_per_gpu": micro_batch_size,
"gradient_accumulation_steps": gradient_accumulation_iters,
"zero_optimization": {"stage": 2},
}
The memory occupied is the same (~73 GB)
I did not do a deep analysis, but here is what helped in my case (memory consumption is now constant at ~16 GB with a micro_batch_size of 1): First, I removed the SpeedMonitor because for some reason it needed a lot of memory. Second, I saw that more and more memory was consumed over the course of training -- I now call `torch.cuda.empty_cache()` every n iterations, and the memory consumption is now constant over time too.
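One way to structure the "every n iterations" call described above. This is a sketch (the cadence constant and helper name are made up, not from the repo); the `empty_cache()` call itself needs a CUDA build of PyTorch, so it is shown commented out:

```python
# Sketch of flushing the CUDA allocator cache every N iterations, as
# described above. EMPTY_CACHE_EVERY is a hypothetical cadence; tune it,
# since empty_cache() is expensive and calling it too often slows training.
EMPTY_CACHE_EVERY = 50

def maybe_empty_cache(iter_num: int, every: int = EMPTY_CACHE_EVERY) -> bool:
    """Return True on iterations where the cache should be flushed."""
    if iter_num > 0 and iter_num % every == 0:
        # torch.cuda.empty_cache()  # uncomment inside the real training loop
        return True
    return False
```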
I'm currently following the instructions for fine-tuning Falcon 7B with adapter V2 and ran into similar issues. I deleted the following lines in `train`:
if not isinstance(fabric.strategy, DeepSpeedStrategy): # unsupported
measured_flops = measure_flops(
model, torch.randint(0, 1, (micro_batch_size, model.config.block_size), device=fabric.device)
)
fabric.print(f"Measured TFLOPs: {measured_flops * fabric.world_size / 1e12:.2f}")
else:
measured_flops = None
and just replaced them with `measured_flops = None`. That seemed to fix everything for me on an NVIDIA RTX A6000 (48GB). That might be why setting the strategy to DeepSpeed seems to fix things.
I lied, I still ran into an OOM issue about 80 steps in after fixing a NaN problem (solved by using `--precision bf16-mixed`).
I've tried using `adapter_v2.py`, `adapter.py`, and `lora.py`. All quickly OOM on my 48GB GPU (within 80 steps). Not sure what's causing this yet.
EDIT: With some tweaking, changing these settings got me a few more steps (up to about 600) before OOM:
batch_size = 64 / devices
micro_batch_size = 1
Broadly, it'd be nice if the scripts referenced in the guide worked as reported. Even with all these tweaks, the minimum VRAM usage I'm seeing when training starts is ~30GB, not 16GB.
@fozziethebeat What's your `micro_batch_size` and `max_seq_len`? Since the sequence length is local to the batch, maybe it finds a batch later in your training that is big enough to cause an OOM.
I'm using the default `max_seq_length` as generated by running `python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/tiiuae/falcon-7b/`.
Looking at the config directly, it looks like it's 1079. That doesn't seem too extreme to me and is lower than the block size (2048) reported by falcon-7b.
So I'm having the same issue -- memory consumption is constant in general, but after about 50 steps an OOM is raised. I logged the sequence length, and in my case it's definitely because of the sequence length (thanks for the hint @k21993) -- it happens exactly after the "1079 sample" occurs. All other samples are <= 650 until this point, and exactly after this batch an OOM is raised -- which is fine IMO...
Update: When I restrict the token length, it trains without OOMs :) Still, it's worth mentioning that I use a 3090 GPU, so I have only 24 GB of RAM.
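Restricting the token length can be as simple as truncating each tokenized sample before batching. A sketch under the assumption that samples are lists of token ids (the helper name and the 650 cap are illustrative, taken from the lengths mentioned above, not from the repo):

```python
# One way to cap sequence length before batching, as described above.
# `samples` are token-id lists; truncation bounds peak activation memory
# at the cost of dropping the tail of long examples.
def truncate_samples(samples, max_seq_length=650):
    return [ids[:max_seq_length] for ids in samples]

samples = [list(range(1079)), list(range(300))]
capped = truncate_samples(samples)
print([len(s) for s in capped])  # [650, 300]
```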
I merged #173, that should fix the FLOPs counter issue.
I'll try replicating the sequence length issues you are seeing now
> So I'm having the same issue -- memory consumption is constant in general, but after about 50 steps an OOM is raised. I logged the sequence length, and in my case it's definitely because of the sequence length (thanks for the hint @k21993) -- it happens exactly after the "1079 sample" occurs. All other samples are <= 650 until this point, and exactly after this batch an OOM is raised -- which is fine IMO...
> Update: When I restrict the token length, it trains without OOMs :) Still, it's worth mentioning that I use a 3090 GPU, so I have only 24 GB of RAM.
Noticing the same thing on my end. Specifically iter 251 gets a token length around 600 and crashes on my 3090. I modified the script to skip any inputs above 600 and it trains a little longer but crashes later on around a 500 token input. It appears the memory usage slowly creeps up over a few minutes while training, maybe something is not being released correctly.
Hey all. Using current main, here's what I'm calling:
python finetune/adapter.py --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true
With `micro_batch_size=1`, I get a constant ~16GB use. It might seem to slowly creep up, but that is just the CUDA allocator keeping more than it needs. As https://github.com/Lightning-AI/lit-parrot/issues/159#issuecomment-1598193614 mentioned, `empty_cache()` will keep it down, but beware that it slows things down a lot, so call it sparingly if you need it.
In terms of model memory requirements, here's what you'd expect:
Number of trainable parameters: 1365330
Number of non trainable parameters: 7217189760
Sum: 7218555090
Model weights fp32: 7218555090 * 4 / 1e9 = 28.87 GB
AdamW fp32: 2 * 4 * 1365330 / 1e9 = 0.01 GB
This matches the observed 29.02 GB returned by `torch.cuda.memory_reserved()` with `--precision bf16-mixed`. Using `16-true` or `bf16-true`, the memory is halved.
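The arithmetic above can be checked directly. Weights dominate; AdamW only keeps its two extra fp32 states for the ~1.4M trainable adapter parameters:

```python
# The memory estimate above, spelled out. Every fp32 parameter costs 4 bytes;
# AdamW keeps two extra fp32 states (exp_avg, exp_avg_sq) per *trainable* param.
trainable = 1_365_330
non_trainable = 7_217_189_760
total = trainable + non_trainable            # 7_218_555_090

weights_gb = total * 4 / 1e9                 # fp32 weights
adamw_gb = 2 * 4 * trainable / 1e9           # optimizer states

print(f"{weights_gb:.2f} GB")  # 28.87 GB
print(f"{adamw_gb:.2f} GB")    # 0.01 GB
```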
All is working as expected so far. Now, if I force all inputs to be of the maximum sequence length for the alpaca dataset (1079), the max memory reserved does jump to 24.5 GB.
I'll open a PR trying to alleviate that jump, as it's caused by an autograd issue with `backward`. However, you might still need to tweak `max_seq_length` depending on your available GPU memory.
Thank you! This so far seems to be the needed fix.
Trying now at main and this so far is working really smoothly. Using the exact command you tried, I'm seeing ~29GB VRAM usage and no NaNs in my loss function. So far at step 600 and no issues.
I do see small memory increases but it's much less dramatic than before.
EDIT: posted too soon. Hit an OOM after iter 1599 step 100
I merged #178 which should be a small decrease in memory usage.
I'll also be adding #182 which includes a change so that the longest alpaca sequence is loaded first, so that OOM happens at the beginning.
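The idea behind loading the longest sequence first: if the peak-memory batch is seen at iteration 0, an OOM surfaces immediately instead of thousands of steps in. A sketch over tokenized samples (the helper name is illustrative, not the actual change in the PR):

```python
# Surface OOMs early by presenting the longest sample first, as described
# above: the worst-case memory batch then occurs at the very start of training.
def longest_first(samples):
    return sorted(samples, key=len, reverse=True)

samples = [[1, 2], [1, 2, 3, 4, 5], [1]]
print([len(s) for s in longest_first(samples)])  # [5, 2, 1]
```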
For the DeepSpeed issues: I'll be replacing it with FSDP in #118.
Closing this issue. Feel free to open new ones for any new issues. Thank you all
Should this be staying under 48GB VRAM usage when we run the command below at head?
python finetune/adapter.py \
--data_dir data/alpaca \
--checkpoint_dir checkpoints/tiiuae/falcon-7b \
--out_dir out/adapter/alpaca --precision bf16-true
I've just tried this out and I still see an OOM at iter 1599 step 100.
> Should this be staying under 48GB VRAM usage when we run the command below at head?
> `python finetune/adapter.py --data_dir data/alpaca --checkpoint_dir checkpoints/tiiuae/falcon-7b --out_dir out/adapter/alpaca --precision bf16-true`
> I've just tried this out and I still see an OOM at iter 1599 step 100.
Trying now on an A6000, and it looks like I am basically maxed out at ~48GB right from the start. So it's possible it moves a bit up/down from there and hits an OOM.
That's exactly what I noticed. It started at 100% VRAM usage and then something at iter 1599 step 100 kills it with the tiniest increase of memory.
@cipher982 @fozziethebeat With the latest changes, you should get a maximum usage of 24.5 GB at the beginning of training with true half precision and `micro_batch_size=1`.
I was running into the same OOM errors even after yesterday's merge when using LoRA with Falcon-7B-Instruct for fine-tuning. Going through the comments in this thread, I tried removing the speed monitor code from lora.py, and that helped constrain the memory issues on a single GPU for now. Also using true half precision.
EDIT: Although this worked with this config, it's still borderline in terms of GPU memory; a small spike is enough to error out with OOM.
> tried removing the speed monitor code from lora.py and that helped constrain the memory issues
The speed monitor shouldn't impact memory usage at all. Do you have a way to show that this is the case? It would be considered a bug if so
I am trying to reproduce the Falcon-7B Lora fine-tuning on the Alpaca dataset. I followed the steps to convert the checkpoints to lightning format, downloaded and tokenized the Alpaca dataset as instructed. When I run:
python finetune/lora.py --checkpoint_dir checkpoints/tiiuae/falcon-7b/
I get the following traceback:
It is also using just 1 GPU and not the 8 that I have. Please help me resolve these issues ASAP. Thanks!