Open rustic-snob opened 7 months ago
What are your hardware specifications? Did you run out of RAM/VRAM?
I have A100-40GB*8 for VRAM, and RAM is like below:
              total        used        free      shared  buff/cache   available
Mem:          885Gi        81Gi       524Gi       9.4Gi       279Gi       787Gi
Swap:            0B          0B          0B
I monitored both throughout, and did not run out of either.
btw, is CUDA_VISIBLE_DEVICES="" necessary when doing python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml? I don't think I set it when I preprocessed.
Shouldn't be any issue.
May I ask which model size you're running? It wasn't that clear from the yaml.
It is just the Llama-2-7b-hf model with extra columns in the embedding and lm_head.
I don't know why, but after preprocessing without CUDA, training works well!
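For anyone hitting the same thing, this is roughly the sequence that ended up working (a sketch, assuming the same example config path as above):

# hide the GPUs so the preprocessing pass runs on CPU only
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml

# then launch training normally on the GPUs
accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml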
I get a similar error with
accelerate launch -m axolotl.cli.train llama_lora.yml --deepspeed deepspeed_configs/zero1.json
with the same config as in the examples, just with these additions:
lora_modules_to_save:
  - embed_tokens
  - lm_head
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
It works in the other 2 cases:
If I remove deepspeed.
If I change lora to qlora.
The error occurs after an epoch is complete.
In your case, it's usually out of system RAM when it's gathering the weights from the various GPUs.
@winglian yeah, the exit code is -9, which probably relates to a system RAM OOM issue, but why would that happen even though I had 800GB of free RAM?
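For what it's worth, exit code -9 means the process was SIGKILLed, and when that comes from the kernel OOM killer it leaves a trace in the kernel log; something like the following should confirm it (assuming you can read the kernel log on the machine):

# check whether the OOM killer reaped the training process
dmesg -T | grep -i -E "out of memory|killed process"
# or, on systemd-based systems
journalctl -k | grep -i "killed process"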
Sorry for the necro, but how do you solve this issue if renting compute?
I have the same problem.
The thing I noticed is this only happens after I resume training from a checkpoint, never during the first run (although I can see how this could also happen during the normal run), and it happens during saving a checkpoint (when the model is transferred from the GPU to the system memory). The problem is that we run out of system RAM and the OS kills the process to save itself (otherwise it would crash) - this is a normal behavior of the OS, but the question is why this happens.
If I start training, I can train with no problem (although, again, this is my case; I can see how others could have this problem even during this stage). In the image below you can see the system RAM usage. The "spikes" are when the checkpoint is being saved (it's set to every 100 steps because of this issue, so we do not lose too much training when it happens), and the thing to notice is that it sometimes uses more RAM for several steps and then drops down again:

[image: system RAM usage during the initial run, with spikes at each checkpoint save]

And I can train for as many steps as I like. But once I stop the training and restart it from the last checkpoint:

[image: system RAM usage after resuming from the last checkpoint]

It, for some reason, uses more RAM at startup and during the whole training; then, on top of this, it also has these moments when it consumes more RAM, up to the point where memory usage rises again and it runs out of system RAM.
It seems like something in system memory is not being cleaned up properly, and the charts suggest as much.
I'm using multi-GPU training with DeepSpeed ZeRO3 (I'm not using any CPU offload) and training part of the model in this case.
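In case anyone wants to reproduce this kind of chart without a full monitoring stack, a crude loop like the one below is enough to log system RAM over time and correlate the spikes with checkpoint saves (just a sketch; the 10-second interval and the log file name are arbitrary):

# append a timestamped copy of the "Mem:" line from free every 10 seconds
while true; do
  echo "$(date +%T) $(free -h | sed -n '2p')"
  sleep 10
done | tee ram_usage.log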
Please check that this issue hasn't been reported before.
Expected Behavior
I first ran
python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml
because I have a lot of data (total_num_tokens: 10394324568). It ran successfully and the data was saved in the last_run_prepared folder. After that, I ran
accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml
to train.

Current behaviour
But the training hangs from here for about 10 min. After 10 min, it crashes with these messages. I thought it was due to something like a time limit, so I modified the is_distributed function in distributed.py as below, but it did not help. I also tried
ddp_timeout: 99999
but that does not work, either.

Steps to reproduce
I just used the ver2.0.yml below, then ran preprocess and then train.

Config yaml
Possible solution
I think it has something to do with a time limit, but I don't know how to fix it.
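One thing that might at least give more signal, assuming the hang happens inside an NCCL collective (I haven't confirmed this fixes anything), is turning on NCCL logging and async error handling before launching:

# extra NCCL logging to see which rank stalls, and fail fast instead of hanging
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1
accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml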
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements