Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model — a low-resource Chinese LLaMA + LoRA solution, with a structure modeled on Alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

model saving error #81

Closed imrankh46 closed 1 year ago

imrankh46 commented 1 year ago

The trainer does not save the model weights. It gives me the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.75 
GiB total capacity; 12.97 GiB already allocated; 6.81 MiB free; 13.69 GiB 
reserved in total by PyTorch) If reserved memory is >> allocated memory try 
setting max_split_size_mb to avoid fragmentation.  See documentation for Memory 
Management and PYTORCH_CUDA_ALLOC_CONF
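As the traceback itself suggests, one mitigation worth trying before launching training is to cap the allocator's split block size via `PYTORCH_CUDA_ALLOC_CONF`. This is only a fragmentation workaround, not a fix confirmed in this thread, and the value 128 below is an illustrative choice, not a recommendation:

```shell
# Ask PyTorch's caching allocator to limit split block size, which can
# reduce fragmentation when reserved memory far exceeds allocated memory.
# 128 MiB is an arbitrary example value; tune it for your workload.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

The variable must be set in the environment of the training process (e.g. exported in the shell that runs the script) before PyTorch initializes CUDA.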
Facico commented 1 year ago

Your error is that you are exceeding the GPU memory limit; it should be unrelated to model saving. Did your program train properly while it was running?

imrankh46 commented 1 year ago

> Your error is that you are exceeding the GPU memory limit; it should be unrelated to model saving. Did your program train properly while it was running?

No. Only after all the epochs completed did it show this behavior. We cannot save the LLaMA weights like other models using the trainer.save_pretrained() method or model.save_model().

SunnyMarkLiu commented 1 year ago

Same error to me!

Facico commented 1 year ago

What is the version of your transformers?

imrankh46 commented 1 year ago

> Same error to me!

I solved the error. Just add this line before saving:

model.cpu()

and then save the model.
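For context, the fix above amounts to moving the model's weights off the GPU before serializing, so the save path allocates host memory instead of scarce CUDA memory. A minimal sketch of the pattern, using a small nn.Linear as a stand-in (in the real script this would be the fine-tuned LLaMA + LoRA model held by the trainer):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the fine-tuned model; substitute the
# actual LLaMA + LoRA model from the trainer here.
model = nn.Linear(8, 8)

# Move all parameters to host (CPU) memory first, so serialization
# does not require any fresh CUDA allocations on an already-full GPU.
model = model.cpu()

# Then save. With a transformers/PEFT model this would typically be
# model.save_pretrained(output_dir) instead of torch.save.
torch.save(model.state_dict(), "adapter_weights.pt")
```

If training is going to continue afterwards, the model would need to be moved back to the GPU with `model.cuda()` (or `.to(device)`).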

imrankh46 commented 1 year ago

> What is the version of your transformers?

Same as yours.

Facico commented 1 year ago

@imrankh46 Our transformers is installed directly from GitHub, so there may be a slight difference. The commit hash of our transformers at the time was roughly ff20f9cf3615a8638023bc82925573cb9d0f3560. You may be able to solve the problem by uninstalling transformers and reinstalling it as "git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560"
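If you go this route, the reinstall would look something like the following (an untested sketch; run it inside the same environment used for training):

```shell
# Remove the currently installed transformers, then pin to the commit
# mentioned above via pip's git+https VCS install syntax.
pip uninstall -y transformers
pip install "git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560"
```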

imrankh46 commented 1 year ago

> @imrankh46 Our transformers is installed directly from GitHub, so there may be a slight difference. The commit hash of our transformers at the time was roughly ff20f9cf3615a8638023bc82925573cb9d0f3560. You may be able to solve the problem by uninstalling transformers and reinstalling it as "git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560"

I tried, but it is not working. I think the LLaMA model code or tokenizer is written in C++. The model trains successfully; the CUDA out-of-memory error appears when saving.

I will also try your approach.

Facico commented 1 year ago

There is the same issue in another repo. You can also refer to their method of downgrading the bitsandbytes version.

imrankh46 commented 1 year ago

> There is the same issue in another repo. You can also refer to their method of downgrading the bitsandbytes version.

Thank you.