Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model, a low-resource Chinese llama+lora approach with a structure based on alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

Merging the 13B model with the fine-tuned 13B result: the process gets Killed outright, even though the machine still had 23 GB of RAM free and GPU memory was unused #8

Closed ZenXir closed 1 year ago

ZenXir commented 1 year ago

Hi maintainers, I used this command to merge the 13B model (merging the 7B model works and runs fine; merging the 13B model fails; the machine has 64 GB of RAM and every other process was shut down): python tools/merge_lora_for_cpp.py --model_path /mnt/e/zllama-models/llama-13b-hf --lora_path ./lora-Vicuna/checkpoint-13B/checkpoint-6000 --out_path ./lora-Vicuna/llama13b-checkpoint-6000-with-lora

I tried several times.

The first attempt reported this error:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /root/anaconda3/envs/Chinese-alpaca-lora did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('unix')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA exception! Error code: no CUDA-capable device is detected
CUDA exception! Error code: initialization error
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [02:41<00:00,  3.94s/it]
Traceback (most recent call last):
  File "/mnt/e/Chinese-Vicuna/tools/merge_lora_for_cpp.py", line 123, in <module>
    new_state_dict[new_k] = unpermute(v)
  File "/mnt/e/Chinese-Vicuna/tools/merge_lora_for_cpp.py", line 76, in unpermute
    w.view(n_heads, 2, dim // n_heads // 2, dim).transpose(1, 2).reshape(dim, dim)
RuntimeError: shape '[32, 2, 64, 4096]' is invalid for input of size 26214400
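
The shape in this traceback is consistent with 7B geometry being applied to a 13B tensor: 26214400 = 5120 * 5120 (a 13B attention projection), while [32, 2, 64, 4096] corresponds to n_heads = 32 and dim = 4096, which are the 7B values. A minimal sketch of a dimension-aware unpermute, assuming the layout shown in the traceback (the parameterized signature here is illustrative, not necessarily the repo's code):

```python
import torch

def unpermute(w: torch.Tensor, n_heads: int, dim: int) -> torch.Tensor:
    # Undo the rotary-embedding reordering used when exporting LLaMA attention
    # weights: regroup each head's rows, swap the interleaved halves back,
    # and flatten to a (dim, dim) matrix.
    return (
        w.view(n_heads, 2, dim // n_heads // 2, dim)
        .transpose(1, 2)
        .reshape(dim, dim)
    )

# 7B:  n_heads=32, dim=4096  ->  view(32, 2, 64, 4096)
# 13B: n_heads=40, dim=5120  ->  view(40, 2, 64, 5120), i.e. 26214400 elements
w_13b = torch.empty(5120, 5120)
_ = unpermute(w_13b, n_heads=40, dim=5120)  # works; the 7B constants do not
```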

A later attempt failed again, this time printing only "Killed":

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /root/anaconda3/envs/Chinese-alpaca-lora did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('unix')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA exception! Error code: no CUDA-capable device is detected
CUDA exception! Error code: initialization error
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/Chinese-alpaca-lora/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 41/41 [02:30<00:00,  3.67s/it]
Killed
Facico commented 1 year ago

@ZenXir In your case the memory blew up (there isn't enough): CPU inference is quite memory-hungry, and once the limit is exceeded the process is force-killed.
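
For reference, a minimal sketch of a memory-conscious CPU-side merge with Transformers + PEFT; it reuses the paths from the issue but is only an illustration, not the repo's merge_lora_for_cpp.py. Loading in float16 with low_cpu_mem_usage=True roughly halves peak RAM compared to a default float32 load, and sharding the output avoids holding one huge file in memory:

```python
import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Load the 13B base model on CPU in half precision, streaming weights in
# shard by shard instead of materialising a second full copy.
base = LlamaForCausalLM.from_pretrained(
    "/mnt/e/zllama-models/llama-13b-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Attach the LoRA adapter and fold its deltas into the base weights.
model = PeftModel.from_pretrained(base, "./lora-Vicuna/checkpoint-13B/checkpoint-6000")
merged = model.merge_and_unload()

# Save as multiple small shards so no single tensor dump dominates RAM.
merged.save_pretrained(
    "./lora-Vicuna/llama13b-checkpoint-6000-with-lora",
    max_shard_size="2GB",
)
```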

LZY-the-boys commented 1 year ago

@ZenXir You can try enlarging the swap space; the conversion step itself also consumes a lot of memory (see the sketch below). Also pull the latest update: the earlier code did not add support for sharded checkpoints, which can cause problems on the cpp side.
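
For completeness, a common way to add swap on Linux; the size and the /swapfile path are just an example, adjust them to your disk:

```bash
# Create and enable a 32 GB swap file (size is illustrative).
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # confirm the extra swap is active
```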

ZenXir commented 1 year ago

Got it, thank you all.