MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Apache License 2.0
Troubleshooting for LoRA Fine-tuning of MiniCPM-V-2.5 -- ERR: (FAILED: multi_tensor_adam.cuda.o & fused_adam.so: cannot open shared object file: No such file or directory) #341
Environment Details
Issue Description
Encountered errors when attempting to use DeepSpeed for LoRA fine-tuning of MiniCPM-V-2.5.
Error 1: Ninja Build Failure (FAILED: multi_tensor_adam.cuda.o)
Error Message
Root Cause
The ninja -v build command that PyTorch's C++ extension compilation process runs while compiling DeepSpeed's fused Adam op exited with an error.
Solution
Modified the ninja command in the PyTorch utility script:
File: /envs/xxx/lib/python3.xx/site-packages/torch/utils/cpp_extension.py
Change: ['ninja', '-v'] to ['ninja', '--version']
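One caveat about this edit: ninja --version only prints the version number and exits, so the build step then reports success without compiling anything. That is a plausible reason the shared object in Error 2 was missing afterwards. Before editing cpp_extension.py, it can help to confirm that ninja itself is installed and runnable. The snippet below is a minimal sketch of such a check and is not part of the original post; ninja_version is a hypothetical helper name.

```python
import shutil
import subprocess
from typing import Optional


def ninja_version() -> Optional[str]:
    """Return the installed ninja version string, or None if ninja is not usable."""
    exe = shutil.which("ninja")
    if exe is None:
        # ninja is not on PATH at all
        return None
    try:
        result = subprocess.run(
            [exe, "--version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        # ninja exists but could not be executed, failed, or timed out
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = ninja_version()
    print("ninja not found on PATH" if version is None else f"ninja {version}")
```

If this reports a version, ninja itself is working, and the failure more likely comes from the compiler or CUDA toolchain that ninja invokes.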
Error 2: Shared Object File Not Found (fused_adam.so)
Error Message
Attempted Solution (Unsuccessful)
Successful Solution
Clone DeepSpeed repository:
Install with specific build flags:
Resolve the CUDA and GCC version conflict:
Lower the GCC version to 11.3
Reinstall DeepSpeed:
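The exact commands for these steps did not survive in this post, so the following is only a sketch of how the clone, compiler check, and reinstall could be scripted. It makes several assumptions: the upstream repository URL is https://github.com/microsoft/DeepSpeed.git, the build flag used is DeepSpeed's DS_BUILD_FUSED_ADAM=1 pre-build variable, and GCC 11.3 is treated as the newest acceptable compiler because that is the version reported above (the real ceiling depends on the installed CUDA toolkit). All function names are hypothetical.

```python
import os
import re
import subprocess
import sys
from typing import Dict, List, Tuple

# Assumption: upstream DeepSpeed repository; the post does not state the URL.
DEEPSPEED_REPO = "https://github.com/microsoft/DeepSpeed.git"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a leading dotted version such as '11.3.0' into (11, 3, 0)."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def gcc_version(gcc: str = "gcc") -> Tuple[int, ...]:
    """Ask GCC for its version; both flags are passed so old and new GCCs answer."""
    out = subprocess.run(
        [gcc, "-dumpfullversion", "-dumpversion"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_version(out)


def gcc_too_new(found: Tuple[int, ...], ceiling: Tuple[int, ...]) -> bool:
    """Compare only as many components as the ceiling specifies."""
    return found[: len(ceiling)] > ceiling


def build_install(prebuild_ops: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Return the environment and pip command for a source install."""
    env = dict(os.environ)
    for op in prebuild_ops:
        env[f"DS_BUILD_{op}"] = "1"  # e.g. DS_BUILD_FUSED_ADAM=1
    return env, [sys.executable, "-m", "pip", "install", "."]


def clone_and_install(checkout: str, ceiling: Tuple[int, ...] = (11, 3)) -> None:
    """Clone DeepSpeed into `checkout` and install it with fused Adam pre-built."""
    found = gcc_version()
    if gcc_too_new(found, ceiling):
        raise RuntimeError(f"GCC {found} is newer than assumed ceiling {ceiling}")
    subprocess.run(["git", "clone", DEEPSPEED_REPO, checkout], check=True)
    env, cmd = build_install(["FUSED_ADAM"])
    subprocess.run(cmd, cwd=checkout, env=env, check=True)
```

Calling clone_and_install("DeepSpeed") would run all three steps. Checking the compiler first means a version mismatch is reported before a long build starts.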
Outcome
After applying the solutions above, DeepSpeed installed successfully and the LoRA fine-tuning code ran without errors. Hope that helps a little.
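To confirm the outcome without launching a full fine-tuning run, one option is to ask DeepSpeed to load the fused Adam op directly. The sketch below assumes DeepSpeed's FusedAdamBuilder class in deepspeed.ops.op_builder, with its is_compatible() and load() methods; fused_adam_status is a hypothetical helper name and was not part of the original post.

```python
import importlib


def fused_adam_status() -> str:
    """Report whether DeepSpeed's fused Adam op can be loaded in this environment."""
    try:
        op_builder = importlib.import_module("deepspeed.ops.op_builder")
    except ImportError:
        return "deepspeed is not installed"
    try:
        builder = op_builder.FusedAdamBuilder()
        if not builder.is_compatible():
            return "fused_adam is not compatible with this environment"
        # Uses the pre-built op if one was installed, otherwise compiles it now.
        builder.load()
    except Exception as exc:
        return f"fused_adam failed to load: {exc}"
    return "fused_adam loaded"


if __name__ == "__main__":
    print(fused_adam_status())
```

DeepSpeed's ds_report command gives a broader view of which ops are installed and compatible.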