Troubleshooting LoRA Fine-tuning of MiniCPM-V-2.5: "FAILED: multi_tensor_adam.cuda.o" and "fused_adam.so: cannot open shared object file: No such file or directory"

Environment Details

OS: Ubuntu 22.04
NVIDIA Driver: 535.54.03
CUDA Version (nvcc -V): 12.2
Python Version: 3.10
PyTorch Version: 2.3.0+cu121
GCC Version: 13 (initially)

Issue Description

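Two errors occurred when attempting to use DeepSpeed for LoRA fine-tuning of MiniCPM-V-2.5. Both involve DeepSpeed's fused Adam CUDA extension (fused_adam), which has to be compiled before it can be loaded.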

Error 1: Ninja Build Failure (FAILED: multi_tensor_adam.cuda.o)

Error Message

FAILED: multi_tensor_adam.cuda.o
ninja: build stopped: subcommand failed.

Root Cause

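The ninja build that PyTorch's C++ extension machinery launches fails while compiling multi_tensor_adam.cu. In hindsight, the underlying cause was most likely the CUDA and GCC version conflict resolved under Error 2: nvcc from CUDA 12.2 does not accept GCC 13 as a host compiler.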

Solution

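Modified the ninja command in the PyTorch utility script:

File: /envs/xxx/lib/python3.xx/site-packages/torch/utils/cpp_extension.py
Change: ['ninja', '-v'] to ['ninja', '--version']

Note that this only silences the error. With --version, ninja prints its version and exits without building anything, so the extension is never compiled. That is probably why Error 2 showed up next.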
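Before patching cpp_extension.py, it may be worth checking whether the installed GCC is one that nvcc accepts as a host compiler, since that turned out to be the underlying conflict here. The sketch below parses the output of `nvcc -V` and `gcc --version` and compares them against a small table. The table values are an assumption from memory (CUDA 12.0 to 12.3 accepting GCC up to 12, CUDA 12.4 accepting GCC 13) and should be confirmed against the CUDA installation guide; the function names are made up for this example.

```python
import re
import shutil
import subprocess

# Assumed upper bound on the GCC major version nvcc accepts as host compiler.
# Illustrative only; confirm against the CUDA installation guide.
MAX_GCC_MAJOR = {
    (11, 8): 11,
    (12, 0): 12,
    (12, 1): 12,
    (12, 2): 12,
    (12, 3): 12,
    (12, 4): 13,
}


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the text printed by `nvcc -V`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_gcc_major(gcc_output):
    """Extract the major version from the first line of `gcc --version`."""
    first_line = gcc_output.splitlines()[0] if gcc_output else ""
    match = re.search(r"(\d+)\.\d+\.\d+", first_line)
    return int(match.group(1)) if match else None


def host_compiler_ok(nvcc_output, gcc_output):
    """Return True or False for a known pair, None if it cannot be judged."""
    cuda = parse_nvcc_release(nvcc_output)
    gcc_major = parse_gcc_major(gcc_output)
    if cuda is None or gcc_major is None or cuda not in MAX_GCC_MAJOR:
        return None
    return gcc_major <= MAX_GCC_MAJOR[cuda]


def tool_output(cmd):
    """Run a command and return its stdout, or '' if the tool is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return ""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    verdict = host_compiler_ok(tool_output(["nvcc", "-V"]),
                               tool_output(["gcc", "--version"]))
    print("host compiler accepted by nvcc:", verdict)
```

A result of `False` suggests fixing the compiler first; `None` means a tool was not found or the CUDA release is not in the table.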

Error 2: Shared Object File Not Found (fused_adam.so)

Error Message

ImportError: /home/xxxxx/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
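To see whether the shared object was ever produced, the extension cache can be inspected directly. This is a minimal sketch that assumes PyTorch's usual cache layout (the `TORCH_EXTENSIONS_DIR` environment variable if set, otherwise `~/.cache/torch_extensions`); the helper names are hypothetical.

```python
import os
from pathlib import Path


def torch_extensions_root():
    """Return the assumed root of PyTorch's JIT extension cache."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def inspect_op_builds(root, op_name="fused_adam"):
    """Return a list of (build_dir, has_shared_object) for each cached build."""
    root = Path(root)
    if not root.is_dir():
        return []
    results = []
    # Search recursively because the py310_cu121-style subfolder may or may
    # not be present, depending on how the cache root was configured.
    for build_dir in sorted(root.rglob(op_name)):
        if build_dir.is_dir():
            has_so = (build_dir / f"{op_name}.so").is_file()
            results.append((build_dir, has_so))
    return results


for build_dir, has_so in inspect_op_builds(torch_extensions_root()):
    status = "ok" if has_so else "missing .so (failed or incomplete build)"
    print(f"{build_dir}: {status}")
```

A build directory without the `.so` is left over from a build that never finished. Deleting that directory before retrying should force a clean rebuild.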

Attempted Solution (Unsuccessful)

pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed

Successful Solution

Clone DeepSpeed repository:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed

Install with specific build flags:

DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .

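Resolved CUDA and GCC version conflict (GCC 13 is newer than what nvcc from CUDA 12.2 supports as a host compiler):
- Lowered GCC version to 11.3
- Reinstalled DeepSpeed from the cloned repository: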
```
DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .
```
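As an alternative to lowering the system-wide GCC, the build can be pointed at an older compiler through the `CC` and `CXX` environment variables. The sketch below assembles that environment together with the build flags used above. The binary names `gcc-11` and `g++-11` are assumptions (the usual Ubuntu package names), and whether nvcc also receives `CC` as its host compiler depends on the PyTorch version, so this is a starting point rather than a guaranteed fix.

```python
import os
import shutil
import subprocess
import sys


def deepspeed_build_env(cc="gcc-11", cxx="g++-11", base_env=None):
    """Return a copy of the environment with the DeepSpeed build flags from
    this post plus CC/CXX set to an older GCC (binary names are assumptions)."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "DS_BUILD_UTILS": "1",
        "DS_BUILD_FUSED_ADAM": "1",
        "CC": cc,
        "CXX": cxx,
    })
    return env


def install_deepspeed(source_dir, dry_run=True):
    """Run `pip install .` inside a DeepSpeed checkout with pinned compilers.

    Returns the command and the list of compilers that were not found.
    Nothing is executed when dry_run is True or a compiler is missing.
    """
    env = deepspeed_build_env()
    missing = [tool for tool in (env["CC"], env["CXX"])
               if shutil.which(tool) is None]
    cmd = [sys.executable, "-m", "pip", "install", "."]
    if not dry_run and not missing:
        subprocess.run(cmd, cwd=source_dir, env=env, check=True)
    return cmd, missing


if __name__ == "__main__":
    # Dry run: only reports what would be executed.
    print(install_deepspeed("DeepSpeed", dry_run=True))
```

Calling `install_deepspeed("DeepSpeed", dry_run=False)` from the directory that contains the clone would perform the actual install.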

Outcome

After implementing the above solutions, the DeepSpeed installation was successful, and the LoRA fine-tuning code ran without errors. Hope that helps a little.
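A quick way to confirm the result before starting a long fine-tuning run is to load the fused Adam op directly. This sketch uses DeepSpeed's `FusedAdamBuilder` from `deepspeed.ops.op_builder`; the wrapper function and its status strings are made up for this example. The `ds_report` command that ships with DeepSpeed gives a similar overview from the shell.

```python
def check_fused_adam():
    """Return a short status string for DeepSpeed's fused Adam op."""
    try:
        from deepspeed.ops.op_builder import FusedAdamBuilder
    except ImportError:
        return "deepspeed is not importable in this environment"
    builder = FusedAdamBuilder()
    if not builder.is_compatible():
        return "fused_adam reported as incompatible with this system"
    try:
        # Uses the pre-built op if present, otherwise attempts a JIT build.
        builder.load()
    except Exception as exc:  # build and loader errors vary, so catch broadly
        return f"fused_adam failed to load: {exc}"
    return "fused_adam loaded"


print(check_fused_adam())
```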
