OpenBMB / MiniCPM-V

Troubleshooting for LoRA Fine-tuning of MiniCPM-V-2.5 -- ERR: (FAILED: multi_tensor_adam.cuda.o & fused_adam.so: cannot open shared object file: No such file or directory) #341

Status: Closed (closed by iwannabewater 1 month ago)

iwannabewater commented 1 month ago

Environment Details

Issue Description

I ran into two errors when using DeepSpeed for LoRA fine-tuning of MiniCPM-V-2.5.

Error 1: Ninja Build Failure (FAILED: multi_tensor_adam.cuda.o)

Error Message

FAILED: multi_tensor_adam.cuda.o
ninja: build stopped: subcommand failed.

Root Cause

PyTorch's C++ extension loader calls ninja to JIT-compile DeepSpeed's fused Adam op (multi_tensor_adam.cuda.o is one of its object files), and that ninja build step fails.

Solution

Modified the ninja command in the PyTorch utility script:

File: /envs/xxx/lib/python3.xx/site-packages/torch/utils/cpp_extension.py
Change: ['ninja', '-v'] to ['ninja', '--version']
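
Before editing cpp_extension.py, it can help to confirm that ninja is installed at all and what it reports. The sketch below uses only the Python standard library; the helper name tool_version is made up for illustration. One caveat: as far as I can tell, --version only makes ninja print its version and exit, so the change above stops the build step from failing but may also stop it from compiling anything, which would be consistent with the missing fused_adam.so in Error 2.

```python
import shutil
import subprocess
from typing import Optional


def tool_version(name: str, flag: str = "--version") -> Optional[str]:
    """Return the first line printed by `<name> <flag>`, or None if unavailable."""
    path = shutil.which(name)
    if path is None:
        return None  # the tool is not on PATH
    try:
        result = subprocess.run(
            [path, flag], capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    # Tools the extension build relies on; extend the tuple as needed.
    for tool in ("ninja", "gcc"):
        print(f"{tool}: {tool_version(tool) or 'not found'}")
```

If ninja shows up as "not found", installing it into the same environment is probably a safer first step than patching PyTorch.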

Error 2: Shared Object File Not Found (fused_adam.so)

Error Message

ImportError: /home/xxxxx/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
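
The path in this error points at PyTorch's JIT extension cache. One way to see whether a build actually produced a library is to scan that cache for build directories that contain a build.ninja file but no .so. This is only a sketch: it assumes the default cache location ~/.cache/torch_extensions (or TORCH_EXTENSIONS_DIR when that variable is set), and find_unbuilt_extensions is a hypothetical helper, not part of PyTorch or DeepSpeed.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_unbuilt_extensions(root: Optional[Path] = None) -> List[Path]:
    """List extension build dirs that have a build.ninja but no compiled .so."""
    if root is None:
        # Assumed default location; TORCH_EXTENSIONS_DIR overrides it when set.
        root = Path(os.environ.get(
            "TORCH_EXTENSIONS_DIR",
            Path.home() / ".cache" / "torch_extensions",
        ))
    if not root.is_dir():
        return []
    return sorted(
        ninja_file.parent
        for ninja_file in root.rglob("build.ninja")
        if not any(ninja_file.parent.glob("*.so"))
    )


if __name__ == "__main__":
    for build_dir in find_unbuilt_extensions():
        print(f"no shared object in: {build_dir}")
```

A directory listed here means the compile step never finished. Deleting it is a common way to force a clean rebuild on the next run.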

Attempted Solution (Unsuccessful)

pip uninstall deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed

Successful Solution

  1. Clone DeepSpeed repository:

    git clone https://github.com/microsoft/DeepSpeed.git
    cd DeepSpeed
  2. Install with specific build flags:

    DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .
  3. Resolve the CUDA and GCC version conflict:

    • Lower the GCC version to 11.3

    • Reinstall DeepSpeed:

      DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .
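
Since step 3 came down to the GCC version, a quick check before reinstalling can save a failed build. The sketch below reads the compiler version and compares it with a limit. The (11, 3) limit is simply the version that worked in this thread, not an official CUDA compatibility bound, and the helper names are illustrative.

```python
import subprocess
from typing import Optional, Tuple

# Assumption: 11.3 is the GCC version that worked here, not an official limit.
MAX_GCC = (11, 3)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '11.3.0' into (11, 3, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def gcc_version(compiler: str = "gcc") -> Optional[Tuple[int, ...]]:
    """Ask the compiler for its version; None if it cannot be run or parsed."""
    try:
        out = subprocess.run(
            [compiler, "-dumpfullversion", "-dumpversion"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
        return parse_version(out)
    except (OSError, subprocess.SubprocessError, ValueError):
        return None


def is_at_most(version: Tuple[int, ...], limit: Tuple[int, ...]) -> bool:
    """Compare only as many components as the limit specifies."""
    return version[:len(limit)] <= limit


if __name__ == "__main__":
    found = gcc_version()
    if found is None:
        print("gcc not found or version not readable")
    elif is_at_most(found, MAX_GCC):
        print(f"gcc {found} is within the assumed limit {MAX_GCC}")
    else:
        print(f"gcc {found} is newer than {MAX_GCC}; consider a lower version")
```

The right limit depends on the installed CUDA toolkit, so MAX_GCC should be adjusted to match what the toolkit's documentation lists.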

Outcome

After applying the fixes above, DeepSpeed installed successfully and the LoRA fine-tuning code ran without errors. Hope that helps a little.

qyc-98 commented 1 month ago

Thank you!