Closed rothfels closed 3 weeks ago
I was able to resolve this issue for the Phi3-Vision-Finetune repo by setting up a conda virtual environment with environment.yaml instead of venv/requirements.txt. This tells me that the segfault is coming from some bad combination of torch/cuda/deepspeed.
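One way to narrow down which torch/cuda/deepspeed combination differs between the venv and conda setups is to print the installed versions in each environment and compare them. The following is only a minimal sketch: the helper names `installed_versions` and `torch_cuda_version` are hypothetical, and the package list is an assumption based on the libraries named above.

```python
from importlib import metadata


def installed_versions(packages=("torch", "deepspeed")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def torch_cuda_version():
    """Return the CUDA version the installed torch build targets, or None."""
    try:
        import torch
    except ImportError:
        return None
    # torch.version.cuda is None for CPU-only builds.
    return torch.version.cuda


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
    print(f"torch CUDA build: {torch_cuda_version()}")
```

Running this once inside the venv and once inside the conda environment gives two short listings that can be diffed by eye.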
For this repo, you've included an environment.yaml, but when I run conda env create -f environment.yaml I get the following error:
Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 0.36.0 Requires-Python >=3.6,<3.10; 0.37.0 Requires-Python >=3.7,<3.10; 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.53.0 Requires-Python >=3.6,<3.10; 0.53.0rc1.post1 Requires-Python >=3.6,<3.10; 0.53.0rc2 Requires-Python >=3.6,<3.10; 0.53.0rc3 Requires-Python >=3.6,<3.10; 0.53.1 Requires-Python >=3.6,<3.10; 0.54.0 Requires-Python >=3.7,<3.10; 0.54.0rc2 Requires-Python >=3.7,<3.10; 0.54.0rc3 Requires-Python >=3.7,<3.10; 0.54.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement onnxruntime-genai-cuda==0.3.0 (from versions: 0.4.0)
ERROR: No matching distribution found for onnxruntime-genai-cuda==0.3.0
failed
CondaEnvException: Pip failed
@rothfels Thanks for the update. I think my environment got mixed up while I was testing, and that is what I exported to the yaml. I really appreciate your help with setting up the env. I'll merge the PR.
@2U1 no problem.
In addition to those changes, the conda environment can't initialize on ubuntu without a few more things:
I'm not sure about the second two, but the first is coming from the mistralrs-cuda dep. (Tbh I'm not even sure what that's for. Can it be removed?)
Either way, here was the rest of what I needed to do to set up ubuntu if you want to mention it in the README:
# Install rustc
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Verify rust installation
rustc --version
cargo --version
# Install pkg-config and openssl
sudo apt update
sudo apt install -y libssl-dev pkg-config
# Verify openssl installation
pkg-config --modversion openssl
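The verification steps above can also be run as a single check before creating the conda environment. This is a rough sketch, not part of the repo: `missing_tools` is a hypothetical helper, and the tool list simply mirrors the commands installed above.

```python
import shutil


def missing_tools(tools=("rustc", "cargo", "pkg-config")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

It only checks that the executables are on PATH; it does not confirm that the OpenSSL development headers from libssl-dev are installed.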
Sorry for the trouble; it can be removed. It was for serving, not for training. I will clean up the env file a bit more.
@rothfels I think it should work now. The env file was a messed-up version. I didn't realize it because the repos I made were working fine. Thanks for letting me know.
Thanks for fixing!
I'll close this issue, since the wrong environment yaml has been resolved.
I tried running the full fine-tuning script on an 8xH100 machine from Lambda Labs, but it errors with a segfault (code -11).
I cannot reproduce the failure running the same script on a 1xH100.
I'm able to produce the same segfault running the Phi3-V fine-tuning from https://github.com/2U1/Phi3-Vision-Finetune, but still only on an 8xH100 machine (no error on a 1x).
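For a segfault that only appears in multi-GPU runs, Python's standard `faulthandler` module can at least show which Python frame each worker was in when it received the fatal signal. The sketch below is an assumption about how one might wire it in, not something taken from the training script: `enable_crash_traces` and the `crash_logs` directory are hypothetical, and it assumes the launcher exports `RANK` for each worker process.

```python
import faulthandler
import os


def enable_crash_traces(log_dir="crash_logs"):
    """Dump a Python traceback to a per-rank file if the process hits a fatal signal."""
    # Assumption: the launcher sets RANK for each worker; fall back to "0" otherwise.
    rank = os.environ.get("RANK", "0")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"rank{rank}.log")
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traces()` at the top of the training entry point would leave one log per rank. The traceback only covers Python frames, so it points at the call that crashed rather than the native cause. Setting the environment variable `PYTHONFAULTHANDLER=1` enables the same handler without code changes, writing to stderr instead.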