alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0
3.07k stars 357 forks

Problem in building Alpa-modified Jaxlib. #956

Open Fonsifa opened 1 year ago

Fonsifa commented 1 year ago

To Reproduce: When I install Alpa from source and run the command below, several warnings appear. I don't know whether they are related to the error shown in the second screenshot.

python3 build/build.py --enable_cuda --dev_install --bazel_options=--override_repository=org_tensorflow=$(pwd)/../third_party/tensorflow-alpa

Screenshots: [image] [image: troubleshoot]

Lssyes commented 1 year ago

This bug is caused by the wrong version of libnccl. I solved it by reinstalling the correct version of libnccl and creating a new Python environment based on it.

Fonsifa commented 1 year ago

> This bug is caused by the wrong version of libnccl. I solved it by reinstalling the correct version of libnccl and creating a new Python environment based on it.

May I ask which exact versions of Python and libnccl you used? Thanks.

Lssyes commented 1 year ago

Yes: Python 3.8.13, GCC 7.5.0, NCCL libnccl.so.2.8.4.
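To compare your own environment against the versions listed above, a small script along these lines may help. It is only a sketch: the directories in SEARCH_DIRS and the helper names (parse_nccl_version, find_nccl_libraries, gcc_version) are assumptions made for illustration, not part of Alpa or NCCL, and it reads the NCCL version from the shared library's file name rather than querying the library itself.

```python
import glob
import os
import platform
import re
import shutil
import subprocess
import sys

# Assumed search locations; adjust for your system. The active Python
# environment's lib directory is included because conda can ship its own NCCL.
SEARCH_DIRS = [
    "/usr/lib/x86_64-linux-gnu",
    "/usr/lib64",
    "/usr/local/cuda/lib64",
    os.path.join(sys.prefix, "lib"),
]


def parse_nccl_version(path):
    """Return (major, minor, patch) from a name like libnccl.so.2.8.4, else None."""
    match = re.search(r"libnccl\.so\.(\d+)\.(\d+)\.(\d+)$", os.path.basename(path))
    return tuple(int(part) for part in match.groups()) if match else None


def find_nccl_libraries(search_dirs=SEARCH_DIRS):
    """Map each fully versioned libnccl file found to its parsed version."""
    found = {}
    for directory in search_dirs:
        for path in glob.glob(os.path.join(directory, "libnccl.so.*")):
            version = parse_nccl_version(path)
            if version is not None:
                found[path] = version
    return found


def gcc_version():
    """Return the output of `gcc -dumpversion`, or None if gcc is not on PATH."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(["gcc", "-dumpversion"], capture_output=True, text=True)
    return result.stdout.strip() or None


if __name__ == "__main__":
    print("python:", platform.python_version())
    # Note: some GCC builds print only the major version for -dumpversion.
    print("gcc:", gcc_version())
    for path, version in sorted(find_nccl_libraries().items()):
        print("nccl:", ".".join(map(str, version)), "at", path)
```

If more than one libnccl version is printed, the build may be picking up a different one than you expect, which would be consistent with the fix described in this thread.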

ertza commented 11 months ago

Hi, I am running into the same issue when building from source. I don't understand how the libnccl version affects the file-not-found error. Is there any other solution?

Fonsifa commented 11 months ago

> Hi, I am running into the same issue when building from source. I don't understand how the libnccl version affects the file-not-found error. Is there any other solution?

The mirror URL is written in one of the workspace files. It seems the file-not-found problem is not the cause of the error; the incorrect libnccl version is the main cause.