NVIDIA / JAX-Toolbox

Apache License 2.0

EFA Support #167

Open abhinavgoel95 opened 10 months ago

abhinavgoel95 commented 10 months ago

Hello team,

I have noticed that our JAX-Toolbox containers do not contain the dependencies needed to install EFA.

When following the instructions, I get the following error:

error while loading shared libraries: libefa.so.1: cannot open shared object file: No such file or directory

This is a problem because EFA (Elastic Fabric Adapter) is needed to maximize network bandwidth on AWS.

Could someone look into this? Thanks!

mjsML commented 10 months ago

@abhinavgoel95 please give @yhtang admin access to the AWS account so they can help out.

yhtang commented 10 months ago

This is a good first issue for @DwarKapex 😃

yhtang commented 10 months ago

Forwarding more offline comments from Abhinav:

To more succinctly summarize the problem:

Expected behavior:

root@ipp1-0274:/workspace# find / -name libefa*
/usr/lib/x86_64-linux-gnu/pkgconfig/libefa.pc
/usr/lib/x86_64-linux-gnu/libefa.so.1
/usr/lib/x86_64-linux-gnu/libefa.so.1.1.39.0
/usr/lib/x86_64-linux-gnu/libibverbs/libefa-rdmav34.so

Current behavior in ghcr.io/nvidia/pax:nightly-2023-07-18:

root@ipp1-0274:/# find / -name libefa*
root@ipp1-0274:/#
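
For anyone who wants to script this check inside a container, here is a minimal Python sketch of the same comparison. It is not part of JAX-Toolbox. The helper names `find_libs` and `can_load` are hypothetical, and the `/usr/lib` search root is an assumption based on the paths in the expected output above. `find_libs` roughly mirrors `find <root> -name 'libefa*'` (regular files and file symlinks only), and `can_load` asks the dynamic loader to resolve `libefa.so.1`, which is the step that fails in the error reported at the top of this issue.

```python
import ctypes
import fnmatch
import os


def find_libs(root="/usr/lib", pattern="libefa*"):
    """Return sorted paths under `root` whose file name matches `pattern`.

    Hypothetical helper: walks the tree without following directory
    symlinks and silently skips directories it cannot read.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def can_load(soname="libefa.so.1"):
    """Return True if the dynamic loader can open `soname`, else False."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False
    return True


if __name__ == "__main__":
    found = find_libs()
    print("\n".join(found) if found else "no libefa files found")
    print("libefa.so.1 loadable:", can_load())
```

Running the script in both images should show the difference at a glance: an image with EFA installed lists the `libefa` files, while an image without them prints the "no libefa files found" line.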