ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

torch.cuda.is_available() aborts after module loading omnitrace #336

Open R0n12 opened 5 months ago

R0n12 commented 5 months ago

Before loading omnitrace:

(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.hip
'5.6.31061-8c743ae5d'
>>> torch.cuda.is_available()
True

After loading omnitrace/1.10.4:

(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> module load omnitrace/1.10.4
Using ROCm installation: /opt/rocm-5.6.0
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> module list

Currently Loaded Modules:
  1) craype-x86-trento                       7) cce/15.0.0             13) darshan-runtime/3.4.0
  2) libfabric/1.15.2.0                      8) craype/2.7.19          14) hsi/default
  3) craype-network-ofi                      9) cray-dsmml/0.2.2       15) DefApps/default
  4) perftools-base/22.12.0                 10) cray-mpich/8.1.23      16) tmux/3.2a
  5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta  11) cray-libsci/22.12.1.1  17) rocm/5.6.0
  6) cray-pmi/6.1.8                         12) PrgEnv-cray/8.3.3      18) omnitrace/1.10.4

(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.hip
'5.6.31061-8c743ae5d'
>>> torch.cuda.is_available()
Aborted

PyTorch version: 2.1.2+rocm5.6.0; Omnitrace: 1.10.4

Is there something that needs to be checked first?

I have attached my rocminfo output (rocminfo.log). Note that the get_power_avg() functions are not supported on MI250X, so they are reported as an unsupported feature; however, outputting functions.json still hangs at the end.

Thanks in advance!

jrmadsen commented 5 months ago

> I have attached my rocminfo output, note that since on MI250X we don't support the get_power_avg() functions, it is reflected as an unsupported feature, however, outputting functions.json still hangs at the end. rocminfo.log

This was fixed in #331 and included in the v1.11.1 release.

However, I don't think this is related to your problem at all. Could you run module show on that omnitrace module, and maybe compare the environment before and after loading it? I suspect something is being changed in LD_LIBRARY_PATH and PYTHONPATH when that module gets loaded.
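
The before/after comparison could be done roughly as below. This is only a minimal sketch, not part of omnitrace: it assumes the environment was saved with something like `env > before.txt`, then `module load omnitrace/1.10.4`, then `env > after.txt`. The file names and the helpers `parse_env` and `diff_env` are hypothetical, and the list of path-like variables is an assumption that may need extending.

```python
import sys

# Variables treated as colon-separated search paths (assumed list).
PATH_VARS = ("LD_LIBRARY_PATH", "PYTHONPATH", "PATH")


def parse_env(text):
    """Parse `env` output into a dict. Multi-line values are not handled."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


def diff_env(before, after, path_vars=PATH_VARS):
    """Return {name: (added, removed)} for every variable that changed.

    For path-like variables, `added` and `removed` list individual entries.
    A pure reordering of entries shows up as two empty lists.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old == new:
            continue
        if name in path_vars:
            old_parts = [p for p in (old or "").split(":") if p]
            new_parts = [p for p in (new or "").split(":") if p]
            added = [p for p in new_parts if p not in old_parts]
            removed = [p for p in old_parts if p not in new_parts]
        else:
            added = [new] if new is not None else []
            removed = [old] if old is not None else []
        changes[name] = (added, removed)
    return changes


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python envdiff.py before.txt after.txt
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        result = diff_env(parse_env(f_before.read()), parse_env(f_after.read()))
    for name, (added, removed) in result.items():
        print(name)
        for entry in added:
            print("  +", entry)
        for entry in removed:
            print("  -", entry)
```

Any new entries prepended to LD_LIBRARY_PATH or PYTHONPATH are the first thing to look at, since they could shadow the libraries that the PyTorch environment expects.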