amd / openmm-hip


8.1.1 #14

Open ex-rzr opened 5 months ago

ex-rzr commented 5 months ago

This PR supersedes PRs #7, #8, #9. Closes #11, closes #12, closes #16.

Known issues:

Whoever has the opportunity, please build and run tests. I've also prepared a new conda package openmm-hip==8.1.1beta (https://anaconda.org/StreamHPC/openmm-hip/files). The package is built on ROCm 6.0.0; due to binary incompatibilities between 5.* and 6.*, it won't work on ROCm 5.*. I don't know if it's worth supporting old ROCm versions, or how to do it properly (upload packages with different labels like rocm-5.7 and rocm-6.0 so the user can choose the correct version?). I'm open to suggestions.
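For example, a labeled layout could be installed roughly like this (just a sketch; the label names rocm-5.7 and rocm-6.0 and the lower-case channel spelling are hypothetical, nothing like this exists on the channel yet):

    # hypothetical: pick the package matching the locally installed ROCm via a channel label
    conda install -c streamhpc/label/rocm-6.0 openmm-hip
    # or, on an older ROCm:
    conda install -c streamhpc/label/rocm-5.7 openmm-hip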

If everything is ok, we can merge it.

DanielWicz commented 5 months ago

Spec: RHEL 8.8, ROCm 6.0, Driver version: 3602.0 (HSA1.1, LC), GA: MI250

With OPENMM_USE_HIPRTC=False I only get the following error:

HIP platform error: Error launching HIP compiler: 256
clang++: error: cannot find HIP runtime; provide its path via '--rocm-path', or pass '-nogpuinc' to build without HIP runtime

With OPENMM_USE_HIPRTC=True, I get the following error:

OpenMM Version: 8.1.1
Git Revision: ec797acabe5de4ce9f56c92d349baa889f4b0821

There are 4 Platforms available:

1 Reference - Successfully computed forces
2 HIP - Error computing forces with HIP platform
3 CPU - Successfully computed forces
4 OpenCL - Successfully computed forces

HIP platform error: Error compiling program: /tmp/comgr-cbc8ab/input/CompileSourceiQBp71:443:13: error: use of overloaded operator '*' is ambiguous (with operand types 'float2' (aka 'HIP_vector_type<float, 2>') and 'float')
    return a*scale;
           ~^~~~~~
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2950:1: note: candidate function
COMPLEX_SCALAR_PRODUCT(hipFloatComplex, float)
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2830:40: note: expanded from macro 'COMPLEX_SCALAR_PRODUCT'
    __HOST_DEVICE__ static inline type operator*(const type& lhs, type1 rhs) {                 \
                                       ^
/tmp/comgr-cbc8ab/input/CompileSourceiQBp71:379:26: note: candidate function
inline __device__ float2 operator*(float2 a, float b) {
                         ^
/tmp/comgr-cbc8ab/input/CompileSourceiQBp71:458:13: error: use of overloaded operator '*' is ambiguous (with operand types 'double2' (aka 'HIP_vector_type<double, 2>') and 'double')
    return a*scale;

/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2975:1: note: candidate function
COMPLEX_SCALAR_PRODUCT(hipDoubleComplex, double)
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2830:40: note: expanded from macro 'COMPLEX_SCALAR_PRODUCT'
    __HOST_DEVICE__ static inline type operator*(const type& lhs, type1 rhs) {                 \
                                       ^
/tmp/comgr-cbc8ab/input/CompileSourceiQBp71:403:27: note: candidate function
inline __device__ double2 operator*(double2 a, double b) {
                          ^
/tmp/comgr-cbc8ab/input/CompileSourceiQBp71:540:13: error: use of overloaded operator '*' is ambiguous (with operand types 'float2' (aka 'HIP_vector_type<float, 2>') and 'float')
    return a*rsqrtf(a.x*a.x+a.y*a.y);
           ~^~~~~~~~~~~~~~~~~~~~~~~~
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2950:1: note: candidate function
COMPLEX_SCALAR_PRODUCT(hipFloatComplex, float)
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2830:40: note: expanded from macro 'COMPLEX_SCALAR_PRODUCT'
    __HOST_DEVICE__ static inline type operator*(const type& lhs, type1 rhs) {                 \
                                       ^
/tmp/comgr-cbc8ab/input/CompileSourceiQBp71:379:26: note: candidate function
inline __device__ float2 operator*(float2 a, float b) {
                         ^
/tmp/comgr-cbc8ab/input/CompileSourceiQBp71:552:13: error: use of overloaded operator '*' is ambiguous (with operand types 'double2' (aka 'HIP_vector_type<double, 2>') and 'double')
    return a*rsqrt(a.x*a.x+a.y*a.y);
           ~^~~~~~~~~~~~~~~~~~~~~~~
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2975:1: note: candidate function
COMPLEX_SCALAR_PRODUCT(hipDoubleComplex, double)

/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/rhel-8.8/8.8/build/hip-on-rocclr/hipamd/src/hiprtc/hip_rtc_gen/hipRTC_header.h:2830:40: note: expanded from macro 'COMPLEX_SCALAR_PRODUCT'
    __HOST_DEVICE__ static inline type operator*(const type& lhs, type1 rhs) {                 \
                                       ^
/tmp/comgr-cbc8ab/input/CompileSourceiQBp71:403:27: note: candidate function
inline __device__ double2 operator*(double2 a, double b) {
                          ^
ex-rzr commented 5 months ago

Thank you!

As I said, hipRTC in ROCm 6.0.0 has issues with builtin vector and complex types. According to the commit log they seem to be fixed; hopefully the fixes will be included in the next minor release of ROCm.

By the way, use OPENMM_USE_HIPRTC=1, not True. I don't know why it works at all, considering that the code explicitly checks whether the env variable is equal to 1, so both True and False should be treated as 0. Strange...
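For reference, the intended usage is simply the following (based on the check described above, which compares the variable against "1"):

    # enable hipRTC compilation; any value other than "1" is meant to disable it
    export OPENMM_USE_HIPRTC=1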

Regarding the first error (without hipRTC), I have no idea yet. It looks like the ROCm installation is not OK. Do other things work there? Can you provide the output of /opt/rocm/bin/hipconfig?

--

Update: As a workaround I suggest trying DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode (please check that this path exists and contains a bunch of .bc files).
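Something along these lines (adjust the path for a versioned install such as /opt/rocm-6.0.0):

    # confirm the device-library bitcode files are present
    ls /opt/rocm/amdgcn/bitcode/*.bc
    # point the HIP compiler at them explicitly
    export DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode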

bdenhollander commented 5 months ago

FWIW, hipRTC is also failing on Windows with the newly released 5.7.1 SDK when compiling vector operations. It worked fine with the 5.5.1 SDK. hipcc works properly in 5.7.1 and performance is similar to 5.5.1.

DanielWicz commented 5 months ago

/opt/rocm/bin/hipconfig

Contents of hipconfig in 6.0.0:

# Need perl > 5.10 to use logic-defined or
use 5.006; use v5.10.1;

use warnings;

use File::Basename;
use File::Spec::Functions 'catfile';

#TODO: By default select perl script until change incorporated in HIP build script
my $USE_PERL_SCRIPT = $ENV{'HIP_USE_PERL_SCRIPTS'};
$USE_PERL_SCRIPT //= 1;  # use defined-or assignment operator.  Use env var, but if not defined default to 1.
my $isWindows =  ($^O eq 'MSWin32' or $^O eq 'msys');
my $SCRIPT_DIR=dirname(__FILE__);
if ($USE_PERL_SCRIPT) {
    #Invoke hipconfig.pl
    my $HIPCONFIG_PERL=catfile($SCRIPT_DIR, '/hipconfig.pl');
    system($^X, $HIPCONFIG_PERL, @ARGV);
} else {
    $BIN_NAME="/hipconfig.bin";
    if ($isWindows) {
        $BIN_NAME="/hipconfig.bin.exe";
    }
    my $HIPCONFIG_BIN=catfile($SCRIPT_DIR, $BIN_NAME);
    if ( -e $HIPCONFIG_BIN ) {
        #Invoke hipconfig.bin
        system($HIPCONFIG_BIN, @ARGV);
    } else {
        print "hipconfig.bin not present; Install HIPCC binaries before proceeding";
        exit(-1);
    }
}

# Because of this wrapper we need to check
# the output of the system command for perl and bin
# else the failures are ignored and build fails silently
if ($? == -1) {
    exit($?);
}
elsif ($? & 127) {
    exit($?);
}
else {
     $CMD_EXIT_CODE = $? >> 8;
}
exit($CMD_EXIT_CODE);

For comparison, the working 5.3.0:

# Need perl > 5.10 to use logic-defined or
use 5.006; use v5.10.1;

use warnings;

use File::Basename;
use File::Spec::Functions 'catfile';

#TODO: By default select perl script until change incorporated in HIP build script
my $HIPCONFIG_USE_PERL_SCRIPT = 1;
my $isWindows =  ($^O eq 'MSWin32' or $^O eq 'msys');
my $SCRIPT_DIR=dirname(__FILE__);

if ($HIPCONFIG_USE_PERL_SCRIPT) {
    #Invoke hipconfig.pl
    my $HIPCONFIG_PERL=catfile($SCRIPT_DIR, '/hipconfig.pl');
    system($^X, $HIPCONFIG_PERL, @ARGV);
} else {
    $BIN_NAME="/hipconfig.bin";
    if ($isWindows) {
        $BIN_NAME="/hipconfig.bin.exe";
    }
    my $HIPCONFIG_BIN=catfile($SCRIPT_DIR, $BIN_NAME);
    if ( -e $HIPCONFIG_BIN ) {
        #Invoke hipconfig.bin
        system($HIPCONFIG_BIN, @ARGV);
    } else {
        print "hipconfig.bin not present; Install HIPCC binaries before proceeding";
        exit(-1);
    }
}

# Because of this wrapper we need to check
# the output of the system command for perl and bin
# else the failures are ignored and build fails silently
if ($? == -1) {
    exit($?);
}
elsif ($? & 127) {
    exit($?);
}
else {
     $CMD_EXIT_CODE = $? >> 8;
}
exit($CMD_EXIT_CODE);

A diff shows these differences between 5.3.0 and 6.0.0:

31,32c31
< my $USE_PERL_SCRIPT = $ENV{'HIP_USE_PERL_SCRIPTS'};
< $USE_PERL_SCRIPT //= 1;  # use defined-or assignment operator.  Use env var, but if not defined default to 1.
---
> my $HIPCONFIG_USE_PERL_SCRIPT = 1;
35c34,35
< if ($USE_PERL_SCRIPT) {
---
> 
> if ($HIPCONFIG_USE_PERL_SCRIPT) {

But the working 5.5.1 has no differences from 6.0.0, so I think the script itself should not be the problem, except that 6.0.0 and 5.7.1 (which has the same problem) may need to be modified.

With regard to

Update: As a workaround I suggest trying DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode (please check that this path exists and contains a bunch of .bc files).

I will give an answer within the next 30 minutes.

ex-rzr commented 5 months ago

No, I mean what hipconfig prints when you run it (OpenMM-HIP doesn't use hipconfig; I just want to be sure that your ROCm installation is not broken, because it looks like something is not right there).

DanielWicz commented 5 months ago

No, I mean what hipconfig prints when you run it (OpenMM-HIP doesn't use hipconfig; I just want to be sure that your ROCm installation is not broken, because it looks like something is not right there).

But how do I run hipconfig? Do I just run it as a sh script?

Update: As a workaround I suggest trying DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode (please check that this path exists and contains a bunch of .bc files).

When I set the variable as export DEVICE_LIB_PATH="/opt/rocm-6.0.0/amdgcn/bitcode", there's still an error:

HIP platform error: Error launching HIP compiler: 256
clang++: error: cannot find HIP runtime; provide its path via '--rocm-path', or pass '-nogpuinc' to build without HIP runtime

edit: when I run hipconfig as a command, it gives me:

Can't exec "/opt/rocm-6.0.0/hip/bin/hipcc": No such file or directory at /opt/rocm-6.0.0/bin//hipconfig.pl line 175.
Can't exec "/opt/rocm-6.0.0/hip/bin/hipcc": No such file or directory at /opt/rocm-6.0.0/bin//hipconfig.pl line 178.

The output in general is:

HIP version  : 5.7.0-0

== hipconfig
HIP_PATH     : /opt/rocm-6.0.0/hip
ROCM_PATH    : /opt/rocm-6.0.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.0.0/hip/include -I/lus/home/softs/rocm/6.0.0/lib/llvm/lib/clang/17.0.0

== hip-clang
HIP_CLANG_PATH   : /opt/rocm-6.0.0/llvm/bin
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.0 23483 7208e8d15fbf218deb74483ea8c549c67ca4985e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-6.0.0/llvm/bin
Configuration file: /lus/home/softs/rocm/6.0.0/lib/llvm/bin/clang++.cfg
AMD LLVM version 17.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver3

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : 
hip-clang-ldflags  : 

It looks like there's no "hip" directory at all. Can I change that with some env variable? The computing center will take weeks to repair it.

The 5.5.1 install has a hip/bin directory, and it looks like this: (screenshot of the directory listing)

The 5.7.1 install also has a hip/bin directory and gives the same error as 6.0.0; it looks like this: (screenshot of the directory listing)

The 6.0.0 install has no hip/bin. It has bin directly, and it looks like this: (screenshot of the directory listing)

But both 5.7.1 and 5.5.1 have this bin directory alongside hip/bin.

ex-rzr commented 5 months ago

Something is definitely not right with your ROCm. Perhaps it's worth uninstalling and installing it from scratch (of course, if you have admin privileges).

DanielWicz commented 5 months ago

Something is definitely not right with your ROCm. Perhaps it's worth uninstalling and installing it from scratch (of course, if you have admin privileges).

I opened a ticket to have it repaired. It's weird, because the company states "we did regression testing on the machine with the new ROCm and didn't find any issues".

DanielWicz commented 5 months ago

Managed to solve the problem with the administrators, but the graphics card drivers were updated. This resulted in the inability to run any ROCm older than 5.7.1, so the oldest I can run at the moment is 5.7.1.

In the comments where I mention a segfault, it refers to systems that are either missing from the graph or marked with 0 performance. Usually it is the amber benchmark with more than 1 GPU.

Here is the comparison of ROCm 5.7.1 (OpenMM 8.0 with OpenMM-HIP for 8.0) vs ROCm 6.0.0 (OpenMM 8.1.1 with OpenMM-HIP 8.1.1). The graphics card is an MI250, single GPU (so only 1/2 of the MI250 in fact).

MI250, 1 GPU (1/2 of the card): (benchmark charts)

I will update this post with different numbers of graphics cards as time goes on.

2 GPUs (full MI250): (benchmark charts)

Comment: With 2 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 segfaults on the whole amber suite with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown."

3 GPUs (3/4 MI250)

image image

Comment: With 4 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 segfaults on the whole amber suite with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown."

Comment: With 4 GPUs, OpenMM 8.1.1 / ROCm 6.0.0 segfaults with "Memory access fault by GPU node-4 (Agent handle: 0x562454718330) on address 0x562472e00000. Reason: Unknown."

4 GPUs (2 MI250): (benchmark charts)

Comment: With 4 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 segfaults on the whole amber suite with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown."

8 GPUs (4 full MI250): (benchmark charts)

Comment: With 8 GPUs, OpenMM 8.0.0 / ROCm 5.7.1 segfaults on the whole amber suite with "Memory access fault by GPU node-4 (Agent handle: 0x560656fbf4d0) on address 0x145113be5000. Reason: Unknown."

Comment: With 8 GPUs, OpenMM 8.1.1 / ROCm 6.0.0 segfaults with "Memory access fault by GPU node-4 (Agent handle: 0x562454718330) on address 0x562472e00000. Reason: Unknown.", generally on the STMV (virus) simulation.

ex-rzr commented 5 months ago

Thank you! I need to analyze the results, especially these crashes in multi-GPU simulations.

  1. Can you run tests?

    cd $CONDA_PREFIX/share/openmm/tests/
    ./test_openmm_hip.sh

    There is a chance that some of the tests may freeze occasionally. I suspect a bug in hipModuleLoad; I'm going to create a reproducer to prove/disprove it.

  2. The first chart, gbsa: is it reproducible? I didn't see anything strange on MI100 or V620.

egallicc commented 5 months ago

Hello. I tested the openmm-hip:develop_stream branch built against ROCm 5.6 and OpenMM 8.1.1 with ATMForce() on a system with two cards (RX 6750 XT + RX 6650 XT). It works great with good performance in production with the AToM-OpenMM 8.1.1 middleware.

I can't comment on the multi-GPU failures. AToM-OpenMM uses multiple GPUs, but in a distributed asynchronous mode rather than a parallel mode. Thanks.

ex-rzr commented 5 months ago

@egallicc Thank you!

@DanielWicz I benchmarked on multiple GPUs: on one server with 2 MI100s and on another with 2 V620s. The GPUs on both servers are connected via PCIe, not XGMI. All benchmarks were stable. I even tried running with --device=0,1,0,1, i.e. using each device twice; this was also stable.

I wonder if something may be wrong with your XGMI configuration. Could you post here what rocm-smi --showtopo reports? /opt/rocm/bin/rocm-bandwidth-test can also be useful (you will likely need to install the rocm-bandwidth-test package, though).
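For reference, both checks can be run directly from a shell:

    # show GPU-to-GPU weights, hop counts and link types (XGMI vs PCIe)
    rocm-smi --showtopo
    # measure device-to-device bandwidth; requires the rocm-bandwidth-test package
    /opt/rocm/bin/rocm-bandwidth-test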

ex-rzr commented 5 months ago

@jdmaia

Hi! Can you run tests and benchmarks on this branch? OpenMM's tag is 8.1.1, any GPUs, but MI200 and MI300 are more important because I don't have access to them.

I'm especially interested in multi-GPU stability:

python3 benchmark.py --verbose --style=table --platform=HIP --precision=single --device=0,1

(and --device=0,1,2,3 if you have more GPUs).

Thanks!

DanielWicz commented 5 months ago

@egallicc Thank you!

@DanielWicz I benchmarked on multiple GPUs: on one server with 2 MI100s and on another with 2 V620s. The GPUs on both servers are connected via PCIe, not XGMI. All benchmarks were stable. I even tried running with --device=0,1,0,1, i.e. using each device twice; this was also stable.

I wonder if something may be wrong with your XGMI configuration. Could you post here what rocm-smi --showtopo reports? /opt/rocm/bin/rocm-bandwidth-test can also be useful (you will likely need to install the rocm-bandwidth-test package, though).

I will retest the gbsa case. Generally there were some updates on our nodes and it now gets "1213.11 ns/day". But I still get these segmentation faults for multiple GPUs.

Here is the output of rocm-smi --showtopo:

============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           15           30           30           30           15           30           
GPU1   15           0            30           15           30           15           30           45           
GPU2   15           30           0            15           15           30           30           30           
GPU3   30           15           15           0            30           45           30           15           
GPU4   30           30           15           30           0            15           15           30           
GPU5   30           15           30           45           15           0            30           15           
GPU6   15           30           30           30           15           30           0            15           
GPU7   30           45           30           15           30           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            1            1            1            1            
GPU1   1            0            1            1            1            1            1            1            
GPU2   1            1            0            1            1            1            1            1            
GPU3   1            1            1            0            1            1            1            1            
GPU4   1            1            1            1            0            1            1            1            
GPU5   1            1            1            1            1            0            1            1            
GPU6   1            1            1            1            1            1            0            1            
GPU7   1            1            1            1            1            1            1            0         
=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 3
GPU[0]          : (Topology) Numa Affinity: 3
GPU[1]          : (Topology) Numa Node: 3
GPU[1]          : (Topology) Numa Affinity: 3
GPU[2]          : (Topology) Numa Node: 1
GPU[2]          : (Topology) Numa Affinity: 1
GPU[3]          : (Topology) Numa Node: 1
GPU[3]          : (Topology) Numa Affinity: 1
GPU[4]          : (Topology) Numa Node: 0
GPU[4]          : (Topology) Numa Affinity: 0
GPU[5]          : (Topology) Numa Node: 0
GPU[5]          : (Topology) Numa Affinity: 0
GPU[6]          : (Topology) Numa Node: 2
GPU[6]          : (Topology) Numa Affinity: 2
GPU[7]          : (Topology) Numa Node: 2
GPU[7]          : (Topology) Numa Affinity: 2
================================== End of ROCm SMI Log ===================================
DanielWicz commented 4 months ago

@jdmaia

Hi! Can you run tests and benchmarks on this branch? OpenMM's tag is 8.1.1, any GPUs, but MI200 and MI300 are more important because I don't have access to them.

I'm especially interested in multi-GPU stability:

python3 benchmark.py --verbose --style=table --platform=HIP --precision=single --device=0,1

(and --device=0,1,2,3 if you have more GPUs).

Thanks!

Why do you recommend single precision? Usually OpenMM uses mixed (single is not recommended in most cases).

ex-rzr commented 4 months ago

Why do you recommend single precision? Usually OpenMM uses mixed (single is not recommended in most cases).

No reason, except that single is the default precision of benchmark.py. But I assume you get crashes for all 3 precisions in multi-GPU, right?

DanielWicz commented 4 months ago

Why do you recommend single precision? Usually OpenMM uses mixed (single is not recommended in most cases).

No reason, except that single is the default precision of benchmark.py. But I assume you get crashes for all 3 precisions in multi-GPU, right?

I tried with single. I still get "Memory access fault by GPU node-4 (Agent handle: 0x55b40d3733a0) on address 0x14dd267e2000. Reason: Unknown." for multiple GPUs, both with the older version and with the newer version, although it happens less often with the newer one. Usually the systems that crash are the Amber ones, so it seems to be somehow related to memory allocation.

Maybe I should try some env variable related to memory allocation or PME?

Some people have reported a similar problem: https://github.com/pytorch/pytorch/pull/95810

Can I somehow graph VRAM usage over time? I have a strong suspicion that the VRAM is not released between runs inside the Python file.

Edit: When running the benchmarks in the same bash script in a different order, where the order is 1, 2, 3, 4, 5, 6, 7, 8 and each number signifies the NUMBER OF GPUs used, I get a different number of segmentation faults. For example, running in order 1, 2, 3, 4, 5, 6, 7, 8 gives me 2 segmentation faults, while running in order 8, 1, 2, 3, 4, 5, 6, 7 gives me 4 segmentation faults.
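A sketch of that kind of script (the device lists below are illustrative, not my exact script):

    # run the benchmark with an increasing number of GPUs; reorder the list to change the run order
    for devices in 0 0,1 0,1,2 0,1,2,3; do
        python3 benchmark.py --verbose --style=table --platform=HIP --precision=single --device=$devices
    done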

ex-rzr commented 4 months ago

I asked my colleague with access to MI200 to run multi-GPU benchmarks: 2 GPUs and 4 GPUs ran without issues. The system has ROCm 5.7. So I think that something is configured incorrectly on your system.

Did you try to run rocm-bandwidth-test? I expect that it will fail if XGMI does not work correctly.

Regarding your question about VRAM: you can use rocm-smi. To get the info continuously, use watch -n 1 rocm-smi.
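If you want a log you can plot later instead of a live view, something like this should work (assuming your rocm-smi supports the --showmeminfo option):

    # append a timestamped VRAM snapshot every second while the benchmark runs
    while true; do
        date >> vram.log
        rocm-smi --showmeminfo vram >> vram.log
        sleep 1
    done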

Even if there were VRAM leaks, they are unlikely to have caused the crash, considering that most of the benchmarks are very small and the MI200 has a lot of memory.

You can also try running with AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 AMD_LOG_LEVEL=4, but expect a flood of log messages unless the crash happens at the very beginning of the simulation. The last lines may help understand what happens.
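For example, combined with the benchmark command from above (the log file name is just a suggestion):

    AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 AMD_LOG_LEVEL=4 \
        python3 benchmark.py --verbose --style=table --platform=HIP --device=0,1 2>&1 | tee hip-debug.log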

ex-rzr commented 3 months ago

@jdmaia Did you have a chance to run tests and benchmarks? I think the PR should be merged.