amd / openmm-hip


HIP Kernel Compiler Issue #18

Open nikhil-tensorwave opened 3 weeks ago

nikhil-tensorwave commented 3 weeks ago

When building OpenMM-HIP and running make test, I get HIP compiler errors of the following type:

Error creating kernel <kernel function name>: hipErrorNotFound (500)

I'm also getting

Error launching HIP compiler: 256

Runtime environment: ROCm 6.1.1, Ubuntu 22.04, Python 3.10, PyTorch 2.4.0

These were the setup steps used:

## build openmm
git clone https://github.com/openmm/openmm.git
cd openmm
git checkout 8.1.1
mkdir -p build/install
cd build
cmake ../ -D CMAKE_INSTALL_PREFIX=./install -D PYTHON_EXECUTABLE=/usr/bin/python3 -D OPENMM_BUILD_COMMON=ON -D OPENMM_PYTHON_USER_INSTALL=OFF -D CMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0"
make -j128
make test
make install
cd ../..

## build openmm-hip
git clone https://github.com/amd/openmm-hip.git
cd openmm-hip
git checkout mi300_changes  # necessary for ROCm 6.0!
mkdir build && cd build
cmake ../ -D OPENMM_DIR=../../openmm/build/install -D OPENMM_SOURCE_DIR=../../openmm -D CMAKE_INSTALL_PREFIX=../../openmm/build/install -D CMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG -D_GLIBCXX_USE_CXX11_ABI=0"
make -j128
make test  # these mostly fail with above errors
ctest -j 128 --rerun-failed  # if you keep rerunning them, more and more pass

Each time the failed tests are rerun, a small additional percentage of them passes.

Any help on this would be appreciated.

ex-rzr commented 3 weeks ago

I've never seen such errors.

Can you check with this branch https://github.com/amd/openmm-hip/pull/14 (https://github.com/StreamHPC/openmm-hip/tree/develop_stream)?

Also, can you run ctest without -j (ctest --output-on-failure)? Perhaps something is wrong with concurrent compilation/running.
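The serial run suggested here could be wrapped as follows. This is only a sketch: the function name run_tests_serially and the default build directory are assumptions for illustration, not part of openmm-hip. It relies on ctest running tests one at a time when no -j option is given.

```shell
# Hypothetical helper: run the test suite serially from a given build directory.
# Without -j, ctest runs one test at a time, which avoids concurrent kernel
# compilation as a possible source of the failures.
run_tests_serially() {
    build_dir="${1:-build}"   # assumed default; pass the real build path
    if [ ! -d "$build_dir" ]; then
        echo "build directory not found: $build_dir" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left unchanged.
    (cd "$build_dir" && ctest --output-on-failure)
}

# Usage (not run here):
#   run_tests_serially openmm-hip/build
```

If the tests pass when run this way but fail under ctest -j 128, that would point towards a concurrency problem rather than a problem with the kernels themselves.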

Btw, what GPUs do you use?

nikhil-tensorwave commented 3 weeks ago

Switching to that branch and adding gfx942 to the list of GPU architectures fixed the issue! Thank you very much for the help. Also, we're running on MI300X GPUs.
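For anyone checking which architecture name applies to their own hardware, here is a minimal sketch that pulls the gfx target names out of rocminfo output. The helper name list_gfx_archs is hypothetical, and the sketch assumes rocminfo prints names such as gfx942 somewhere in its output. How the name is then added to the openmm-hip architecture list depends on the branch's build configuration and is not shown here.

```shell
# Hypothetical helper: print the unique gfx architecture names found on stdin.
list_gfx_archs() {
    # -o prints only the matched part, so any line that mentions a gfx target
    # contributes just the target name; sort -u removes duplicates.
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Usage on a machine with ROCm installed (not run here):
#   rocminfo | list_gfx_archs
```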