amd / openmm-hip

Port large blocks enhancement #11

Open bdenhollander opened 7 months ago

bdenhollander commented 7 months ago

OpenMM 8.1 has a performance improvement for larger systems: https://github.com/openmm/openmm/pull/4147.

I attempted to port it myself but was unsuccessful. findInteractingBlocks.hip has diverged a lot from findInteractingBlocks.cu, and it has also been modified to execute as a flat kernel. Is there any chance this enhancement could be ported to HIP?

ex-rzr commented 7 months ago

I did some experiments with porting these changes some time ago (including half precision for block sizes), but I couldn't see any measurable difference in performance on CDNA (MI100 and MI210) for the standard benchmarks. I had no opportunity to benchmark on RDNA; perhaps RDNA will see a speedup. IIRC findBlocksWithInteractions is relatively less significant on CDNA than on RDNA: the ratio of findBlocksWithInteractions duration to computeNonbonded duration is about 1/4 for cellulose and 1/3 for stmv on CDNA, versus 1/3 for cellulose and 1/2 for stmv on RDNA. Perhaps this explains why I didn't notice improvements on CDNA: a faster findBlocksWithInteractions is offset by a slower computeNonbonded (due to less optimal neighbor lists?).
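
As a rough back-of-envelope illustration of that trade-off (plain Python): the duration ratios below are the ones quoted above, while the assumed 30% findBlocksWithInteractions speedup and 5% computeNonbonded slowdown are purely illustrative, not measured values.

# Rough Amdahl-style estimate of the large-blocks trade-off.
# The find/compute duration ratios come from the profiling numbers above;
# the 30% speedup and 5% slowdown figures are made up for illustration.

def net_change(find_vs_compute, find_speedup=0.30, compute_slowdown=0.05):
    """Relative change of (findBlocks + computeNonbonded) time with large blocks."""
    t_find, t_compute = find_vs_compute, 1.0
    before = t_find + t_compute
    after = t_find * (1.0 - find_speedup) + t_compute * (1.0 + compute_slowdown)
    return (after - before) / before

for name, ratio in [('cellulose, CDNA', 1/4), ('stmv, CDNA', 1/3),
                    ('cellulose, RDNA', 1/3), ('stmv, RDNA', 1/2)]:
    print(f'{name}: {net_change(ratio):+.1%}')

With these made-up numbers the estimated net gain grows with the duration ratio, which is consistent with expecting a larger effect on RDNA.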

Now that I have access to RDNA, let me check how it performs there. I'll also push a new branch which you can try on your GPU.

Btw, do you have a script for those synthetic simulations (water boxes)? I couldn't find it in my files though I'm sure I tried it before.

bdenhollander commented 7 months ago

There is definitely a trade-off, as mentioned in the comment in the source code. The cutoffs of 90K (CUDA) and 100K (OpenCL) atoms seemed correct at first, but I later noticed that 250K might be better for a wider range of simulations.

Here's the water box script:

from openmm import *
from openmm.app import *
from openmm.unit import *
import sys

# Build a cubic water box of the requested width and run a short benchmark.
width = float(sys.argv[1])   # box width in nm
steps = int(sys.argv[2])     # number of MD steps
ff = ForceField('tip3pfb.xml')
modeller = Modeller(Topology(), [])
modeller.addSolvent(ff, boxSize=Vec3(width, width, width)*nanometers)
system = ff.createSystem(modeller.topology, nonbondedMethod=PME, nonbondedCutoff=1.0*nanometers)
print(f'{width} {system.getNumParticles()}')
integrator = LangevinMiddleIntegrator(300*kelvin, 1.0/picosecond, 0.004*picoseconds)
platform = Platform.getPlatformByName('HIP')
simulation = Simulation(modeller.topology, system, integrator, platform)
simulation.reporters.append(StateDataReporter(sys.stdout, int(steps/5), step=True, elapsedTime=True, speed=True))
simulation.context.setPositions(modeller.positions)
simulation.minimizeEnergy(tolerance=100*kilojoules_per_mole/nanometer)
simulation.step(steps)

I call it from a script like this:

python run-water-box.py 3 5000
python run-water-box.py 4 5000
python run-water-box.py 5 5000
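
The same sweep can also be driven from Python; a minimal sketch, assuming the script above is saved as run-water-box.py in the current directory:

import subprocess
import sys

# Run the water box benchmark for several box widths (nm), 5000 steps each.
for width in (3, 4, 5):
    subprocess.run([sys.executable, 'run-water-box.py', str(width), '5000'],
                   check=True)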

ex-rzr commented 7 months ago

I have pushed the current stage of the port: https://github.com/StreamHPC/openmm-hip/commits/large-blocks. Currently large blocks are used for all numbers of atoms (so ignore regressions in small systems).

I collected benchmarking and profiling results for all standard benchmarks and the water box test, and I don't understand these results yet:

  1. amber20-cellulose is slightly faster on all 3 GPUs.
     findBlocksWithInteractions' duration is almost the same: +3% on MI100, +22% on MI210, -9% on V620 (RDNA).
  2. amber20-stmv is slower with large blocks.
     findBlocksWithInteractions takes much longer: +93% on MI100, +128% on MI210, +57% on V620.
  3. BUT large systems in the water box test become significantly faster.
     findBlocksWithInteractions is significantly shorter; for width 30 the change reaches an enormous -72% on MI100, -58% on MI210, -37% on V620.

I understand that the water box is a very "synthetic" system, but I didn't expect such a big difference in behavior. I'll try to investigate it further; perhaps some of my earlier changes conflict with this one. There is also a chance that the old kernel was already perfect, but I doubt it :) I'm still not 100% sure that I didn't miss something.

(Attached benchmark/profiling screenshots: 1.png, 2.png)

If possible, please run it on your 6600. I'm also interested in the performance of FAH projects. Do you have instructions for how I can run them on my GPUs?

bdenhollander commented 7 months ago

Gains on large systems are definitely smaller than expected compared to OpenCL on my RX 6600 on Windows 10. I suspect this is partially due to the flat kernel: with HIP there is less of a long tail of declining utilization than with OpenCL, so it takes a lot more atoms to make the extra computation of the large blocks worthwhile.

My results are in line with yours for 250K+ atoms, except that I didn't see a decline in amber20-stmv.

Benchmark           Atoms     Original (ns/day)   Large Blocks (ns/day)   Ratio
pme                 23558     688.724             663.377                 0.9632
apoa1pme            92224     170.909             176.629                 1.0335
amber20-cellulose   408609    40.4076             41.9038                 1.0370
amber20-stmv        1067095   12.512              13.4015                 1.0711

Water boxes

Width (nm)   Atoms     Original (ns/day)   Large Blocks (ns/day)   Ratio
4            6282      1080                1100                    1.0185
5            12255     652                 648                     0.9939
6            21384     620                 708                     1.1419
7            33840     428                 481                     1.1238
8            50442     299                 302                     1.0100
9            72279     220                 220                     1.0000
10           98880     163                 164                     1.0061
11           131445    131                 123                     0.9389
15           335025    48.2                48.8                    1.0124
20           792450    17.9                18.8                    1.0503
25           1550160   6.87                8.32                    1.2111
30           2682600   3.66                4.36                    1.1913
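
As a quick sanity check of the atom-count limit discussed earlier, a sketch that uses only the (atoms, ratio) pairs from the water box table above and reports the smallest size after which large blocks never lose on this GPU:

# Smallest system size after which large blocks never lose on this GPU,
# using the (atoms, ratio) pairs copied from the water box table above.
data = [(6282, 1.0185), (12255, 0.9939), (21384, 1.1419), (33840, 1.1238),
        (50442, 1.0100), (72279, 1.0000), (98880, 1.0061), (131445, 0.9389),
        (335025, 1.0124), (792450, 1.0503), (1550160, 1.2111), (2682600, 1.1913)]

crossover = next(atoms for i, (atoms, _) in enumerate(data)
                 if all(r >= 1.0 for _, r in data[i:]))
print(f'Large blocks win consistently from ~{crossover} atoms upward')

On this data the crossover lands between 131K and 335K atoms, roughly in line with the 250K figure mentioned above.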

bdenhollander commented 7 months ago

For running FAH projects, I have the following script:

from simtk import openmm, unit
import time
import os

nsteps = 5000
runs = ['12271']
runs.sort()

platform = openmm.Platform.getPlatformByName('HIP')
#platform.setPropertyDefaultValue('DisablePmeStream', '1')
print(platform.getOpenMMVersion())
platform.setPropertyDefaultValue('Precision', 'mixed')

def load(run, filename):    
    with open(os.path.join(run, filename), 'rt') as infile:
        return openmm.XmlSerializer.deserialize(infile.read())

for run in runs:
    folder = run + "/01/"

    system = load(folder, 'system.xml')
    state = load(folder, 'state.xml')
    integrator = load(folder, 'integrator.xml')

    context = openmm.Context(system, integrator, platform)
    context.setState(state)

    integrator.step(10)
    state = context.getState(getEnergy=True)

    # print("Start simulation")
    initial_time = time.time()
    integrator.step(nsteps)
    state = context.getState(getEnergy=True)
    elapsed_time = (time.time() - initial_time) * unit.seconds
    time_per_step = elapsed_time / nsteps
    ns_per_day = (nsteps * integrator.getStepSize()) / elapsed_time / (unit.nanoseconds/unit.day)

    print(f'{run} {system.getNumParticles()} particles : {ns_per_day:.4f} ns/day')

I collected various projects over the years by doing the following:

  1. Allow the client to download a project for a GPU slot and start running.
  2. Pause the slot.
  3. Create a directory where you're going to store work units and add a folder corresponding to the project number that was downloaded.
  4. Go to the FAH Data Directory and find the correct folder in the work directory based on the Work Queue ID.
  5. Copy the 01 subdirectory to the project number folder created in step 3.
  6. If the state or system files have .bz2 extensions, extract them so you have state.xml and system.xml (see the sketch after this list).
  7. Add the project number into the runs variable in script.
  8. Run python run-core-xml.py
  9. Resume folding so the work unit completes and you get a new project. Then repeat the process.
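
Step 6 can also be scripted; a minimal sketch using Python's bz2 module, with the folder layout described above (e.g. 12271/01 containing state.xml.bz2 and system.xml.bz2):

import bz2
import os

def extract_bz2(folder):
    """Decompress state.xml.bz2 / system.xml.bz2 so the benchmark script finds plain XML files."""
    for name in ('state.xml', 'system.xml'):
        src = os.path.join(folder, name + '.bz2')
        dst = os.path.join(folder, name)
        if os.path.exists(src) and not os.path.exists(dst):
            with bz2.open(src, 'rb') as fin, open(dst, 'wb') as fout:
                fout.write(fin.read())

extract_bz2('12271/01')  # example project/run folder, matching the script above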

ex-rzr commented 7 months ago

I've pushed a final (so far) commit. The limit is unchanged (90000); I'm not sure how to tune it, considering how much it depends on the parameters of the simulation, the GPU (its number of compute units), etc. I also set NUM_TILES_IN_BATCH to 2 when large blocks are used; as I said in the comments, my hypothesis is "when numTilesInBatch warps process one block1, there is a smaller chance to have a few warps with extremely long durations", though I haven't verified it yet. This looks like a good compromise: on all 3 GPUs I observe no regressions in the standard benchmarks, minor improvements for cellulose and stmv, and significant improvements for apoa1. The water box benchmarks are a bit slower than with NUM_TILES_IN_BATCH = 1. I wonder how it performs on your 6600.

There is one idea I want to check: large blocks cannot be used with blocks sorted by size (which is essential for good performance without large blocks). But what if we sort the large blocks by size? Blocks within each large block stay unsorted.

bdenhollander commented 7 months ago

This version is throwing NaNs more often on my RX 6600: openmm.OpenMMException: Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan. The previous version did this once on the water boxes, while this one did for 4 out of 12 water boxes on the first attempt. There might be some uninitialized memory somewhere. I had noticed in the past that StreamHPC's develop branch was a bit less stable than the main AMD repo, but I thought it was a fluke since random number generators are in use. When I do get results, they are a bit lower than with the previous version.

I would wait for the outcome of https://github.com/openmm/openmm/issues/4334 to see if there's a way to improve sorting in combination with large blocks.

ex-rzr commented 7 months ago

Thanks! I'll debug it, because that is definitely not right. By "the previous version" do you mean the previous version of the large-blocks branch or of the develop_stream branch? Did you see NaNs before? What is your ROCm version?

-- Update: what about tests? Do they pass?

bdenhollander commented 7 months ago

"Previous version" is the first implementation of large blocks. The first time I benchmarked it there was a NaN on one of the water boxes. A few months ago I tested the develop_stream branch and overall it just seemed less stable. I don't recall exactly what the issues were.

This is on Windows, so only one version of ROCm is available: 5.5.1, which is based on 5.5.0. I'm hoping ROCm 6.0 for Windows is included in AMD's December 6 event.

hipcc.bin.exe --version
HIP version: 5.5.0-3e8db564
clang version 17.0.0 (git@github.amd.com:Compute-Mirrors/llvm-project e3201662d21c48894f2156d302276eb1cf47c7be)
Target: x86_64-pc-windows-msvc
Thread model: posix
InstalledDir: C:\AMD\ROCm\5.5\bin

Single precision test failures are sporadic. TestHipFFTImplHipFFTSingle and TestHipVariableLangevinIntegratorSingle fail consistently in Release builds.

Run 1
24 - TestHipFFTImplHipFFTSingle (Failed)
45 - TestHipVariableLangevinIntegratorSingle (Failed)
51 - TestHipAmoebaMultipoleForceSingle (Failed)
56 - TestHipRpmdSingle (Failed)
59 - TestHipDrudeNoseHooverSingle (Failed)

Run 2
24 - TestHipFFTImplHipFFTSingle (Failed)
30 - TestHipLangevinIntegratorSingle (Failed)
45 - TestHipVariableLangevinIntegratorSingle (Failed)

Run 3
24 - TestHipFFTImplHipFFTSingle (Failed)
45 - TestHipVariableLangevinIntegratorSingle (Failed)
46 - TestHipVariableVerletIntegratorSingle (Failed)
60 - TestHipDrudeSCFIntegratorSingle (Failed)

TestHipVariableLangevinIntegratorSingle passed a few times in Debug mode.

TestHipFFTImplHipFFTSingle fails on the first testTransform in Debug mode.

realToComplex: 0 xsize: 28 ysize: 25 zsize: 30
exception: Assertion failure at TestHipFFTImplHipFFT.cpp:108.  Expected 10545.2, found 173.451

ex-rzr commented 6 months ago

I ran standard benchmarks for 600 seconds and water box for 50 000 steps: no NaNs. MI100 (CDNA, gfx908) and V620 (RDNA, gfx1030), ROCm 5.7, Linux.

Can it be that the failures are caused by the runtime/driver? I hope the new version of ROCm for Windows will be released soon so we can see if it's related to these stability issues.

Debug mode

This is interesting. But kernels are always compiled with -O3; hipcc adds it by default unless a different optimization level is passed. Is it something in the host code? Or perhaps it's a synchronization issue? Does it pass with AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3?

set AMD_SERIALIZE_KERNEL=3
set AMD_SERIALIZE_COPY=3
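
If it's easier to check the water box NaNs from the Python side, the same variables can be set in os.environ before openmm is imported; a minimal sketch, assuming the HIP runtime reads them when it initializes:

import os
# Must be set before openmm (and hence the HIP runtime) is loaded.
os.environ['AMD_SERIALIZE_KERNEL'] = '3'
os.environ['AMD_SERIALIZE_COPY'] = '3'

import openmm  # imported only after the environment is set
platform = openmm.Platform.getPlatformByName('HIP')
print(openmm.Platform.getOpenMMVersion())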

(Meanwhile I'm experimenting with https://github.com/openmm/openmm/issues/4343, but it seems I also need to port one of the changes in CudaSort: the new keys are even less uniform than the old block sizes, and this makes the sorting kernel a bottleneck.)

ex-rzr commented 6 months ago

I've pushed to https://github.com/StreamHPC/openmm-hip/commits/large-blocks the recent changes for large blocks (https://github.com/openmm/openmm/pull/4343 and https://github.com/openmm/openmm/pull/4351).

These changes provide a significant improvement for large systems and no or negligible regressions for small systems (tested on gfx908 and gfx1030, ROCm 5.7). For example, stmv reaches 39 ns/day on MI100 and 38 ns/day on V620. I'm also going to check how the recently released ROCm 6.0 performs (likely next weekend if I have time).

Regarding the stability issues: benchmarks and tests run without NaNs. However, sometimes I observe that a test may fail or hang. For example, these tests hang on MI100 every time:

    150 - TestHipAmoebaGeneralizedKirkwoodForceDouble (Timeout)
    153 - TestHipAmoebaMultipoleForceDouble (Timeout)

But Single and Mixed pass.

TestHipNonbondedForceDouble sometimes hangs on V620, and this happens here:

:3:hip_module.cpp           :51  : 312586612980 us: [pid:1088244 tid:0x7f688867dd00]  hipModuleLoad ( 0x7ffe3be0a8a8, /home/rocm-user/openmm/tmp/dbcaa5a33e9ab019333ed3ab227a39668c41582d_gfx1030 ) 
:3:devprogram.cpp           :2684: 312586613339 us: [pid:1088244 tid:0x7f688867dd00] Using Code Object V5.

I.e. the hipModuleLoad call doesn't return. It's always this particular file, even if I remove it and run twice: once to compile it and once to load it from the cache again. Sometimes it works, sometimes it hangs. It seems like a runtime issue; I wonder what triggers it. Again, maybe ROCm 6.0 does not have this issue.