Different performance from test WU run on FAH and on openMM

ThWuensche commented 4 years ago

Over the last few days I have been experimenting with the AMD HIP port of openMM, using FAH test WUs from the 17102 test project on my Radeon VIIs. I compared the ns/day results from the FAH benchmark with the ns/day values for the same systems run locally on openmm master (7.5) with the HIP platform, and on openmm 7.4.2, which corresponds to the branch used in FAH core22.

It is expected that the results with the HIP platform differ from those with the OpenCL platform. However, I have also seen a performance difference (of about 10%) when comparing the ns/day values reported by FAH with the ns/day values from a local run of the same system in openMM.

For example RUN10:

FAH benchmark results 15ns/day
openMM HIP 13.2ns/day
openMM OpenCL 12.2ns/day

Or, for example, RUN13:

FAH benchmark results 51ns/day
openMM HIP 47.1ns/day
openMM OpenCL 42.9ns/day

So it seems the results on openMM OpenCL are 10%-20% lower than those on FAH, which I don't understand. I would expect the opposite, since the runs on FAH also include checkpoints.

It would be good to understand the differences and to get similar results from OpenCL runs in FAH and OpenCL runs on a local openMM. As long as there are significant differences, a new approach cannot be benchmarked meaningfully until it has been integrated into a new FAH core, which is a big effort. Being able to run benchmarks in advance directly in openMM would help to analyse the performance effects of different changes.

This is the script I used, derived from the script to generate the 17101 (and probably 17102) test WUs:

from simtk import openmm, unit
import time
import os

template = """
<config>
 <numSteps v="{numSteps}"/>  
 <xtcFreq v="{xtcFreq}"/>
 <checkpointFreq v="{checkpointFreq}"/>
 <precision v="mixed"/>
 <xtcAtoms v="solute"/> 
</config>

"""

nsteps = 50000
wu_duration = 10*unit.minutes
ncheckpoints_per_wu = 4

from glob import glob
runs = glob('RUNS/17102*')
runs.sort()

platform = openmm.Platform.getPlatformByName('HIP')
print(platform.getOpenMMVersion())
platform.setPropertyDefaultValue('Precision', 'mixed')

def load(run, filename):    
    with open(os.path.join(run, filename), 'rt') as infile:        
        return openmm.XmlSerializer.deserialize(infile.read())

for run in runs:
    run = run + "/01/"
    print(run)

    # Read core.xml
    coredata = dict()    
    coredata['checkpointFreq'] = 0 #int(nsteps_per_wu / ncheckpoints_per_wu)
    coredata['numSteps'] = 0 #ncheckpoints_per_wu * coredata['checkpointFreq']
    coredata['xtcFreq'] = 0 #coredata['numSteps']

    system = load(run, 'system.xml')
    state = load(run, 'state.xml')
    integrator = load(run, 'integrator.xml')

    context = openmm.Context(system, integrator, platform)
    context.setState(state)

    initial_time = time.time()
    integrator.step(nsteps)
    state = context.getState()
    elapsed_time = (time.time() - initial_time) * unit.seconds
    time_per_step = elapsed_time / nsteps
    ns_per_day = (nsteps * integrator.getStepSize()) / elapsed_time / (unit.nanoseconds/unit.day)
    nsteps_per_wu = int(wu_duration / time_per_step)

    print(f'{run} {system.getNumParticles()} particles : {ns_per_day:.1f} ns/day : {coredata}')
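
As a sanity check on the ns/day line in the script, the same conversion can be written with plain numbers instead of unit objects. This is only a sketch: the function name `ns_per_day` is made up for illustration, and the 2 fs step size and 60 s elapsed time in the usage line are assumed values, not figures from these runs (the real step size comes from integrator.xml).

```python
def ns_per_day(nsteps, step_size_fs, elapsed_seconds):
    """Convert a timed run into simulated nanoseconds per wall-clock day."""
    simulated_ns = nsteps * step_size_fs / 1e6  # 1 ns = 1e6 fs
    seconds_per_day = 86400
    return simulated_ns * seconds_per_day / elapsed_seconds


# Assumed example values: 50000 steps of 2 fs that took 60 s of wall-clock time.
print(f"{ns_per_day(50000, 2.0, 60.0):.1f} ns/day")
```

Because elapsed time sits in the denominator, any initialization that is counted as part of the timed interval lowers the reported ns/day.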

ThWuensche commented 4 years ago

These are my benchmark results (based on the PantherX nVidia charts). Unfortunately, the charts disappeared for the first runs:

FahCore_22 Benchmarking Charts_Radeon7.xlsx

ThWuensche commented 4 years ago

@peastman Peter, sorry for bothering you, but I'm out of ideas as to why the same system, executed on a local build of openMM, is in all but one case about 10% slower than when executed through F@H. (I'm concluding that from the ns/day information in the F@H benchmark stats compared to the output of the benchmark generation script from @jchodera; I hope those figures are comparable.) The only exception is a system with a very low atom count, where the AMD Radeon VII performs extremely badly anyway. I've looked through the parameters in CMakeCache to check whether any optimization or debug settings could make a difference, but didn't find any.

Do you have an idea what could cause the difference and where I should search further? On the local build the system is executed through a Python script (probably unlike in F@H core22), but since most of the work should be executed within openMM, that should probably not account for the difference. Or am I wrong in that assumption?

peastman commented 4 years ago

So far as I know, they ought to be completely comparable. Your script pretty much looks fine. I would make just a few minor changes. I would add

integrator.step(10)
state = context.getState(getEnergy=True)

before you begin timing. As you currently have it, a lot of initialization (including compiling kernels) happens after you start timing. This will make sure all initialization is finished before then. I would also add getEnergy=True to the getState() call at the end of timing. If you don't ask it to send any data back from the GPU, it may return even if some kernels haven't finished executing yet.
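
Put together, the two changes might look like the sketch below. The helper name `timed_steps` and its `warmup_steps` parameter are made up for illustration and are not part of OpenMM; `context` and `integrator` are the objects the script above already creates.

```python
import time


def timed_steps(context, integrator, nsteps, warmup_steps=10):
    """Return wall-clock seconds spent on nsteps, excluding initialization.

    timed_steps and warmup_steps are illustrative names, not OpenMM API.
    """
    # Warm-up: the first steps trigger kernel compilation and other setup.
    integrator.step(warmup_steps)
    # Requesting the energy makes the call wait for data from the GPU,
    # so initialization is finished before the clock starts.
    context.getState(getEnergy=True)

    start = time.perf_counter()
    integrator.step(nsteps)
    # Same trick at the end: wait for the queued kernels before stopping the clock.
    context.getState(getEnergy=True)
    return time.perf_counter() - start
```

In the script above, `elapsed_time = timed_steps(context, integrator, nsteps) * unit.seconds` would then replace the lines between `initial_time = time.time()` and the `elapsed_time` assignment.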

ThWuensche commented 4 years ago

@peastman @jchodera

Thanks for your suggestion, Peter!

After the improvements suggested and after considering the F@H results from the logs on my machine, the difference gets much smaller:

RUN   F@H logs (ns/day)   openMM local (ns/day)   diff (%)   atom count
0                 111.5                   104.3        6.9        23558
1                 112.6                   105.2        7.0        23558
2                 224.3                   220.8        6.1        23558
3                 282.4                   263.7        7.1        23558
4                  44.8                    43.6        2.8        92224
5                  34.6                    33.6        3.0        92224
6                  76.0                    73.6        3.3        92224
7                  89.4                    86.6        3.2        92224
8                  35.0                    34.1        2.6        89705
9                  93.5                    89.5        4.7         4071
10              no data                    12.2    no data       371771
11                 10.4                    10.3        1.2       448584
12                114.6                   111.2        3.1        62180
13                44.85                    43.6        2.9       182669
14                 73.3                    71.5        2.5       110730
15              no data                 no data    no data        90725
16                 49.7                    46.9        6.0         4058
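
The percentage difference column appears to be the F@H value relative to the local openMM value. That choice of denominator is inferred from the numbers rather than stated anywhere, so treat this as a sketch; the function name `percent_diff` is made up for illustration. The loop recomputes the RUN 0 and RUN 4 rows from the table:

```python
def percent_diff(fah_ns_day, local_ns_day):
    """Percentage by which the F@H figure exceeds the local openMM figure.

    Assumption: the local value is the denominator (inferred from the table).
    """
    return (fah_ns_day - local_ns_day) / local_ns_day * 100.0


# Values taken from the RUN 0 and RUN 4 rows of the table.
for run, fah, local in [(0, 111.5, 104.3), (4, 44.8, 43.6)]:
    print(f"RUN {run}: {percent_diff(fah, local):.1f}%")
```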

So two systematic errors contributed to the larger difference in the first analysis: the one pointed out by Peter, and the fact that I had taken the figures from the F@H stats rather than from the reports of the F@H runs in the logs on my machine. (The stats may be higher, since they take the maximum, and that maximum may come from other machines with a different setup.) Some systematic error is left, as I ran a fixed 50000 steps, while the runs on F@H had different step counts based on the estimated runtime on a 2080 Ti.

So some difference remains: about 2.5% to 3% for larger systems, somewhat higher for smaller systems. The irregularity on RUN9 (F@H results lower than local openMM) did not show in the data from my logs. Anyhow, the differences now seem so small, with the larger ones in small, quick projects where step count could contribute, that from my side further analysis is not required. Unless you are interested in further analysis, I think the issue can be closed.

Sorry for bothering you and thanks for the help.

FoldingAtHome / openmm

Different performance from test WU run on FAH and on openMM #35