FoldingAtHome / openmm

OpenMM is a toolkit for molecular simulation using high performance GPU code.

Considerations on project size (atom count) and AMD (GCN) performance #36

Open ThWuensche opened 4 years ago

ThWuensche commented 4 years ago

For some time now I have been trying to analyze the low performance of AMD GPUs (GCN) on small projects and to find improvements. I am now considering a possible reason for the observed low performance on simulations with small atom counts. That observation - if confirmed - could help project owners systematically decide which projects will have reasonable performance on AMD GCN-type GPUs and which will not, and thus which projects to assign to which GPUs.

AMD GCN has a wavefront size of 64 threads, but these 64 threads are not executed on a full compute unit (CU) in parallel; they run on one SIMD unit (out of 4 in a CU), serialized into four consecutive 16-thread parts, so a wavefront takes 4 cycles. Each CU has 4 SIMD units, and the Radeon VII (my case) has 60 CUs. As a conclusion, to occupy the GPU completely, not 3840 threads are required, as might seem obvious (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * 4 SIMDs/CU * 60 CUs), taking 4 cycles. That makes the architecture very wide, wider even than the NVidia RTX 3090, but it executes that wide thread count not in one cycle but in four (effectively 3840 threads per cycle).

If my conclusion is correct, that would explain why AMD GCN devices are much more sensitive to small projects than even the widest NVidia GPUs. I have to admit that I don't really know how NVidia handles its warps, but as far as I understand they are executed in parallel (not partially serialized like on AMD GCN).

If we correlate the above thoughts with an excerpt of the kernel call characteristics for sample project 16921 (79,24,68):

Executing Kernel computeNonbonded, workUnits 61440, blockSize 64, size 61440
Executing Kernel reduceForces, workUnits 21024, blockSize 128, size 21120
Executing Kernel integrateLangevinPart1, workUnits 21016, blockSize 64, size 21056
Executing Kernel applySettleToPositions, workUnits 6810, blockSize 64, size 6848
Executing Kernel applyShakeToPositions, workUnits 173, blockSize 64, size 192
Executing Kernel integrateLangevinPart2, workUnits 21016, blockSize 64, size 21056
Executing Kernel clearFourBuffers, workUnits 320760, blockSize 128, size 122880
Executing Kernel findBlockBounds, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortShortList, workUnits 256, blockSize 256, size 256
Executing Kernel sortBoxData, workUnits 21016, blockSize 64, size 21056
Executing Kernel findBlocksWithInteractions, workUnits 21016, blockSize 256, size 21248
Executing Kernel updateBsplines, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeRange, workUnits 256, blockSize 256, size 256
Executing Kernel assignElementsToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBucketPositions, workUnits 256, blockSize 256, size 256
Executing Kernel copyDataToBuckets, workUnits 21016, blockSize 64, size 21056
Executing Kernel sortBuckets, workUnits 21120, blockSize 128, size 21120
Executing Kernel gridSpreadCharge, workUnits 21016, blockSize 64, size 21056
Executing Kernel finishSpreadCharge, workUnits 157464, blockSize 64, size 61440
Executing Kernel packForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackForwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridEvaluateEnergy, workUnits 157464, blockSize 64, size 61440
Executing Kernel reciprocalConvolution, workUnits 157464, blockSize 64, size 61440
Executing Kernel packBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel execFFT, workUnits 78732, blockSize 108, size 78732
Executing Kernel unpackBackwardData, workUnits 78732, blockSize 64, size 61440
Executing Kernel gridInterpolateForce, workUnits 21016, blockSize 64, size 21056
Executing Kernel computeBondedForces, workUnits 23455, blockSize 64, size 23488

we see that many kernels of that project run with 21016 threads (the project is specified with 21000 atoms in the project list). Assuming that different kernels are not executed in parallel, the first 15360 threads fully occupy the GPU, but the remaining 5656 threads use only about 1/3 of its capacity.

Other projects, like RUN9 of the benchmark projects (with 4071 atoms), will load the Radeon VII only at about 25% for many of the important kernels.

Conversely, in large projects the ineffectively used part of the GPU load makes up only a small share, which is where the generally good performance of GCN devices is observed.

As a conclusion, projects with slightly fewer than x * effective thread width atoms (15360 for Radeon VII, gfx906) should run well, while projects slightly above that will run rather poorly (especially if the excess is not covered by a very high atom count). So a project with 15000 atoms will probably run well, a project with 16000 atoms rather poorly. For GPU utilization that relation should probably be considered in addition to the general classification of a project to a group of GPUs.
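To make the arithmetic concrete, here is a small illustrative calculation (my own sketch; the hardware constants are the Radeon VII figures assumed above, with one thread per atom and no overlap between kernels):

#include <cstdio>

int main() {
    // Assumed Radeon VII (gfx906) figures from the discussion above.
    const int wavefrontSize = 64;   // threads per wavefront
    const int simdsPerCU    = 4;    // SIMD units per compute unit
    const int numCUs        = 60;   // compute units on the card

    // Threads needed so every SIMD unit has one full wavefront in flight.
    const int fullWidth = wavefrontSize * simdsPerCU * numCUs;   // 15360

    // Utilization of the last "wave" for a few example atom counts.
    const int examples[] = {15000, 16000, 21016};
    for (int threads : examples) {
        int fullWaves = threads / fullWidth;
        int remainder = threads % fullWidth;
        std::printf("%6d threads: %d full wave(s) + %4.0f%% of one extra wave\n",
                    threads, fullWaves, remainder * 100.0 / fullWidth);
    }
    return 0;
}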

Critical cross-checking of the above thoughts is welcome!

peastman commented 4 years ago

Can you test it and see? You can use the script in https://github.com/openmm/openmm/issues/2875#issuecomment-704521525 to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?

As a conclusion, to occupy the GPU completely, not 3840 threads are required, as might seem obvious (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * 4 SIMDs/CU * 60 CUs), taking 4 cycles.

Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:

https://github.com/FoldingAtHome/openmm/blob/9998958af39c657d26da68aad0177fb8c45180a1/platforms/opencl/src/OpenCLContext.cpp#L236-L241

In the case above, though, we cap it at a lower value because the number of atoms is so low. There just isn't enough work to fill that many workgroups.
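Roughly speaking, the heuristic is the one sketched below (an illustrative paraphrase, not the actual code linked above; blocksPerComputeUnit stands in for the tuned multiplier):

#include <algorithm>

// Launch several workgroups per compute unit so the scheduler has other
// wavefronts to switch to while one waits on global memory, but never
// launch more workgroups than there is work to fill.
int chooseNumWorkgroups(int numComputeUnits, int workUnits, int blockSize,
                        int blocksPerComputeUnit) {
    int desired = blocksPerComputeUnit * numComputeUnits;
    int needed = (workUnits + blockSize - 1) / blockSize;   // ceiling division
    return std::min(desired, needed);
}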

Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.

ThWuensche commented 4 years ago

Can you test it and see? You can use the script in openmm#2875 (comment) to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?

After writing that, it occurred to me that I could check it against the figures we had from our test runs. Interestingly, it turned out differently: from 15000 to 16000 atoms the overall time actually dropped. I'm trying to analyze that with traces from the runs, but don't understand it yet.

Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:

I had just studied the thread counts you use for the different kernels and wondered why the nonbonded calculation had a fixed size; I now understand that you choose 4 * 15360 as the fixed size. The reduced time for 16000 compared to 15000 atoms might actually result from the fact that - with only single occupancy - the step up to 16000 somewhat reduces the starving. At higher multiples of 15360 that effect does not show up; in the table we see a serious increase in time when we exceed triple occupancy:

45792 | 1431 | 5.13 | 4.98
50442 | 1577 | 6.13 | 5.99

So the general idea was probably right, but I did not consider the "starved" issue. Probably exceeding 15360 at first comes at no cost at all, since the additional work is done while the GPU would be waiting anyhow. At higher multiples, when the GPU would not lose many cycles waiting, the step beyond a multiple of the effective thread width really adds time and reduces efficiency.

Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.

Agreed, but I see many small projects (like the example with 21000 atoms) assigned to my GPUs, delivering low performance. And with GPUs getting larger and larger, F@H will probably not have enough small GPUs for some projects in the future. That will then be the time to run simulations in parallel.

Sorry if, as a newbie, I come up with ideas that you have long known to be nonsense. But sometimes it still seems helpful to keep one's eyes open and contribute ideas, so let's just discard the ones that are nonsense.

ThWuensche commented 4 years ago

I'm trying to understand the traces and why, according to them, the GPU is idle about 1/3 of the time; kernels are active only 2/3 of the time. At first I assumed kernel scheduling latency, i.e. idle time after one kernel finishes until the next one starts. However, if I look at the time CompleteNS - DispatchNS of the next kernel, a number of kernels seem to be dispatched ahead, before the previous one completes, since that difference is negative - that's the column before StartLat. Still, there are (in most cases) about 4 us between the end of one kernel and the start of the next: the Idle column is the difference between BeginNS of a kernel and EndNS of the kernel before it.

(screenshot: kernel trace table)

Or are kernels dispatched in advance by the application, while the runtime system (maybe in the Linux kernel driver) waits until a kernel has finished before it executes the next one, which had been scheduled earlier by the application?

peastman commented 4 years ago

Probably exceeding 15360 at first comes at no cost at all, since the additional work is done while the GPU would be waiting anyhow. At higher multiples, when the GPU would not lose many cycles waiting, the step beyond a multiple of the effective thread width really adds time and reduces efficiency.

That sounds right.

Still there are (in most cases) about 4us between end of one kernel and start of the next

If so, they've improved! AMD used to have an overhead of about 10 us for each kernel. That may still be the case on Windows, which has higher overhead than Linux. This is why we try to minimize the number of separate kernel launches.

Agreed, but I see many small projects (like the example with 21000 atoms) assigned to my GPUs, delivering low performance.

Yes, we definitely need to do something about that!

ThWuensche commented 4 years ago

BTW, in case it is of interest to you, here are stats for the run with width 5.25 (longer runtime), i.e. the run with an atom count just below the limit, and for width 5.5 (shorter runtime), just above the limit. The difference lies mostly in computeNonbonded, which runs with a fixed thread size.

(screenshots: traces for width 5.25 and width 5.5)

ThWuensche commented 4 years ago

The test with higher granularity. In general it shows an over-proportional increase in time when the atom count crosses a multiple of the maximum thread count of the GPU. However, the effect is smaller than expected and there is an irregularity at the first step, where it would be most critical. That irregularity seems to come from computeNonbonded, which has a fixed thread size and as such does not look related to that point. There is also some noise in the measured values - for example at atom count 113958 - so the time deltas are not smooth.

List 1 List 2 Time dTime
12255 383 2,91248488426208
12990 406 2,88024377822876 -0,032241106033325
13836 433 3,05391716957092 0,173673391342163
14514 454 3,02331209182739 -0,03060507774353
15477 484 2,71697306632996 -0,306339025497437
16272 509 2,79023265838623 0,073259592056275
17178 537 2,82571578025818 0,035483121871948
18138 567 2,97455453872681 0,148838758468628
19026 595 3,0194935798645 0,044939041137695
20340 636 3,12287950515747 0,103385925292969
21384 669 3,20537495613098 0,082495450973511
21480 672 3,13328719139099 -0,07208776473999
23568 737 3,30896997451782 0,175682783126831
24498 766 3,38099765777588 0,072027683258057
25572 800 3,47666311264038 0,095665454864502
26964 843 3,50756669044495 0,030903577804566
28179 881 3,61467909812927 0,107112407684326
29424 920 3,70178055763245 0,087101459503174
30789 963 3,94584131240845 0,244060754776001
32325 1011 4,00474500656128 0,058903694152832
33840 1058 4,01952767372131 0,014782667160034
35211 1101 4,11354827880859 0,09402060508728
36708 1148 4,28623414039612 0,172685861587524
38217 1195 4,39107799530029 0,104843854904175
39825 1245 4,42315292358398 0,032074928283691
41358 1293 4,66109704971314 0,23794412612915
43305 1354 4,88815975189209 0,227062702178955
44958 1405 4,89233231544495 0,004172563552856
46629 1458 5,56133365631104 0,669001340866089
48441 1514 5,58660101890564 0,025267362594605
50415 1576 5,90663361549377 0,320032596588135
52341 1636 6,04964709281921 0,143013477325439
54444 1702 6,23829650878906 0,188649415969849
56142 1755 6,32623052597046 0,087934017181397
58515 1829 6,49991869926453 0,173688173294067
60414 1888 6,56362867355347 0,063709974288941
62562 1956 6,86830520629883 0,304676532745361
64830 2026 6,90527391433716 0,03696870803833
66906 2091 7,02717351913452 0,121899604797363
69921 2186 7,35353469848633 0,326361179351807
72279 2259 7,55564570426941 0,202111005783081
72495 2266 7,51527237892151 -0,0403733253479
77148 2411 7,8772234916687 0,361951112747192
79191 2475 7,98567771911621 0,10845422744751
81540 2549 8,17700910568237 0,191331386566162
84534 2642 8,2921895980835 0,115180492401123
87087 2722 8,45307517051697 0,160885572433472
89712 2804 8,62603664398193 0,172961473464966
92592 2894 9,14751052856445 0,52147388458252
95766 2993 9,43603682518005 0,288526296615601
98880 3090 9,76397943496704 0,327942609786987
101697 3179 10,0979392528534 0,333959817886353
104721 3273 10,1487267017365 0,050787448883057
107736 3367 10,4614281654358 0,312701463699341
110937 3467 10,7668445110321 0,305416345596313
113958 3562 11,3086459636688 0,541801452636719
117774 3681 11,4619560241699 0,153310060501099
120948 3780 11,6951687335968 0,23321270942688
124188 3881 12,1348485946655 0,439679861068726
127650 3990 12,4355757236481 0,300727128982544

ThWuensche commented 4 years ago

Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels? Is that worth experimenting with? If the gaps could be avoided, short-running kernels could theoretically gain about 50% (or more) in throughput, cutting execution time by 1/3 to 1/2. Or am I missing something?
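For reference, what I have in mind is something like the following host-side sketch (my own illustration, not OpenMM code; the kernels, queue properties and work sizes are placeholders, and error handling is omitted):

#include <CL/cl.h>

// Out-of-order queue where the dependency between two kernels is expressed
// through an event instead of the implicit in-order serialization.
void enqueueWithEvents(cl_context ctx, cl_device_id dev,
                       cl_kernel kernelA, cl_kernel kernelB, size_t globalSize) {
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_int err;
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, dev, props, &err);

    cl_event evA;
    clEnqueueNDRangeKernel(queue, kernelA, 1, nullptr, &globalSize, nullptr,
                           0, nullptr, &evA);
    // kernelB may only start after kernelA, but the runtime is free to set it
    // up (and to run unrelated work) before kernelA has completed.
    clEnqueueNDRangeKernel(queue, kernelB, 1, nullptr, &globalSize, nullptr,
                           1, &evA, nullptr);
    clFinish(queue);

    clReleaseEvent(evA);
    clReleaseCommandQueue(queue);
}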

peastman commented 4 years ago

Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels?

The gap reflects the work the driver and GPU have to do to set up a new kernel. It's independent of any synchronization between kernels.

CUDA 10 introduced a new mechanism that can precompute some of that work to reduce the overhead. See https://developer.nvidia.com/blog/cuda-graphs. Some day when we move to CUDA 10 as our minimum supported version I hope to be able to make use of it. But there's no similar mechanism in OpenCL.
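For illustration, the capture/replay pattern looks roughly like this (a sketch of the CUDA 10 runtime API only, not OpenMM code; the captured memory operations merely stand in for the per-step kernel launches):

#include <cuda_runtime.h>

// Record a fixed sequence of asynchronous operations once, then replay it
// many times with a single launch call, avoiding most per-call launch overhead.
void runWithGraph(float* devA, float* devB, size_t bytes, int steps) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Capture the work of one "step".
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaMemcpyAsync(devB, devA, bytes, cudaMemcpyDeviceToDevice, stream);
    cudaMemsetAsync(devA, 0, bytes, stream);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // Replay the instantiated graph for every step.
    for (int i = 0; i < steps; i++)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}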

ThWuensche commented 4 years ago

According to the documentation, with out-of-order execution kernels should be able to run in parallel, so the next kernel might already be set up and started while the previous one is still running, thus closing/avoiding that gap. But I tried it and on my platform it does not seem to work; kernels are not started in parallel, but one after the other :(.

bdenhollander commented 4 years ago

@ThWuensche Your trace has a significant amount of time in findBlocksWithInteractions. Can you try changing 32 to 64 in findInteractingBlocks.cl to check if performance is better or worse?

#if SIMD_WIDTH <= 32

ThWuensche commented 4 years ago

That makes it spend even more time there. And, more puzzling, the time in computeNonbonded also increases. How could that be related to the block size in another kernel - it should not have any influence on the data, should it? Probably worth analyzing; I'll see whether I can understand what's going on. Does that buffer size influence the actual data, e.g. more data to be computed in computeNonbonded?

Also, a lot in that kernel is hard-coded to "32", including things that depend on local_index. That probably puts two "warps" into one "wavefront".

(screenshot)

bdenhollander commented 4 years ago

That makes it spend even more time there. And, more puzzling, the time in computeNonbonded also increases. How could that be related to the block size in another kernel - it should not have any influence on the data, should it? Probably worth analyzing; I'll see whether I can understand what's going on. Does that buffer size influence the actual data, e.g. more data to be computed in computeNonbonded?

Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.

The buffer size might be optimized to fit within local memory on NVIDIA, which is larger than on GCN.

CL_DEVICE_LOCAL_MEM_SIZE
AMD GCN: 32KB
NVIDIA: 48KB
AMD Navi: 64KB

ThWuensche commented 4 years ago

I just saw that when I looked into the code. It's not only the size of that buffer; it selects a completely different implementation.

But why does that have influence on computeNonbonded?

Regarding local memory: the LDS (local data share) on my version of GCN should be 64KB; I don't know whether it was less on older GCN versions.

ThWuensche commented 4 years ago

The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.

bdenhollander commented 4 years ago

But why does that have influence on computeNonbonded?

I don't know enough about what's happening internally to know whether the kernels are stalling each other or the results of the two algorithms are slightly different because of things like repeated multiplications or partial sums.

Regarding the local memory LDS (local data share) on my version of GCN should be 64KB, don't know whether it was less on older GCN versions.

CompuBench reports 32KB for Radeon VII. RX 460 (GCN 4.0) and Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.

bdenhollander commented 4 years ago

The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.

As a percentage of total execution time or average wall duration per call?

ThWuensche commented 4 years ago

CompuBench reports 32KB for Radeon VII. RX 460 (GCN 4.0) and Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.

Think they got something wrong. My Radeon VII is gfx906, that's part of the output of clinfo:

Local memory size:                             65536

...

Name:                                          gfx906
Vendor:                                        Advanced Micro Devices, Inc.

ThWuensche commented 4 years ago

As a percentage of total execution time or average wall duration per call?

Both, I think. The simulated system was created by a script from Peter Eastman in that issue. The traces above are from two runs of that script with atom count and run time according to lines 2 and 3 in the table; the run with the larger atom count has about 8% lower runtime, correlating with the lower time for these two kernels in the traces.

bdenhollander commented 4 years ago

Think they got something wrong. My Radeon VII is gfx906, that's part of the output of clinfo:

Local memory size:                             65536

Interesting, is that the value for CL_DEVICE_LOCAL_MEM_SIZE or CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD?

RX 460:

Local memory size                               32768 (32KiB)
Local memory syze per CU (AMD)                  65536 (64KiB)

This guide mentions LDS is 64KB with a maximum allocation of 32KB per workgroup.
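For comparison, the standard value can also be queried directly with clGetDeviceInfo (a minimal sketch; it simply takes the first GPU of the first platform and omits error handling):

#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_ulong localMem = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMem), &localMem, nullptr);
    std::printf("CL_DEVICE_LOCAL_MEM_SIZE: %llu bytes\n", (unsigned long long) localMem);

    // The per-CU figure (CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD) is an
    // AMD extension and needs the vendor constant from the extension headers.
    return 0;
}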

bdenhollander commented 4 years ago

Found this tidbit about Vega LDS size:

[C]urrently Vega has 64 KB local/shared memory enabled on Linux, but 32 KB on Windows.

https://community.amd.com/thread/246353

ThWuensche commented 4 years ago

I meant LDS, and LDS is per CU. I was not aware of the 32KB-per-workgroup limitation.

I just tried, I can define a buffer:

__local uint buffer[8192 + 4096];

and the compiler does not complain.

            for (int j = 8192 + 4096 - 1; j >= 0; j--)
                buffer[j] = (uint) j;
            out[0] = buffer[8192 + 4096 - 1];

out[0] comes up as 0x2fff, so I guess I have more than 32KB.

That is with ROCm 3.7 on Linux.

peastman commented 4 years ago

Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.

They're still correct. On a GPU with a SIMD width of 64, that kernel has severe thread divergence.

And, more puzzling, the time in computeNonbonded also increases.

I don't know of any reason that would be affected.

ThWuensche commented 4 years ago

I'm just reading computeNonbonded and thinking about thread divergence. There is the distinction between tiles on the diagonal and off the diagonal. If, with a wavefront size of 64, two tiles land in one wavefront and one is on the diagonal and one off the diagonal, the wavefront would probably see rather large thread divergence - is that correct?

ThWuensche commented 4 years ago

I don't know of any reason that would be affected.

The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions I will recheck.

peastman commented 4 years ago

If, with a wavefront size of 64, two tiles land in one wavefront and one is on the diagonal and one off the diagonal, the wavefront would probably see rather large thread divergence - is that correct?

Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.

https://github.com/FoldingAtHome/openmm/blob/9998958af39c657d26da68aad0177fb8c45180a1/platforms/opencl/src/OpenCLNonbondedUtilities.cpp#L175-L187
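Conceptually the reordering just groups the diagonal tiles together (the snippet below is only an illustration of the idea, not the code linked above):

#include <algorithm>
#include <vector>

struct Tile { int x, y; };   // block indices of a tile

// With a SIMD width of 64, two 32-thread tiles share one wavefront.  Putting
// all diagonal tiles (x == y) first and all off-diagonal tiles after them
// means both halves of a wavefront almost always take the same code path.
void groupDiagonalTiles(std::vector<Tile>& tiles) {
    std::stable_partition(tiles.begin(), tiles.end(),
                          [](const Tile& t) { return t.x == t.y; });
}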

ThWuensche commented 4 years ago

I don't know of any reason that would be affected.

The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions I will recheck.

Confirmed that it does make a difference. The first trace is with the standard findInteractingBlocks path for AMD, the second with the algorithm for the other platforms. The execution time of computeNonbonded gets significantly longer.

(screenshots: traces with the AMD path and with the standard path)

ThWuensche commented 4 years ago

Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.

Thanks for the explanation. I had noticed that difference in the sort, but did not understand its meaning/consequences.

ThWuensche commented 4 years ago

About the results from the test script: what concerns me is that from width 5.25 to width 5.5 the execution time drops by 0.2 s, even though the system size increases. That difference comes just from these two kernels. Together with the effect above, that leaves a question mark for me.

peastman commented 4 years ago

Try modifying the line in the script that creates the context to instead be

context = Context(system, integrator, Platform.getPlatformByName('OpenCL'), {'DisablePmeStream':'true'})

After you do that, do you still see a difference in the speed of the nonbonded kernel?

ThWuensche commented 4 years ago

The difference remains: (screenshots) I will try to include trace markers to find where it happens. Edit: Markers in roctracer seem to relate to host code, not kernel code. So that route probably does not exist; I don't know how to extract the runtime of parts of kernel code, like the runtime of a loop.

ThWuensche commented 4 years ago

I've been searching all day, but without success: is there any way to read the OpenCL device timestamp (or another timestamp/counter) directly on the device, so that it can easily be used to save timestamps reflecting the progress of a kernel (e.g. one timestamp saved before a loop, another saved after the end of the loop)?

The only thing I find is events, including profiling events, but that does not seem to be an easy way to get timestamps during progress through a kernel.

Edit: Searching how to access S_MEMTIME

Ok, seems I managed that with some inline assembler. Will test runtime of the two loops now.

ThWuensche commented 4 years ago

OK, here is the result from adding up the outputs of in-kernel time measurements for the first and second loop of computeNonbonded. Interestingly, the difference in accumulated time of computeNonbonded between the two sort variants, which was visible in the trace, is (almost) not replicated in this output. However, the difference in the total time, where the smaller system takes much more time than the larger one, is reproduced. The difference stems from the second loop.

I have added the modified nonbonded.cl (renamed to .txt) for verification of what I have done/calculated.

Lists interactAMD_loop1 interactAMD_loop2 interactSTD_loop1 interactStd_loop2
14190/444 58592 9843306 60444 10080088
16284/509 62540 6025212 62732 6189640

nonbonded.txt

ThWuensche commented 4 years ago

Could the longer time on smaller system be related to efficiency effects of "singlePeriodicCopy" code diversion?

Edit: Seems like it; on the first, smaller system singlePeriodicCopy is 0, on the larger system, which runs faster, it is 1. So the fact that the lower runtime occurs right in the region where the number of atoms crosses the maximum number of threads on the Radeon VII was coincidence; the latter is not the reason for the lower runtime. With that double influence it is hard to say how the speed of the simulation changes when the atom count jumps over the number of concurrent threads (the original question of this thread).

Side note: I just realized that the number of times the second loop is executed is 7 for the smaller system, 5 for the larger system. 5 I understand, 7 not. (5 resulting as 32 * (16248 atoms / 32)^2 / (4*15360), with 32 being tilesize, 15360 total number of SIMD lanes ???).

ThWuensche commented 4 years ago

In nonbonded.cl:

That probably should never occur, since it is handled at a higher level?

    unsigned int numTiles = interactionCount[0];
    if (numTiles > maxTiles)
        return; // There wasn't enough memory for the neighbor list.

If it did occur, interactions would not be calculated, right?

bdenhollander commented 4 years ago

Try modifying the line in the script that creates the context to instead be

context = Context(system, integrator, Platform.getPlatformByName('OpenCL'), {'DisablePmeStream':'true'})

Will that actually do anything on AMD? PmeStream seems to be NVIDIA only.

usePmeQueue = (!cl.getPlatformData().disablePmeStream && isNvidia);

ThWuensche commented 4 years ago

Would that version work and reduce thread divergence?

In findInteractingBlocks.cl:

/**
 * Perform a parallel prefix sum over an array.  The input values are all assumed to be 0 or 1.
 */
void prefixSum(__local short* sum, __local ushort2* temp) {
    for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
        temp[i].x = sum[i];
    barrier(CLK_LOCAL_MEM_FENCE);
    int whichBuffer = 0;
    for (int offset = 1; offset < BUFFER_SIZE; offset *= 2) {
#if 0
        if (whichBuffer == 0)
            for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
                temp[i].y = (i < offset ? temp[i].x : temp[i].x+temp[i-offset].x);
        else
            for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
                temp[i].x = (i < offset ? temp[i].y : temp[i].y+temp[i-offset].y);
        whichBuffer = 1-whichBuffer;
#endif
        for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
            if (whichBuffer == 0)
                temp[i].y = (i < offset ? temp[i].x : temp[i].x+temp[i-offset].x);
            else
                temp[i].x = (i < offset ? temp[i].y : temp[i].y+temp[i-offset].y);
        whichBuffer = 1-whichBuffer;

        barrier(CLK_LOCAL_MEM_FENCE);
    }
#if 0
    if (whichBuffer == 0)
        for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
            sum[i] = temp[i].x;
    else
        for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
            sum[i] = temp[i].y;
#endif
    for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
        sum[i] = whichBuffer?temp[i].y:temp[i].x;

    barrier(CLK_LOCAL_MEM_FENCE);
}

Edit: I benchmarked that, but don't see improvements. Maybe the compiler optimizes the standard version - or there is no improvement in the new code.

peastman commented 4 years ago

Question is, whether the short times result from the fact that the kernel returns immediately

Correct.

Does the kernel have to be called at all in that case, or could the call overhead be eliminated (or does that value come from another kernel, so that the host does not know whether it has to be scheduled - however, I could not find it being used in any other place)?

Exactly. It first gets set to 0 in findBlockBounds(). If any atom has moved far enough to require the neighbor list to be rebuilt, it gets set to 1 in sortBoxData().

bdenhollander commented 4 years ago

Could the longer time on smaller system be related to efficiency effects of "singlePeriodicCopy" code diversion?

Edit: Seems like it; on the first, smaller system singlePeriodicCopy is 0, on the larger system, which runs faster, it is 1. So the fact that the lower runtime occurs right in the region where the number of atoms crosses the maximum number of threads on the Radeon VII was coincidence; the latter is not the reason for the lower runtime. With that double influence it is hard to say how the speed of the simulation changes when the atom count jumps over the number of concurrent threads (the original question of this thread).

Side note: I just realized that the number of times the second loop is executed is 7 for the smaller system, 5 for the larger system. 5 I understand, 7 not. (5 resulting as 32 * (16248 atoms / 32)^2 / (4*15360), with 32 being tilesize, 15360 total number of SIMD lanes ???).

I see a similar sawtooth between 14514 and 15477 on my RX 460, which has 896 shaders. I also see one between 3762 and 4158. (chart of times vs. atom count) (Times are using the script but with 600 steps instead of 3000 to compensate for less than 1/5 of the FLOPS.)

ThWuensche commented 4 years ago

I see a similar sawtooth between 14514 and 15477 on my RX 460, which has 896 shaders. I also see one between 3762 and 4158.

I think the first sawtooth might come from the sort switching from sortShortList to the general sort algorithm. For me that happened at a larger system size, but it also depends on the shader count. Did you apply the patch?

Actually I now wonder whether the "max" in that code line is correct, shouldn't it be a min, so that local memory is not exceeded?

int maxShortList = max(maxLocalBuffer, (int) OpenCLContext::ThreadBlockSize*context.getNumThreadBlocks());

But should that decision also depend on which of the two sortShortList implementations is used further on? sortShortList2 depends on the maximum number of parallel threads, while sortShortList is limited to one CU anyway, so it depends only on the memory limit, not on the number of available parallel threads.

The second sawtooth seems to come from the change of algorithm according to singlePeriodicCopy. I included the following debug output in nonbonded.cl (at the end of the second loop):

        }
        pos++;

    if (get_global_id(0) == 0)
        printf("loopcnt %d, singlePeriodic %d\n",loopcnt,singlePeriodicCopy);

    }

If you do that, what does it show?

BTW, the way I found to get timestamps at different places in the code is based on that piece of inline assembler:

    if (get_global_id(0) == 0)
    __asm__ volatile ("s_memrealtime %0\n s_waitcnt lgkmcnt(0)" : "=s" (end_time) );

where end_time is a variable in the kernel (unsigned long end_time;). There is also another counter, s_memtime, which is not clock compensated.

That's of course not portable; NVidia would probably have a problem with it. I don't know whether it also works on the RX 460.

peastman commented 4 years ago

Actually I now wonder whether the "max" in that code line is correct, shouldn't it be a min, so that local memory is not exceeded?

That needs cleaning up a little. The max is correct, but it should only be done on NVIDIA. That's the maximum size to efficiently use the sortShortList2() kernel.

ThWuensche commented 4 years ago

Actually I now wonder whether the "max" in that code line is correct, shouldn't it be a min, so that local memory is not exceeded?

That needs cleaning up a little. The max is correct, but it should only be done on NVIDIA. That's the maximum size to efficiently use the sortShortList2() kernel.

But shouldn't it still be a min, limiting the threads either by memory or by SIMD lanes, whichever is lower? As it is, if the SIMD units allow more than the memory does, that would exceed the memory resources. Of course there is the try/catch later on, which would probably fall back to the standard path anyhow if the memory cannot be allocated. But maybe I'm missing something.

peastman commented 4 years ago

sortShortList2() doesn't use local memory. If we're going to use that kernel, the amount of memory doesn't matter.

ThWuensche commented 4 years ago

Ok, thanks, so the first term in the max() is actually not relevant for sortShortList2 and the second not relevant for sortShortList.

ThWuensche commented 4 years ago

I know I have rather spammed this thread, but I would like to come back to this question, since if it is not caught elsewhere it could lead to wrong calculations depending on the system size.

ThWuensche commented 4 years ago

I just tried to run RUN11 (the big system with 448584 atoms) from F@H benchmarks again on two GPUs in parallel, but get NaN, while with one GPU it is working:

Traceback (most recent call last):
  File "run-core-xml.py", line 56, in <module>
    integrator.step(nsteps)
  File "/usr/local/lib/python3.7/dist-packages/simtk/openmm/openmm.py", line 7658, in step
    return _openmm.LangevinIntegrator_step(self, steps)
simtk.openmm.OpenMMException: Particle coordinate is nan

Trying again I get:

Start simulation
Memory access fault by GPU node-1 (Agent handle: 0x1dae860) on address 0x7f2ea47b6000. Reason: Unknown.
Abgebrochen

Here is the trace data for the case with the Memory access fault: results.txt

Not important for me, since I'm mostly looking at the use in F@H, which does not use multiple GPUs in one work unit, but I still wanted to mention it. If you need more tracing, please let me know.

peastman commented 4 years ago

If it did occur, interactions would not be calculated, right?

Correct. That indicates we haven't allocated enough memory on the GPU to hold the neighbor list. In that case we detect the problem on the host side and allocate larger arrays.

https://github.com/FoldingAtHome/openmm/blob/9998958af39c657d26da68aad0177fb8c45180a1/platforms/opencl/src/OpenCLNonbondedUtilities.cpp#L386-L392

We then set a couple of flags:

https://github.com/FoldingAtHome/openmm/blob/9998958af39c657d26da68aad0177fb8c45180a1/platforms/opencl/src/OpenCLNonbondedUtilities.cpp#L419-L420

The first one ensures the neighbor list will be rebuilt on the next attempt, and the second one informs the Context that something went wrong in computing the forces and it should try again.
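In pseudocode the recovery pattern is roughly the following (a conceptual sketch only; the linked OpenCLNonbondedUtilities.cpp lines are the real implementation):

#include <functional>

// If the kernel reported more tiles than the preallocated arrays can hold,
// grow the arrays, force a neighbor-list rebuild and tell the context to
// recompute the forces for this step.
bool handleNeighborListOverflow(unsigned int interactionCount, unsigned int& maxTiles,
                                bool& rebuildNeighborList, bool& recomputeForces,
                                const std::function<void(unsigned int)>& resizeArrays) {
    if (interactionCount <= maxTiles)
        return false;                     // everything fit, nothing to do
    resizeArrays(interactionCount);       // allocate larger device arrays
    maxTiles = interactionCount;
    rebuildNeighborList = true;           // rebuild the list on the next attempt
    recomputeForces = true;               // this step's forces are incomplete
    return true;
}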

ThWuensche commented 4 years ago

OK, thanks, as usual you have considered everything! I just mentioned it since this is the kind of problem that is easy to overlook - a quick fix where it hurts ... (speaking from my own experience) - so I tend to bring up such things when I see them.

It would still be good to have feedback for such errors out of the kernel, instead of just not calculating. But that is not so easy due to the asynchronous nature of kernel dispatches. Maybe some error-report structure in memory (just a proposal for future error handling).

peastman commented 4 years ago

Absolutely! It actually took a while to arrive at this solution. Originally, if there wasn't enough memory for the neighbor list, it would go ahead and compute the forces with a less efficient algorithm that didn't use the neighbor list (and also set a flag telling us to allocate more memory for the next step). This could lead to that one kernel taking multiple seconds to execute. On Linux that was fine, but on Windows the OS would kill the process.

It would still be good to have feedback for such errors out of the kernel

This isn't really an error. At the start of a simulation we have little idea how much memory the neighbor list will take up. So we just make a guess, and if the guess happens to be too low we'll discover that right away and allocate more memory. This is part of the normal functioning of the kernel.