Can you test it and see? You can use the script in https://github.com/openmm/openmm/issues/2875#issuecomment-704521525 to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?
As a conclusion, to occupy the GPU completely, not 3840 threads are required, as might seem obvious (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * 4 SIMDs/CU * 60 CUs), executed over 4 cycles.
Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:
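(The code that was linked here is not reproduced; as a rough sketch of the idea, using the plain OpenCL API and an illustrative helper name, it amounts to something like this:)

#include <CL/cl.h>

// Rough sketch only, not the actual OpenMM code: launch a small multiple of the
// device's compute-unit count so that each CU has several work-groups to switch
// between while others are stalled on global memory.
size_t chooseNumWorkgroups(cl_device_id device, size_t workgroupsPerComputeUnit) {
    cl_uint numComputeUnits = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(numComputeUnits), &numComputeUnits, NULL);
    return workgroupsPerComputeUnit * numComputeUnits;   // e.g. 4-6 work-groups per CU
}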
In the case above, though, we cap it at a lower value because the number of atoms is so low. There just isn't enough work to fill that many workgroups.
Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.
Can you test it and see? You can use the script in openmm#2875 (comment) to generate and time simulations with any number of atoms you want. Do you find a large change between 15,000 and 16,000?
After writing that, it sprang to my mind that I could check this with the figures we had from our test runs. Interestingly, the result was different: from 15000 to 16000 the overall time actually dropped. I'm trying to analyze that with traces from the runs, but don't understand it yet.
Ideally you want even more than that. GPUs rely on having lots of threads that they can switch between to cover latency. If your number of threads exactly matches the width of the GPU, each compute unit will be busy until it needs to load something from global memory, and then it will stall for a few hundred clock cycles while waiting for the data to arrive. You try to cover this by having other threads it can switch to while the first thread is stuck waiting for data. When possible, we set the number of workgroups to a multiple of the number of compute units:
I have just studied the thread counts you use for the different kernels and wondered why the nonbonded calculation has a fixed size. I understand now that you chose 4*15360 as the fixed size. The reduced time for 16000 compared to 15000 might actually result from the fact that, with only single occupancy, the additional work up to 16000 somewhat reduces the starving. At higher multiples of 15360 that effect does not show up; in that table we see a serious increase in time when we exceed triple occupancy:
45792 | 1431 | 5.13 | 4.98
50442 | 1577 | 6.13 | 5.99
So the general idea was probably right, but I did not consider the "starved" issue. Exceeding 15360 probably comes at no cost at first, since the additional work is done while the GPU would be waiting anyway. At higher multiples, when the GPU would not lose many cycles waiting, the step beyond a multiple of the effective thread count really adds time and reduces efficiency.
Efficiently using a high end GPU to simulate a small system is always going to be challenging (and beyond a certain point, impossible). It's best if we can target the small projects to low end GPUs, keeping the high end GPUs free for the large systems that make good use of them.
Agreed, but I see many small projects (like the example with 21000 atoms) assigned to my GPUs, delivering low performance. And with GPUs getting larger and larger, F@H probably will not have enough small GPUs for some projects in the future. That will then be the time to run simulations in parallel.
Sorry if, as a newbie, I come up with ideas that you have long known to be nonsense. But it still seems helpful to keep eyes open and contribute ideas, so let's just discard the ones that are nonsense.
I'm just trying to understand the traces, and why according to them the GPU is idle about 1/3 of the time; kernels are active only 2/3 of the time. At first I assumed it was kernel scheduling latency, i.e. idle time after one kernel finishes until the next one starts. However, if I look at the time CompleteNS - DispatchNS of the next kernel, it seems that a number of kernels are dispatched ahead, before the previous one completes, since this difference is negative - that's the column before StartLat. Still, there are (in most cases) about 4 µs between the end of one kernel and the start of the next: the Idle column is the difference between BeginNS of a kernel and EndNS of the kernel before it.
Or are kernels dispatched in advance by the application, but the runtime system (maybe in the Linux kernel driver) waits until a kernel has finished before it executes the next one, which had been scheduled earlier by the application?
Exceeding 15360 probably comes at no cost at first, since the additional work is done while the GPU would be waiting anyway. At higher multiples, when the GPU would not lose many cycles waiting, the step beyond a multiple of the effective thread count really adds time and reduces efficiency.
That sounds right.
Still, there are (in most cases) about 4 µs between the end of one kernel and the start of the next
If so, they've improved! AMD used to have an overhead of about 10 us for each kernel. That may still be the case on Windows, which has higher overhead than Linux. This is why we try to minimize the number of separate kernel launches.
Agreed, but I see many small projects (like the example with 21000 atoms) assigned to my GPUs, delivering low performance.
Yes, we definitely need to do something about that!
BTW, in case it is of interest to you, here are stats for runs with width 5.25 (longer runtime), i.e. the run with an atom count just below the limit, and with width 5.5 (shorter runtime), just above the limit. The difference lies mostly in computeNonbonded, which runs with a fixed thread count.
Here is the test with higher granularity. In general it shows an over-proportional increase in time when the atom count crosses a multiple of the maximum thread count of the GPU. However, the increase is lower than expected and there is an irregularity at the first step, where it would be most critical. That irregularity seems to come from computeNonbonded, which has a fixed thread count and as such should not be related to that point. There is also some disturbance/noise in the measured values; for example at atom count 113958 the time deltas are not "smooth".
Atoms | Atom blocks (of 32) | Time | dTime |
---|---|---|---|
12255 | 383 | 2,91248488426208 | |
12990 | 406 | 2,88024377822876 | -0,032241106033325 |
13836 | 433 | 3,05391716957092 | 0,173673391342163 |
14514 | 454 | 3,02331209182739 | -0,03060507774353 |
15477 | 484 | 2,71697306632996 | -0,306339025497437 |
16272 | 509 | 2,79023265838623 | 0,073259592056275 |
17178 | 537 | 2,82571578025818 | 0,035483121871948 |
18138 | 567 | 2,97455453872681 | 0,148838758468628 |
19026 | 595 | 3,0194935798645 | 0,044939041137695 |
20340 | 636 | 3,12287950515747 | 0,103385925292969 |
21384 | 669 | 3,20537495613098 | 0,082495450973511 |
21480 | 672 | 3,13328719139099 | -0,07208776473999 |
23568 | 737 | 3,30896997451782 | 0,175682783126831 |
24498 | 766 | 3,38099765777588 | 0,072027683258057 |
25572 | 800 | 3,47666311264038 | 0,095665454864502 |
26964 | 843 | 3,50756669044495 | 0,030903577804566 |
28179 | 881 | 3,61467909812927 | 0,107112407684326 |
29424 | 920 | 3,70178055763245 | 0,087101459503174 |
30789 | 963 | 3,94584131240845 | 0,244060754776001 |
32325 | 1011 | 4,00474500656128 | 0,058903694152832 |
33840 | 1058 | 4,01952767372131 | 0,014782667160034 |
35211 | 1101 | 4,11354827880859 | 0,09402060508728 |
36708 | 1148 | 4,28623414039612 | 0,172685861587524 |
38217 | 1195 | 4,39107799530029 | 0,104843854904175 |
39825 | 1245 | 4,42315292358398 | 0,032074928283691 |
41358 | 1293 | 4,66109704971314 | 0,23794412612915 |
43305 | 1354 | 4,88815975189209 | 0,227062702178955 |
44958 | 1405 | 4,89233231544495 | 0,004172563552856 |
46629 | 1458 | 5,56133365631104 | 0,669001340866089 |
48441 | 1514 | 5,58660101890564 | 0,025267362594605 |
50415 | 1576 | 5,90663361549377 | 0,320032596588135 |
52341 | 1636 | 6,04964709281921 | 0,143013477325439 |
54444 | 1702 | 6,23829650878906 | 0,188649415969849 |
56142 | 1755 | 6,32623052597046 | 0,087934017181397 |
58515 | 1829 | 6,49991869926453 | 0,173688173294067 |
60414 | 1888 | 6,56362867355347 | 0,063709974288941 |
62562 | 1956 | 6,86830520629883 | 0,304676532745361 |
64830 | 2026 | 6,90527391433716 | 0,03696870803833 |
66906 | 2091 | 7,02717351913452 | 0,121899604797363 |
69921 | 2186 | 7,35353469848633 | 0,326361179351807 |
72279 | 2259 | 7,55564570426941 | 0,202111005783081 |
72495 | 2266 | 7,51527237892151 | -0,0403733253479 |
77148 | 2411 | 7,8772234916687 | 0,361951112747192 |
79191 | 2475 | 7,98567771911621 | 0,10845422744751 |
81540 | 2549 | 8,17700910568237 | 0,191331386566162 |
84534 | 2642 | 8,2921895980835 | 0,115180492401123 |
87087 | 2722 | 8,45307517051697 | 0,160885572433472 |
89712 | 2804 | 8,62603664398193 | 0,172961473464966 |
92592 | 2894 | 9,14751052856445 | 0,52147388458252 |
95766 | 2993 | 9,43603682518005 | 0,288526296615601 |
98880 | 3090 | 9,76397943496704 | 0,327942609786987 |
101697 | 3179 | 10,0979392528534 | 0,333959817886353 |
104721 | 3273 | 10,1487267017365 | 0,050787448883057 |
107736 | 3367 | 10,4614281654358 | 0,312701463699341 |
110937 | 3467 | 10,7668445110321 | 0,305416345596313 |
113958 | 3562 | 11,3086459636688 | 0,541801452636719 |
117774 | 3681 | 11,4619560241699 | 0,153310060501099 |
120948 | 3780 | 11,6951687335968 | 0,23321270942688 |
124188 | 3881 | 12,1348485946655 | 0,439679861068726 |
127650 | 3990 | 12,4355757236481 | 0,300727128982544 |
Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels? Is that worth experimenting with? If the gaps could be avoided, short-running kernels could theoretically gain about 50% (or more) in throughput, cutting execution time by 1/3 to 1/2. Or am I missing something?
Could an out-of-order command queue with explicit synchronization avoid the gap between the execution of subsequent kernels?
The gap reflects the work the driver and GPU have to do to set up a new kernel. It's independent of any synchronization between kernels.
CUDA 10 introduced a new mechanism that can precompute some of that work to reduce the overhead. See https://developer.nvidia.com/blog/cuda-graphs. Some day when we move to CUDA 10 as our minimum supported version I hope to be able to make use of it. But there's no similar mechanism in OpenCL.
According to the documentation, with out-of-order execution kernels should run in parallel, so the next kernel might already be set up and started while the previous one is still running, thus closing/avoiding that gap. But I tried roughly the setup sketched below, and on my platform it does not seem to work; kernels are not started in parallel, but one after the other :(.
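What I tried corresponds roughly to this sketch (placeholder kernel handles, not the OpenMM code):

#include <CL/cl.h>

// Sketch: an out-of-order queue where kernel B depends on kernel A only through
// an event, so the runtime is in principle free to set up B while A still runs.
void enqueuePair(cl_context ctx, cl_device_id dev,
                 cl_kernel kernelA, cl_kernel kernelB, size_t globalSize)
{
    cl_int err;
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, dev, props, &err);

    cl_event evA;
    // Kernel A has no dependencies.
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &globalSize, NULL, 0, NULL, &evA);
    // Kernel B declares its dependency on A explicitly via the event wait list.
    clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &globalSize, NULL, 1, &evA, NULL);

    clFinish(queue);
    clReleaseEvent(evA);
    clReleaseCommandQueue(queue);
}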
@ThWuensche Your trace has a significant amount of time in findBlocksWithInteractions. Can you try changing 32 to 64 in findInteractingBlocks.cl to check if performance is better or worse?
#if SIMD_WIDTH <= 32
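Presumably that just means raising the bound so that a SIMD width of 64 also takes this branch:

#if SIMD_WIDTH <= 64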
That makes it spend even more time there. And even more puzzling, the time in computeNonbonded also increases. How could that be related to the block size in another kernel? It should not have any influence on the data. Probably worth analyzing; I'll see whether I can understand what's going on. Does that buffer size influence the actual data, e.g. more data to be computed in computeNonbonded?
Also, in that kernel a lot is hard-coded to "32", including things depending on local_index. That probably puts two "warps" into one "wavefront".
That makes it spend even more time there. And even more puzzling, the time in computeNonbonded also increases. How could that be related to the block size in another kernel? It should not have any influence on the data. Probably worth analyzing; I'll see whether I can understand what's going on. Does that buffer size influence the actual data, e.g. more data to be computed in computeNonbonded?
Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.
Buffer size might be optimized to fit within local memory for NVIDIA, which is larger than GCN.
CL_DEVICE_LOCAL_MEM_SIZE
AMD GCN: 32KB
NVIDIA: 48KB
AMD Navi: 64KB
Just saw that when I looked into the code. It's not only the size of that buffer; it selects a completely different implementation.
But why does that have influence on computeNonbonded?
Regarding the local memory: the LDS (local data share) on my version of GCN should be 64KB; I don't know whether it was less on older GCN versions.
The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.
But why does that have influence on computeNonbonded?
I don't know enough about what's happening internally to know whether the kernels are stalling each other or the results of the two algorithms are slightly different because of things like repeated multiplications or partial sums.
Regarding the local memory: the LDS (local data share) on my version of GCN should be 64KB; I don't know whether it was less on older GCN versions.
CompuBench reports 32KB for Radeon VII. RX 460 (GCN 4.0) and Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.
The other question is why these calculations (findBlocksWithInteractions and computeNonbonded) take more time for lower atom counts than for higher atom counts.
As a percentage of total execution time or average wall duration per call?
CompuBench reports 32KB for Radeon VII. RX 460 (GCN 4.0) and Spectre R7 Graphics APU (GCN 2.0) have 32KB as well.
I think they got something wrong. My Radeon VII is gfx906; this is part of the output of clinfo:
Local memory size: 65536
...
Name: gfx906
Vendor: Advanced Micro Devices, Inc.
As a percentage of total execution time or average wall duration per call?
Both, I think. The simulated system was created with the script from Peter Eastman in that issue. The traces above are from two runs of that script, with atom count and run time according to lines 2 and 3 in the table; the run with the larger atom count has about 8% lower runtime, correlating with the lower time for these two kernels in the traces.
I think they got something wrong. My Radeon VII is gfx906; this is part of the output of clinfo:
Local memory size: 65536
Interesting, is that the value for CL_DEVICE_LOCAL_MEM_SIZE or CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD?
RX 460:
Local memory size 32768 (32KiB)
Local memory size per CU (AMD) 65536 (64KiB)
This guide mentions LDS is 64KB with a maximum allocation of 32KB per workgroup.
Found this tidbit about Vega LDS size:
[C]urrently Vega has 64 KB local/shared memory enabled on Linux, but 32 KB on Windows.
I meant the LDS, and the LDS is per CU. I was not aware of the 32KB-per-workgroup limitation.
I just tried; I can define a buffer:
__local uint buffer[8192+4096];
and the compiler does not complain.
for (int j = 8192+4096-1; j >= 0; j--)
    buffer[j] = (uint)j;
out[0] = buffer[8192+4096-1];
out[0] comes up with 0x2fff, so I guess I have more than 32KB.
That is with ROCm 3.7 on Linux.
Changing the if makes GCN use the same code path as NVIDIA and Navi. I wanted to confirm that the comments around L277 are still accurate for wide GCN GPUs.
They're still correct. On a GPU with a SIMD width of 64, that kernel has severe thread divergence.
And even more puzzling, the time in computeNonbonded also increases.
I don't know of any reason that would be affected.
I'm just reading computeNonbonded and thinking about thread divergence. There is the distinction between tiles on the diagonal and off the diagonal. If, with a wavefront size of 64, two tiles are in one wavefront and one is on the diagonal while the other is off the diagonal, the wavefront would probably see rather large thread divergence, is that correct?
I don't know of any reason that would be affected.
The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions I will recheck.
If, with a wavefront size of 64, two tiles are in one wavefront and one is on the diagonal while the other is off the diagonal, the wavefront would probably see rather large thread divergence, is that correct?
Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.
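Schematically, the idea is something like this (a paraphrase, not the actual sorting code):

#include <algorithm>
#include <vector>

struct Tile { int x, y; };   // block indices of the two atom blocks in the tile

// Group all diagonal tiles together and all off-diagonal tiles together, so the
// two 32-thread halves of a 64-wide wavefront take the same branch.
void sortTilesForWideSimd(std::vector<Tile>& tiles) {
    std::stable_sort(tiles.begin(), tiles.end(), [](const Tile& a, const Tile& b) {
        bool aDiag = (a.x == a.y), bDiag = (b.x == b.y);
        return aDiag > bDiag;   // diagonal tiles first, off-diagonal tiles after
    });
}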
I don't know of any reason that would be affected.
The traces above seem to indicate an effect on the compute time of computeNonbonded depending on that selection. To avoid drawing false conclusions I will recheck.
Confirmed that it does make a difference. The first trace is with the standard findInteractingBlocks path for AMD, the second with the algorithm for the other platforms. The execution time of computeNonbonded gets significantly longer.
Correct. That's why we sort the tiles differently on GPUs with SIMD width 64, to put all the diagonal ones together and all the off diagonal ones together.
Thanks for the explanation. I had noticed that difference in the sort, but did not understand its meaning/consequences.
About the results from the test script: what concerns me is that from width 5.25 to width 5.5 the execution time drops by 0.2s, even though the system size increases. That difference comes just from these two kernels. Together with the effect above, that leaves a question mark for me.
Try modifying the line in the script that creates the context to instead be
context = Context(system, integrator, Platform.getPlatformByName('OpenCL'), {'DisablePmeStream':'true'})
After you do that, do you still see a difference in the speed of the nonbonded kernel?
The difference remains. I will try to include trace markers to find where it happens. Edit: Markers in roctracer seem to relate to host code, not kernel code. So that approach probably does not exist; I don't know how to extract the runtime of parts of kernel code, like the runtime of a loop.
I've been searching all day, but without success: is there any way to read the OpenCL device timestamp (or another timestamp/counter) directly on the device, so that it can easily be used to save timestamps reflecting the progress of a kernel (like one timestamp saved before a loop and another saved after the end of the loop)?
The only thing I find are events, including profiling events, but those do not seem to be an easy way to get timestamps during the progress through a kernel.
Edit: Searching for how to access S_MEMTIME.
Ok, it seems I managed that with some inline assembly. I will test the runtime of the two loops now.
Ok, here is the result from adding up the outputs of the in-kernel time measurements for the first and second loop of computeNonbonded. Interestingly, the difference in accumulated computeNonbonded time between the two sort variants, which was visible in the trace, is (almost) not replicated in this output. However, the difference in total time, where the smaller system takes much more time than the larger one, is reproduced. The difference stems from the second loop.
I have added the modified nonbonded.cl (renamed to .txt) for verification of what I have done/calculated.
Atoms/blocks | interactAMD_loop1 | interactAMD_loop2 | interactStd_loop1 | interactStd_loop2 |
---|---|---|---|---|
14190/444 | 58592 | 9843306 | 60444 | 10080088 |
16284/509 | 62540 | 6025212 | 62732 | 6189640 |
Could the longer time on the smaller system be related to efficiency effects of the "singlePeriodicCopy" code path selection?
Edit: It seems like that; on the first, smaller system singlePeriodicCopy is 0, while on the larger system, which runs faster, it is 1. So the fact that the lower runtime occurs right in the region where the number of atoms crosses the maximum number of threads on the Radeon VII was coincidence, and the latter is not the reason for the lower runtime. With that double influence it is hard to say how the speed of the simulation changes when the atom count jumps over the number of concurrent threads (the original question of this thread).
Side note: I just realized that the number of times the second loop is executed is 7 for the smaller system and 5 for the larger system. The 5 I understand, the 7 not (5 resulting as 32 * (16248 atoms / 32)^2 / (4*15360), with 32 being the tile size and 15360 the total number of SIMD lanes ???).
In nonbonded.cl there is this check, which probably should never trigger since it is handled at a higher level?
unsigned int numTiles = interactionCount[0];
if (numTiles > maxTiles)
    return; // There wasn't enough memory for the neighbor list.
If it did occur, interactions would not be calculated, right?
Try modifying the line in the script that creates the context to instead be
context = Context(system, integrator, Platform.getPlatformByName('OpenCL'), {'DisablePmeStream':'true'})
Will that actually do anything on AMD? PmeStream seems to be NVIDIA only.
usePmeQueue = (!cl.getPlatformData().disablePmeStream && isNvidia);
Would that version work and reduce thread divergence?
In findInteractingBlocks.cl:
/**
* Perform a parallel prefix sum over an array. The input values are all assumed to be 0 or 1.
*/
void prefixSum(__local short* sum, __local ushort2* temp) {
    for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
        temp[i].x = sum[i];
    barrier(CLK_LOCAL_MEM_FENCE);
    int whichBuffer = 0;
    for (int offset = 1; offset < BUFFER_SIZE; offset *= 2) {
#if 0   // original version (disabled): branch selected outside the loop
        if (whichBuffer == 0)
            for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
                temp[i].y = (i < offset ? temp[i].x : temp[i].x+temp[i-offset].x);
        else
            for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
                temp[i].x = (i < offset ? temp[i].y : temp[i].y+temp[i-offset].y);
        whichBuffer = 1-whichBuffer;
#endif
        // modified version: branch moved inside the loop so all work-items share one loop structure
        for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
            if (whichBuffer == 0)
                temp[i].y = (i < offset ? temp[i].x : temp[i].x+temp[i-offset].x);
            else
                temp[i].x = (i < offset ? temp[i].y : temp[i].y+temp[i-offset].y);
        whichBuffer = 1-whichBuffer;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
#if 0   // original final copy (disabled): branch selected outside the loop
    if (whichBuffer == 0)
        for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
            sum[i] = temp[i].x;
    else
        for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
            sum[i] = temp[i].y;
#endif
    // modified final copy: select the source buffer per element
    for (int i = get_local_id(0); i < BUFFER_SIZE; i += get_local_size(0))
        sum[i] = whichBuffer ? temp[i].y : temp[i].x;
    barrier(CLK_LOCAL_MEM_FENCE);
}
Edit: I benchmarked that, but I don't see improvements. Maybe the compiler already optimizes the standard version - or there is simply no improvement in the new code.
The question is whether the short times result from the fact that the kernel returns immediately
Correct.
Does the kernel have to be called at all in that case, or could the call overhead be eliminated? (Or does that value come from another kernel, so that the host does not know whether it has to be scheduled? However, I could not find it being used in any other place.)
Exactly. It first gets set to 0 in findBlockBounds(). If any atom has moved far enough to require the neighbor list to be rebuilt, it gets set to 1 in sortBoxData().
Could the longer time on the smaller system be related to efficiency effects of the "singlePeriodicCopy" code path selection?
Edit: It seems like that; on the first, smaller system singlePeriodicCopy is 0, while on the larger system, which runs faster, it is 1. So the fact that the lower runtime occurs right in the region where the number of atoms crosses the maximum number of threads on the Radeon VII was coincidence, and the latter is not the reason for the lower runtime. With that double influence it is hard to say how the speed of the simulation changes when the atom count jumps over the number of concurrent threads (the original question of this thread).
Side note: I just realized that the number of times the second loop is executed is 7 for the smaller system and 5 for the larger system. The 5 I understand, the 7 not (5 resulting as 32 * (16248 atoms / 32)^2 / (4*15360), with 32 being the tile size and 15360 the total number of SIMD lanes ???).
I see a similar sawtooth between 14514 and 15477 on my RX 460, which has 896 shaders. I also see one between 3762 and 4158. (Times are from the script, but with 600 steps instead of 3000 to compensate for having less than 1/5 of the FLOPS.)
I see a similar sawtooth between 14514 and 15477 on my RX 460, which has 896 shaders. I also see one between 3762 and 4158.
I think the first sawtooth might come from the sort changing from sortShortList to the general sort algorithm. For me that happened at a higher system size, but it also depends on the shader count. Did you apply the patch?
Actually, I now wonder whether the "max" in that code line is correct; shouldn't it be a min, so that local memory is not exceeded?
int maxShortList = max(maxLocalBuffer, (int) OpenCLContext::ThreadBlockSize*context.getNumThreadBlocks());
But should that decision also depend on which of the two sortShortList implementations is used further on? sortShortList2 depends on the maximum number of parallel threads, while sortShortList is limited to one CU anyway, so it depends only on the memory limit, not on the number of available parallel threads.
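What I have in mind is the same line with min instead of max, i.e. something like:

int maxShortList = min(maxLocalBuffer, (int) OpenCLContext::ThreadBlockSize*context.getNumThreadBlocks());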
The second sawtooth seems to come from the change of algorithm according to singlePeriodicCopy. I included the following debug output in nonbonded.cl (at the end of the second loop):
}
pos++;
if (get_global_id(0) == 0)
    printf("loopcnt %d, singlePeriodic %d\n", loopcnt, singlePeriodicCopy);
}
If you do that, what does it show?
BTW, the way I found to get timestamps at different places in the code is based on that piece of inline assembler:
if (get_global_id(0) == 0)
    __asm__ volatile ("s_memrealtime %0\n s_waitcnt lgkmcnt(0)" : "=s" (end_time) );
where end_time is a variable in the kernel: unsigned long end_time;
There also is another counter s_memtime, which is not clock compensated.
That's of course not portable; NVIDIA would probably have a problem with it. I don't know whether it also works on the RX 460.
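For reference, the complete pattern looks roughly like this (a sketch; the loop body is just a placeholder):

__kernel void timedLoop(__global float* data, __global ulong* elapsed, int n) {
    ulong start_time = 0, end_time = 0;
    if (get_global_id(0) == 0)
        __asm__ volatile ("s_memrealtime %0\n s_waitcnt lgkmcnt(0)" : "=s" (start_time) );
    for (int i = get_global_id(0); i < n; i += get_global_size(0))
        data[i] *= 2.0f;                      // placeholder work
    if (get_global_id(0) == 0) {
        __asm__ volatile ("s_memrealtime %0\n s_waitcnt lgkmcnt(0)" : "=s" (end_time) );
        elapsed[0] = end_time - start_time;   // raw ticks, not nanoseconds
    }
}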
Actually, I now wonder whether the "max" in that code line is correct; shouldn't it be a min, so that local memory is not exceeded?
That needs cleaning up a little. The max is correct, but it should only be done on NVIDIA. That's the maximum size to efficiently use the sortShortList2() kernel.
Actually, I now wonder whether the "max" in that code line is correct; shouldn't it be a min, so that local memory is not exceeded?
That needs cleaning up a little. The max is correct, but it should only be done on NVIDIA. That's the maximum size to efficiently use the sortShortList2() kernel.
But shouldn't it still be min, limiting the size either by memory or by SIMD lanes, whichever is lower? As it is, if the SIMD lane count is larger than memory allows, that would exceed the memory resources. Of course there is the try/catch later, which would probably fall back to the standard sort anyway if the memory cannot be allocated. But maybe I'm missing something.
sortShortList2() doesn't use local memory. If we're going to use that kernel, the amount of memory doesn't matter.
Ok, thanks, so the first term in the max() is actually not relevant for sortShortList2 and the second not relevant for sortShortList.
I know I have rather spammed this thread, but I would like to come back to this question, since if it is not caught in another place it could lead to wrong calculations depending on system size.
I just tried to run RUN11 (the big system with 448584 atoms) from the F@H benchmarks again on two GPUs in parallel, but I get NaN, while with one GPU it works:
Traceback (most recent call last):
File "run-core-xml.py", line 56, in <module>
integrator.step(nsteps)
File "/usr/local/lib/python3.7/dist-packages/simtk/openmm/openmm.py", line 7658, in step
return _openmm.LangevinIntegrator_step(self, steps)
simtk.openmm.OpenMMException: Particle coordinate is nan
Trying again I get:
Start simulation
Memory access fault by GPU node-1 (Agent handle: 0x1dae860) on address 0x7f2ea47b6000. Reason: Unknown.
Aborted
Here is the trace data for the case with the Memory access fault: results.txt
Not important for me, since I'm mostly looking at the use in F@H, which does not use multiple GPUs for one work unit, but I still wanted to mention it. If you need more tracing, please let me know.
If it did occur, interactions would not be calculated, right?
Correct. That indicates we haven't allocated enough memory on the GPU to hold the neighbor list. In that case we detect the problem on the host side and allocate larger arrays.
We then set a couple of flags:
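(The linked lines are not reproduced here; as a paraphrase with illustrative names, not the verbatim source, they amount to something like this:)

// Illustrative paraphrase only: names and structure are mine, not the actual OpenMM code.
struct NonbondedState {
    bool forceRebuildNeighborList = false;  // flag 1: rebuild the neighbor list on the next attempt
    bool forcesValid = true;                // flag 2: whether the computed forces can be trusted
};

void onNeighborListOverflow(NonbondedState& state) {
    state.forceRebuildNeighborList = true;  // the next attempt must rebuild the neighbor list
    state.forcesValid = false;              // tell the Context the force computation must be retried
}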
The first one ensures the neighbor list will be rebuilt on the next attempt, and the second one informs the Context that something went wrong in computing the forces and it should try again.
Ok, thanks, as usual you have considered everything! I just mentioned it since this is the kind of problem that is easy to overlook, a quick fix where it hurts ... (speaking from my own experience) - so I'm used to bringing up such things when I see them.
Still, it would be good to have feedback for such errors out of the kernel, instead of just not calculating. But that is not so easy due to the asynchronous nature of kernel dispatches. Maybe some error report structure in memory (just as a proposal for error handling in the future).
Absolutely! It actually took a while to arrive at this solution. Originally, if there wasn't enough memory for the neighbor list, it would go ahead and compute the forces with a less efficient algorithm that didn't use the neighbor list (and also set a flag to tell us to allocate more memory for the next step). This could lead to taking multiple seconds to execute that one kernel. On Linux that was fine, but on Windows the OS would kill the process.
Still, it would be good to have feedback for such errors out of the kernel
This isn't really an error. At the start of a simulation we have little idea how much memory the neighbor list will take up. So we just make a guess, and if the guess happens to be too low we'll discover that right away and allocate more memory. This is part of the normal functioning of the kernel.
Over the last while I have been trying to analyze the low performance of AMD GPUs (GCN) on small projects, trying to find improvements. Now I'm considering a possible reason for the observed low performance on simulations with small atom counts. That observation - if confirmed - could help project owners systematically decide which projects will have reasonable performance on AMD GCN-type GPUs and which will not, and thus help decide which projects to assign to which GPUs.
AMD GCN has a wavefront size of 64 threads; however, these 64 threads are not executed on a full compute unit (CU) in parallel, but on one SIMD unit (out of 4 in a CU), serialized into four consecutive 16-thread parts, thus taking 4 cycles per wavefront. Each CU has 4 SIMD units, and the Radeon VII (my case) has 60 CUs. As a conclusion, to occupy the GPU completely, not 3840 threads are required, as might seem obvious (wavefront size 64 * 60 CUs), but 15360 threads (wavefront size 64 * 4 SIMDs/CU * 60 CUs), executed over 4 cycles. That construction makes the architecture very wide, wider than even the NVIDIA RTX 3090, but it executes that wide thread count not in one but in four cycles (effectively 3840 threads per cycle).
If my conclusion is correct, that would explain why AMD GCN devices are much more sensitive to small projects than even the widest NVIDIA GPUs. I have to admit that I don't really know how NVIDIA handles its warps, but as far as I understand, they are executed in parallel (not partially serialized as on AMD GCN).
If we correlate the above thoughts with an excerpt of the kernel call characteristics for a sample project 16921(79,24,68), we see that many kernels of that project run with 21016 threads (the project is specified with 21000 atoms in the project list). Assuming that different kernels are not executed in parallel, that means the first 15360 threads will fully occupy the GPU, but the remaining 5656 threads will use only about 1/3 of the capacity.
Other projects, like RUN9 of the benchmark projects (with 4071 atoms), will load the Radeon VII at only about 25% for many of the important kernels.
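As a back-of-the-envelope check of those two numbers (my own sketch, nothing from OpenMM):

#include <cstdio>

// Fraction of the GPU's thread slots that are busy during the last "wave" of a kernel launch.
double lastWaveOccupancy(int numThreads, int effectiveWidth = 15360) {   // 15360 = Radeon VII (gfx906)
    int remainder = numThreads % effectiveWidth;
    return remainder == 0 ? 1.0 : (double) remainder / effectiveWidth;
}

int main() {
    std::printf("%.2f\n", lastWaveOccupancy(21016));   // ~0.37: the 5656-thread tail uses about 1/3 of the GPU
    std::printf("%.2f\n", lastWaveOccupancy(4071));    // ~0.27: RUN9 loads the GPU at roughly a quarter
    return 0;
}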
By contrast, in large projects the partially occupied tail makes up only a small share of the total work; that is where the generally good performance of GCN devices is observed.
As a conclusion, projects with slightly fewer atoms than x * the effective thread width (15360 for the Radeon VII, gfx906) should run well, while projects slightly above will run rather poorly (especially if that is not masked by a very high atom count). So a project with 15000 atoms will probably run well, a project with 16000 atoms rather poorly. For optimizing GPU use, that relation should probably be considered in addition to a general classification of a project to a group of GPUs.
Critical cross-checking of the above thoughts is welcome!