LonelyCat124 opened 4 years ago
Looks like (from the profiler) the Legion implementation is buggy and that is causing the issue :)
Ok - this may be (partly) an issue with the input file.
Ok, I fixed up the code and generated a new input file with better-separated particles (instead of the insane previous version, where all particles were in cells (0,0,0), (1,1,1) etc.).
New results coming; unfortunately it appears Legion isn't doing as well as I'd hoped, possibly even worse:

| version | run1 | run2 | run3 |
|---|---|---|---|
| OpenMP task (1 thread) | 0.18s | 0.17s | 0.17s |
| OpenMP task (4 threads) | 0.056s | 0.058s | 0.056s |
| Legion ("serial") | 509.66s | 973.21s | N/A |
Tomorrow I want to check that the number of interactions computed for a specific cell is comparable between the Legion version and the OpenMP version. It should be, but it's definitely possible my Legion code has a bug causing the massive performance loss, or the OpenMP version has a bug causing it to be way too fast.
But since the particles are randomly distributed, and there are 100k in 1000 cells (so ~100 per cell), I'd lean towards the OpenMP performance being accurate.
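As a rough sanity check on that (my arithmetic, not a measurement): ~100 particles per cell gives about 100·99/2 ≈ 5,000 self-interactions per cell, so on the order of 5 million pair evaluations per step across the 1000 cells, which seems plausible for the OpenMP timings above.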
None of the Legion debugging tools found any issues.
Checking in the self task, the cells are always equal when doing an interaction, which is expected. Checking cell counts now, but they look about right.
Only checked a single cell so far, but the interaction count is correct (!), meaning that Legion is actually just 2,500-10,000x slower so far.
Looked at a random selection and the interaction counts look correct, so the code is not bugged, just slow.
So, I tried turning off the self_task to see if that makes much difference:
OpenMP version (1 thread): 0.003s
Legion version: 400s or more.
Trying to profile to work out what's going on.
Ok, so the partition being aliased is KILLING the performance, though it's not great even with just an equal partition (which gives wrong results, but that's not the point for now).
With an equal partition: 9.17s
With the partition by cell: 655.479527s
One solution may be to cap cells at MAX_PARTS_IN_A_CELL (which is computed in the real code anyway) + PADDING, then multiply that number by N_CELLS to get the region size. Once we have that, just do an equal partition and copy the data into the cells based on that layout (see the sketch below). It's probably ~50-100x faster than the non-equal partition.
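Something like the following, assuming the standard Legion C++ runtime calls (MAX_PARTS_IN_A_CELL, PADDING and N_CELLS are the quantities mentioned above; everything else here is a placeholder sketch, not the actual code):

```c++
#include "legion.h"
using namespace Legion;

// Assumed to be computed elsewhere in the real code (from the input file).
extern long long MAX_PARTS_IN_A_CELL, PADDING, N_CELLS;

// Pad every cell to a fixed number of particle slots so that cell boundaries
// line up exactly with an equal partition; the partition is then disjoint and
// trivially computed, instead of the expensive aliased partition-by-cell.
IndexPartition create_padded_cell_partition(Context ctx, Runtime *runtime,
                                            FieldSpace particle_fs,
                                            LogicalRegion &parts_lr) {
  const long long slots_per_cell = MAX_PARTS_IN_A_CELL + PADDING;
  const long long total_slots = slots_per_cell * N_CELLS;

  // One slot per (possibly empty) particle entry, padded per cell.
  IndexSpace is =
      runtime->create_index_space(ctx, Rect<1>(0, total_slots - 1));
  parts_lr = runtime->create_logical_region(ctx, is, particle_fs);

  // Equal partition over N_CELLS colours: each colour gets exactly
  // slots_per_cell contiguous slots, i.e. one (padded) cell.
  IndexSpace colors =
      runtime->create_index_space(ctx, Rect<1>(0, N_CELLS - 1));
  return runtime->create_equal_partition(ctx, is, colors);
  // Particle data is then copied/scattered into its cell's slot range.
}
```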
The alternative is that the mapper handles the non-equal partitions badly, but I'd need to ask the Legion guys whether that's the case.
Moved to a scafellpike interactive node, using 24 threads (and 4 extra util threads for Legion).
- OpenMP (gcc9): 0.018s vs 0.18s on 1 thread (10x speedup)
- Legion (equal partition, wrong results): 3.98s vs 3.81s on 1 thread
- Legion (correct partition): 48.25s vs 44.73s on 1 thread (so no parallelism?!)
To be fairer in case it's mostly a mapping issue, I included the cell construction (but not file reading): OpenMP: 1.185s parallel (mapping is serial) vs 1.350s on 1 thread.
I also added a fence before the timing in the Legion version, but that didn't seem to affect the timings.
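For reference, the fence-plus-timing pattern looks roughly like this (a sketch assuming the standard Legion timing API, not the exact code from the repo):

```c++
#include "legion.h"
using namespace Legion;

// Returns elapsed wall-clock seconds for whatever is launched in the middle,
// using execution fences so the timestamps wait for outstanding tasks.
double timed_section(Context ctx, Runtime *runtime) {
  Future pre = runtime->issue_execution_fence(ctx);
  Future t_start = runtime->get_current_time(ctx, pre);

  // ... launch the self/timestep tasks being measured here ...

  Future post = runtime->issue_execution_fence(ctx);
  Future t_stop = runtime->get_current_time(ctx, post);

  // get_current_time futures hold seconds as a double.
  return t_stop.get_result<double>() - t_start.get_result<double>();
}
```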
Adding -g and running through VTune now.
So from a bit of analysis:
-------------------------
Meta-Task Statistics
-------------------------
Defer Task Perform Mapping
Total Invocations: 3000
Total Time: 998051440 us
Average Time: 332683.81 us
Maximum Time: 7988161 us (11085.836 sig)
Minimum Time: 11 us (-481.740 sig)
Defer Task Launch
Total Invocations: 2000
Total Time: 187984826 us
Average Time: 93992.41 us
Maximum Time: 1191883 us (4302.202 sig)
Minimum Time: 56 us (-368.096 sig)
This is the majority of the runtime; the tasks themselves are fine:
Self task [self task]
Total Invocations: 1000
Total Time: 1922659 us
Average Time: 1922.66 us
Maximum Time: 3187 us (75.660 sig)
Minimum Time: 949 us (-58.219 sig)
Timestep task [timestep task]
Total Invocations: 1000
Total Time: 108830 us
Average Time: 108.83 us
Maximum Time: 814 us (147.933 sig)
Minimum Time: 68 us (-8.531 sig)
Init task [init task]
Total Invocations: 1
Total Time: 100296 us
Average Time: 100296.72 us
Maximum Time: 100296 us (0.000 sig)
Minimum Time: 100296 us (0.000 sig)
1.9s in the self_task - longer than the OpenMP code, but low enough that I'd be happy with it.
Might be worth getting these statistics for a RegentParticleDSL run too when I can.
Trying to view a trace to see what's happening with the test code at the moment.
Ok, so util threads don't appear to always be a benefit.
| Util threads | run1 | run2 | run3 |
|-----|------|------|------|
| no args | 29.71s | 22.54s | 22.09s |
| 1 | 30.31s | 36.19s | 29.02s |
| 2 | 64.98s | 64.28s | 71.87s |
| 3 | 47.36s | 47.92s | 58.12s |
| 4 | 42.18s | 42.32s | 49.21s |
With no util args and 24 CPU threads the runtime is 32s, so parallelism is not a clear winner either way.
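(For reference, these runs just vary Realm's processor flags on the command line: the util-thread count is set with `-ll:util N` and the worker-thread count with `-ll:cpu N`; the "no args" row passes no `-ll:` flags at all.)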
Ok - so the first step is super expensive for Legion, so I'll run one step for Legion first and then time 10 iterations after that.
I also changed to ATOMIC coherence mode, with a barrier at the end of each step.
The OpenMP code still includes the initial "map" and isn't using mutexinoutset.
| version | run1 | run2 | run3 |
|---|---|---|---|
| OpenMP task (1 thread) | 2.17s | 2.19s | 2.19s |
| OpenMP task (24 threads) | 1.32s | 1.29s | 1.31s |
| Legion ("serial") | 28.83s | 28.13s | 28.54s |
| Legion (24 threads) | 27.98s | 27.78s | 27.91s |
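For reference, the ATOMIC coherence change on the Legion side looks roughly like this (a sketch with placeholder task/field IDs, assuming the standard launcher API, not the actual code):

```c++
#include "legion.h"
using namespace Legion;

enum { SELF_TASK_ID = 1 };   // placeholder task ID
enum { FID_PARTS = 0 };      // placeholder field ID

void launch_self_task(Context ctx, Runtime *runtime,
                      LogicalRegion cell_lr, LogicalRegion parent_lr) {
  TaskLauncher launcher(SELF_TASK_ID, TaskArgument(NULL, 0));
  // ATOMIC instead of the default EXCLUSIVE coherence: accesses to the cell
  // region are still serialised, but the runtime may choose the order rather
  // than preserving program order.
  launcher.add_region_requirement(
      RegionRequirement(cell_lr, READ_WRITE, ATOMIC, parent_lr));
  launcher.add_field(0 /*requirement index*/, FID_PARTS);
  runtime->execute_task(ctx, launcher);

  // The per-step "barrier" is then just an execution fence after all the
  // launches for that step:
  //   runtime->issue_execution_fence(ctx);
}
```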
Also tested with Legion's control replication branch - I don't know if this affects the runtime or just Regent. No difference.
These are currently only run on my laptop, so accuracy is +/- a lot.
This is for the easy version (no pair cell tasks, only a single timestep task and a single self task on each cell).
This comparison may be slightly unfair as the serial code has no overheads. The OpenMP task implementation is super simple, and this is all with gcc 7.4, so kinda old.