LonelyCat124 opened 4 years ago
Looks like (from the profiler) the Legion implementation is buggy and that is causing the issue :)
Ok - this may be (partly) an issue with the input file.
Ok, I fixed up the code and generated a new input file with better-separated particles (instead of the insane previous version, where all particles were in cells (0,0,0), (1,1,1) etc.).
New results coming; unfortunately it appears Legion isn't doing as well as I'd hoped, possibly even worse:

| version | run1 | run2 | run3 |
|---|---|---|---|
| OpenMP task (1 thread) | 0.18s | 0.17s | 0.17s |
| OpenMP task (4 threads) | 0.056s | 0.058s | 0.056s |
| Legion ("serial") | 509.66s | 973.21s | N/A |
Tomorrow I want to check that the number of interactions computed for a specific cell is comparable between the Legion version and the OpenMP version. It should be, but it's definitely possible my Legion code has a bug causing the massive performance loss, or the OpenMP version has a bug causing it to be way too fast.
But since the particles are randomly distributed, and there are 100k in 1000 cells (so ~100 per cell), I'd lean towards the OpenMP performance being accurate.
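As a rough sanity check on that (my arithmetic, not a measurement): ~100 particles per cell gives about 100·99/2 ≈ 5,000 self-interactions per cell, so on the order of 5 million pair evaluations per step across the 1000 cells, which seems plausible for the OpenMP timings above.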
None of the Legion debugging tools found any issues.
Checking in the self task, the cells are always equal when doing an interaction, which is expected. Checking cell counts now, but they look about right.
Only checked a single cell so far, but the interaction count is correct (!), meaning that Legion is actually just 2,500-10,000x slower so far.
Looked at a random selection and the interaction counts look correct, so the code is not bugged, just slow.
So, I tried turning off the self_task to see if that makes much difference:
OpenMP version (1 thread): 0.003s
Legion version: 400s or more.
Trying to profile to work out what's going on.
Ok, so the partition being aliased is KILLING the performance, though it's not great even with just an equal partition (which gives wrong results, but that's not the point for now).
With an equal partition: 9.17s
With the partition by cell: 655.479527s
One solution may be to cap cells at MAX_PARTS_IN_A_CELL (which is computed in the real code anyway) + PADDING, then multiply that number by N_CELLS to get the region size. Once we have that, just do an equal partition and copy the data into the cells based on that layout (see the sketch below). It's probably ~50-100x faster than the non-equal partition.
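Something like the following, assuming the standard Legion C++ runtime calls (MAX_PARTS_IN_A_CELL, PADDING and N_CELLS are the quantities mentioned above; everything else here is a placeholder sketch, not the actual code):

```c++
#include "legion.h"
using namespace Legion;

// Assumed to be computed elsewhere in the real code (from the input file).
extern long long MAX_PARTS_IN_A_CELL, PADDING, N_CELLS;

// Pad every cell to a fixed number of particle slots so that cell boundaries
// line up exactly with an equal partition; the partition is then disjoint and
// trivially computed, instead of the expensive aliased partition-by-cell.
IndexPartition create_padded_cell_partition(Context ctx, Runtime *runtime,
                                            FieldSpace particle_fs,
                                            LogicalRegion &parts_lr) {
  const long long slots_per_cell = MAX_PARTS_IN_A_CELL + PADDING;
  const long long total_slots = slots_per_cell * N_CELLS;

  // One slot per (possibly empty) particle entry, padded per cell.
  IndexSpace is =
      runtime->create_index_space(ctx, Rect<1>(0, total_slots - 1));
  parts_lr = runtime->create_logical_region(ctx, is, particle_fs);

  // Equal partition over N_CELLS colours: each colour gets exactly
  // slots_per_cell contiguous slots, i.e. one (padded) cell.
  IndexSpace colors =
      runtime->create_index_space(ctx, Rect<1>(0, N_CELLS - 1));
  return runtime->create_equal_partition(ctx, is, colors);
  // Particle data is then copied/scattered into its cell's slot range.
}
```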
The alternative is that the mapper handles the non-equal partitions badly, but I'd need to ask the Legion guys whether that's the case.
Moved to a scafellpike interactive node, using 24 threads (and 4 extra util threads for Legion).
- OpenMP (gcc9): 0.018s vs 0.18s on 1 thread (10x speedup)
- Legion (equal partition, wrong results): 3.98s vs 3.81s on 1 thread
- Legion (correct partition): 48.25s vs 44.73s on 1 thread (so no parallelism?!)
To be fairer in case it's mostly a mapping issue, I included the cell construction (but not file reading): OpenMP: 1.185s parallel (mapping is serial) vs 1.350s on 1 thread.
I also added a fence before the timing in the Legion version, but that didn't seem to affect the timings.
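For reference, the fence-plus-timing pattern looks roughly like this (a sketch assuming the standard Legion timing API, not the exact code from the repo):

```c++
#include "legion.h"
using namespace Legion;

// Returns elapsed wall-clock seconds for whatever is launched in the middle,
// using execution fences so the timestamps wait for outstanding tasks.
double timed_section(Context ctx, Runtime *runtime) {
  Future pre = runtime->issue_execution_fence(ctx);
  Future t_start = runtime->get_current_time(ctx, pre);

  // ... launch the self/timestep tasks being measured here ...

  Future post = runtime->issue_execution_fence(ctx);
  Future t_stop = runtime->get_current_time(ctx, post);

  // get_current_time futures hold seconds as a double.
  return t_stop.get_result<double>() - t_start.get_result<double>();
}
```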
Adding -g and running through VTune now.
So from a bit of analysis:
-------------------------
Meta-Task Statistics
-------------------------
Defer Task Perform Mapping
Total Invocations: 3000
Total Time: 998051440 us
Average Time: 332683.81 us
Maximum Time: 7988161 us (11085.836 sig)
Minimum Time: 11 us (-481.740 sig)
Defer Task Launch
Total Invocations: 2000
Total Time: 187984826 us
Average Time: 93992.41 us
Maximum Time: 1191883 us (4302.202 sig)
Minimum Time: 56 us (-368.096 sig)
This is the majority of the runtime; the tasks themselves are fine:
Self task [self task]
Total Invocations: 1000
Total Time: 1922659 us
Average Time: 1922.66 us
Maximum Time: 3187 us (75.660 sig)
Minimum Time: 949 us (-58.219 sig)
Timestep task [timestep task]
Total Invocations: 1000
Total Time: 108830 us
Average Time: 108.83 us
Maximum Time: 814 us (147.933 sig)
Minimum Time: 68 us (-8.531 sig)
Init task [init task]
Total Invocations: 1
Total Time: 100296 us
Average Time: 100296.72 us
Maximum Time: 100296 us (0.000 sig)
Minimum Time: 100296 us (0.000 sig)
1.9s in the self_task - longer than the OpenMP code, but low enough that I'd be happy with it.
Might be worth getting these statistics for a RegentParticleDSL run too when I can.
Trying to view a trace to see what's happening with the test code at the moment.
Ok, so util threads don't appear to always be a benefit.
| Util threads | run1 | run2 | run3 |
|-----|------|------|------|
| no args | 29.71s | 22.54s | 22.09s |
| 1 | 30.31s | 36.19s | 29.02s |
| 2 | 64.98s | 64.28s | 71.87s |
| 3 | 47.36s | 47.92s | 58.12s |
| 4 | 42.18s | 42.32s | 49.21s |
With no util args and 24 CPU threads the runtime is 32s, so parallelism is not a clear winner either way.
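(For reference, these runs just vary Realm's processor flags on the command line: the util-thread count is set with `-ll:util N` and the worker-thread count with `-ll:cpu N`; the "no args" row passes no `-ll:` flags at all.)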
Ok - so the first step is super expensive for Legion, so I'll run one step for Legion first and then time 10 iterations after that.
I also changed to ATOMIC coherence mode, with a barrier at the end of each step.
The OpenMP code still includes the initial "map" and isn't using mutexinoutset.
| version | run1 | run2 | run3 |
|---|---|---|---|
| OpenMP task (1 thread) | 2.17s | 2.19s | 2.19s |
| OpenMP task (24 threads) | 1.32s | 1.29s | 1.31s |
| Legion ("serial") | 28.83s | 28.13s | 28.54s |
| Legion (24 threads) | 27.98s | 27.78s | 27.91s |
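For reference, the ATOMIC coherence change on the Legion side looks roughly like this (a sketch with placeholder task/field IDs, assuming the standard launcher API, not the actual code):

```c++
#include "legion.h"
using namespace Legion;

enum { SELF_TASK_ID = 1 };   // placeholder task ID
enum { FID_PARTS = 0 };      // placeholder field ID

void launch_self_task(Context ctx, Runtime *runtime,
                      LogicalRegion cell_lr, LogicalRegion parent_lr) {
  TaskLauncher launcher(SELF_TASK_ID, TaskArgument(NULL, 0));
  // ATOMIC instead of the default EXCLUSIVE coherence: accesses to the cell
  // region are still serialised, but the runtime may choose the order rather
  // than preserving program order.
  launcher.add_region_requirement(
      RegionRequirement(cell_lr, READ_WRITE, ATOMIC, parent_lr));
  launcher.add_field(0 /*requirement index*/, FID_PARTS);
  runtime->execute_task(ctx, launcher);

  // The per-step "barrier" is then just an execution fence after all the
  // launches for that step:
  //   runtime->issue_execution_fence(ctx);
}
```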
Also tested with Legion's control replication branch - I don't know if this affects the runtime or just Regent. No difference.
These are currently only run on my laptop, so accuracy is +/- a lot.
This is for the easy version (no pair cell tasks, only a single timestep task and a single self task on each cell).
This comparison may be slightly unfair as the serial code has no overheads. The OpenMP task implementation is super simple, and this is all with gcc 7.4, so kinda old.