StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Reductions cause a deadlock when using Legion with CUDA #1691

Closed: tukss closed this issue 2 weeks ago

tukss commented 1 month ago

One of our codes using FleCSI with the Legion backend consistently runs into a deadlock when performing a reduction on a field. The code gets stuck trying to get the result of the reduction, without the reduction task itself ever actually being launched. It only occurs when at least two ranks are involved, a ghost copy has occurred some time before, and the tasks are launched on a GPU with CUDA. This happens on master and the current 24.03.0 release.

Unfortunately I can't share a full reproducer, but internally we were able to narrow it down to merge commit dde5f0c30cbaaf67fa47b32364f5976c6f1d1aa7. We were using the control replication branch before that was merged.

In particular we found that changing supports_scatter_gather_path in runtime/realm/cuda/cuda_internal.cc to always return false makes the problem consistently go away.
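
For reference, the change we tested is conceptually just making that function always answer "no"; a rough sketch of the idea (the real function in cuda_internal.cc takes Realm-internal arguments and sits in more context than shown here):

// Rough sketch only, not the actual code: the real supports_scatter_gather_path()
// in runtime/realm/cuda/cuda_internal.cc has Realm-internal parameters, elided here.
// The workaround simply makes it report that the fused CUDA scatter/gather path is
// unavailable, so Realm falls back to its generic copy path.
static bool supports_scatter_gather_path(/* original parameters elided */)
{
  return false;  // unconditionally disable the CUDA scatter/gather path
}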

I'm happy to try out any fixes you come up with.

lightsighter commented 1 month ago

To be clear, given the evidence I think this is a hang and not a deadlock. One of the Realm GPU reduction copies is not making forward progress.

@tukss: Can you do a run with -level dma=1 and upload the logs from a run that hangs? Independently, can you also try a run that sets the flag -gex:obcount using the formula provided in the second bullet of this comment and see if it still hangs? https://github.com/StanfordLegion/legion/issues/1508#issuecomment-1633273682

@artempriakhin The changes in the bad commit from the bisection look like yours. Can you take a look at this?

apryakhin commented 1 month ago

Sure, taking a look

apryakhin commented 1 month ago

@tukss along with what @lightsighter has already asked could you please provide logs from the following: -level gpu=1,gpudma=1,dma=1,xplan=1,xpath=1?

Realm supports neither scatter/gather-reduce nor GPU scatter/gather-reduce at the moment, so supports_scatter_gather_path should return false independently of the change if we are doing GPU reductions now. Unless we are doing GPU reductions AND GPU scatter/gather as separate operations, and somehow the latter, when done together, results in this.

apryakhin commented 1 month ago

@tukss Is there a way you can run "the code" with the GPU reduction artificially disabled and just the ghost copy left in place? I want to make sure that the ghost copy (which is likely a scatter/gather) itself completes without any issues. If we could isolate it, that would simplify things to begin with.

tukss commented 1 month ago

@tukss Is there a way you can run "the code" with the GPU reduction artificially disabled and just the ghost copy left in place? I want to make sure that the ghost copy (which is likely a scatter/gather) itself completes without any issues. If we could isolate it, that would simplify things to begin with.

It still hangs if I just move the reduction from GPU to CPU and keep the rest the same (i.e., on the GPU). Disabling the reduction entirely makes the code run.

tukss commented 1 month ago

@tukss along with what @lightsighter has already asked could you please provide logs from the following: -level gpu=1,gpudma=1,dma=1,xplan=1,xpath=1?

I've run the program with 2 MPI tasks and passed these options through REALM_DEFAULT_ARGS. This is the full output.

log.txt

tukss commented 1 month ago

Independently, can you also try a run that sets the flag -gex:obcount using the formula provided in the second bullet of this comment and see if it still hangs?

Setting -gex:obcount 10 (1 GPU, 2 MPI ranks) results in the same hang.

tukss commented 1 month ago

We also found that this only happens when two MPI ranks are sharing one GPU. Scheduling to two separate GPUs (one per rank) makes the hang go away.

lightsighter commented 1 month ago

@artempriakhin Just a clue, there are 2 DMA requests that have started but have not completed (perhaps it is a deadlock after all, but inside Realm and not at the Legion level):

mebauer@flecsi % grep -rI 'dma request' log.txt | grep 'started' | wc
     850   11050  122715
mebauer@flecsi % grep -rI 'dma request' log.txt | grep 'completed' | wc
     848   11024  124105

Looks like the two bad requests are:

mebauer@flecsi % grep '0x14a21b4396d0' log.txt                                                         
[0 - 14a304061000]   18.175319 {2}{dma}: dma request 0x14a21b4396d0 created - plan=0x14a21b439210 before=8000002001b0034f after=8000002003f00003
[0 - 14a2fcb01000]   18.196459 {2}{dma}: dma request 0x14a21b4396d0 ready - plan=0x14a21b439210 before=8000002001b0034f after=8000002003f00003
[0 - 14a2fcc13000]   18.196517 {2}{dma}: dma request 0x14a21b4396d0 started - plan=0x14a21b439210 before=8000002001b0034f after=8000002003f00003

and

mebauer@flecsi % grep '0x15224b19fe40' log.txt
[1 - 15233415e000]   18.175481 {2}{dma}: dma request 0x15224b19fe40 created - plan=0x15224b0d13f0 before=80008010031002d7 after=8000801000f00007
[1 - 152334382000]   18.196504 {2}{dma}: dma request 0x15224b19fe40 ready - plan=0x15224b0d13f0 before=80008010031002d7 after=8000801000f00007
[1 - 152334270000]   18.196530 {2}{dma}: dma request 0x15224b19fe40 started - plan=0x15224b0d13f0 before=80008010031002d7 after=8000801000f00007

apryakhin commented 1 month ago

Just a clue, there are 2 DMA requests that have started but have not completed (perhaps it is a deadlock after all, but inside Realm and not at the Legion level)

These are both GPU gather copies.

apryakhin commented 1 month ago

We also found that this only happens when two MPI ranks are sharing one GPU

That's something I could start with.

So the hang isn't coming from the GPU reductions but from the GPU scatter/gather operations. The GPU reductions likely just affect the timing in a way that shakes the bug loose.

This definitely is a Realm bug and most likely a GPU scatter/gather bug.

The easiest thing would probably be for me to set up a reproducer.

apryakhin commented 1 month ago

I am working on a reproducer right now by reconstructing the operation from the logs. One theory is that we handle staging buffers incorrectly, so some of the remote requests just never arrive. That's just a theory though. I will probably have an answer and a fix by the end of this week.

jpietarilagraham commented 1 month ago

I can confirm that setting supports_scatter_gather_path to false also fixes my hang/deadlock for a medium-size mesh (373k cells) running 3 GPUs on 3 ranks. It is quite a bit slower than with 2 or 4, but it does complete. At 1M cells, I get stuck again.

apryakhin commented 1 month ago

There is a bug in the size alignment of the staging buffer. Essentially, the GPU gather first moves the indirection through an IB (intermediate buffer) and then stages the results of the actual gather through another IB. The size of the second IB isn't aligned with the field_size=24, which leads to undercounting of the actual data points, so the GPU gather never completes.
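
To make the arithmetic concrete, here is a toy sketch of the mismatch (illustration only, not the Realm code; the IB chunk size below is made up, only field_size=24 is taken from this run):

// Toy arithmetic only, not Realm code.  field_size = 24 comes from the run
// above; the IB (intermediate buffer) chunk size is made up.
#include <cstddef>
#include <cstdio>

int main()
{
  const std::size_t field_size = 24;     // bytes per data point
  const std::size_t ib_bytes   = 65536;  // hypothetical IB chunk, not a multiple of 24

  const std::size_t whole_elems = ib_bytes / field_size;               // 2730 whole data points
  const std::size_t leftover    = ib_bytes - whole_elems * field_size; // 16 stray bytes per chunk

  // If element-based accounting credits only whole data points per chunk while
  // byte-based accounting consumes the full chunk, the two totals drift apart
  // by `leftover` bytes on every chunk, so the element count never reaches the
  // expected total and the gather never reports completion.
  std::printf("%zu-byte chunk holds %zu whole elements, %zu bytes left over\n",
              ib_bytes, whole_elems, leftover);

  // Rounding the chunk size down to a multiple of field_size removes the drift.
  const std::size_t ib_aligned = whole_elems * field_size;  // 65520 bytes
  std::printf("aligned chunk: %zu bytes = exactly %zu elements\n",
              ib_aligned, ib_aligned / field_size);
  return 0;
}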

The fix is in a local branch already. I am going to sit on it for a short while to make sure that it's the right fix and that there is a proper test for it.

tukss commented 1 month ago

Thanks. That is amazing. I'm happy to test it out on our problem.

apryakhin commented 1 month ago

@tukss Please treat this as a quick prototype to test out a theory, not a production-grade patch yet.

Can you please cherry-pick the changes from this PR and confirm both that it fixes the hang and that you are getting good data?

It's likely that we will change how we fix it, but I just want to make sure the initial theory about your hang is correct.

tukss commented 1 month ago

I can confirm that your patch applied on top of Legion 24.03.0 makes the difference. The hang goes away when it's applied. I'll run our full test suite tomorrow to make sure it's also giving the right data.

tukss commented 1 month ago

While the small reproducer worked fine, the full code now crashes with this error:

[1 - 14a3a2c13000]   10.544269 {6}{gpu}: CUDA error reported on GPU 0: an illegal memory access was encountered (CUDA_ERROR_ILLEGAL_ADDRESS)
[...] [...]/legion/runtime/realm/cuda/cuda_module.cc:401: bool Realm::Cuda::GPUStream::reap_events(TimeLimit): Assertion `0' failed.

Sometimes I get this error instead:

[3 - 1503f47e1000]   11.680272 {6}{dma}: no piece found for <4331796996535293445,4319991934325413831>

Any ideas? This happens after one successful iteration in our code, so not immediately. This only happens if I run with more than one MPI task and all the ranks are bound to the same GPU.

apryakhin commented 1 month ago

@tukss Could you please give me the logs up to the point of the crash? -level gpu=1,gpudma=1,dma=1,xplan=1,xpath=1

tukss commented 1 month ago

Sure. Here are the logs @apryakhin.

crashlog.txt

apryakhin commented 1 month ago

Just an update: I am looking at this now and will post again as soon as I have it solved, possibly with another patch for a test run.

apryakhin commented 1 month ago

@tukss I just pushed another commit to the same branch/patch. Would you be able to give it another go, please, both on the smaller reproducer and the full run? That should address the first failure.

As for the second one - let's see if we still hit this.

tukss commented 1 month ago

@apryakhin This seems to fix the problems. Both the small reproducer and the full code run fine applying both commits from your PR. Thank you!

apryakhin commented 1 month ago

@tukss Thanks for verifying it. The patch will have to go through peer review (hopefully this week) and I will make sure we get it merged for 24.06.

apryakhin commented 3 weeks ago

@tukss Can you please do another run on the following branch?

It's a better patch, and it passes the tests I wrote that are supposed to replicate the type of operations you are running. However, since I can't run your exact tests directly, I just want to make sure it fixes everything for you.

tukss commented 3 weeks ago

@apryakhin I've run the apriakhin/gather-align-ib branch of Legion and everything still works fine. Thank you!

apryakhin commented 3 weeks ago

@tukss Thanks for being proactive. A large commit landed on master yesterday which touches some of the code paths related to the issue you are seeing. That commit doesn't include our patch, only other changes. Could you please re-run the tests one more time off a clean master build and check whether the issue still persists? That would probably be the last test run we need.

tukss commented 2 weeks ago

I've just tested with current master (71d9d2541283b3c660a2dd1f162fd2bf607db965). All our test cases are passing with this version and there are no hangs.

apryakhin commented 2 weeks ago

Thanks! I think we should just close this issue then.

tukss commented 2 weeks ago

Agreed. Thanks for the help!