StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Random "poisoned deferred unlock skipped" #1631

Open mariodirenzo opened 5 months ago

mariodirenzo commented 5 months ago

I am randomly seeing the following message on multi-node executions of HTR:

[1 - 2000575df8b0]  542.495878 {4}{poison}: poisoned deferred unlock skipped - POSSIBLE HANG - lock=1f00010000000048

After I see this message the execution hangs. How can I debug this issue?

@elliottslaughter Can you please add this issue to #1032 ?

elliottslaughter commented 5 months ago

Maybe some Realm people can comment. @eddy16112 @artempriakhin

lightsighter commented 5 months ago

Legion does issue potentially poisoned reservation acquires/releases, but they are always matched: if the acquire is poisoned then the release is poisoned too, so neither of them should run. It shouldn't be possible to get an acquire that is not poisoned and a release that is, which is the combination that could lead to a hang.
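
To make the matched pattern concrete, here is a minimal sketch (not Legion's actual code), assuming the usual Realm Reservation::acquire/release signatures and that poison propagates through the acquire:

    // Sketch only: the acquire and the matching release hang off the same
    // possibly-poisoned chain, so if precond is poisoned neither should run.
    #include "realm.h"
    using namespace Realm;

    Event guarded_update(Reservation rsrv, Event precond)
    {
      // If precond is poisoned, the acquire is skipped and (assuming poison
      // propagates) the returned event is poisoned as well.
      Event acquired = rsrv.acquire(0 /*mode*/, true /*exclusive*/, precond);

      // ... work that depends on `acquired` would be launched here ...

      // The release is chained off the acquire, so a poisoned acquire implies
      // a skipped ("poisoned deferred unlock skipped") release, never a
      // release without a matching acquire.
      rsrv.release(acquired);
      return acquired;
    }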

lightsighter commented 5 months ago

I should add, though, that this assumes the poison propagates through a reservation acquire. If the poison doesn't propagate through a reservation acquire, then the way Legion is doing things is not going to work.

mariodirenzo commented 5 months ago

Is there any update on this issue?

lightsighter commented 5 months ago

I added it to the Realm meeting agenda for next week.

apryakhin commented 5 months ago

I am going to take a look at it. I remember sending out a related patch to fix the following:

It could be unrelated, or it could be the bug I introduced.

apryakhin commented 5 months ago

Realm shouldn't be acquiring the reservation in the first place when Reservation::acquire is called with a poisoned precondition. Unless we are running into a race condition with the following:

      ...
      bool poisoned = false;
      if(wait_on.has_triggered_faultaware(poisoned)) {
        if(poisoned) {
          log_reservation.info()
              << "reservation:" << *this
              << " cannot be acquired due to poisoned precondition finish=" << wait_on;
          return wait_on;
        } else {
          Event e = get_runtime()->get_lock_impl(*this)->acquire(
              mode, exclusive, ReservationImpl::ACQUIRE_BLOCKING);
          log_reservation.info()
              << "reservation acquire: rsrv=" << *this << " finish=" << e;
          return e;
        }
        ...
apryakhin commented 5 months ago

@mariodirenzo Is that only with a multi-node setup? Would you be able to run whatever you are running with -level reservation=1?
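
For example, something along these lines should produce per-rank reservation logs (illustrative command; in Realm, -level sets per-category logging levels and, as far as I recall, a % in -logfile is expanded to the node rank):

    <your usual HTR launch command> -level reservation=1 -logfile rsv_%.log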

mariodirenzo commented 5 months ago

I can't reproduce the issue on 1 node (at least not deterministically, as with 2 nodes). The logs for a two-node run are at rsv_0.log and rsv_1.log

apryakhin commented 5 months ago

@mariodirenzo Do you have the logs on sapling2? Could you perhaps share the path instead of me downloading the log files?

mariodirenzo commented 5 months ago

Sure, /home/mariodr/public_html/rsv_*

apryakhin commented 4 months ago

@mariodirenzo Is this blocking you at the moment? We haven't made any progress in the last two weeks. If it's an urgent matter, we can prioritize it on our side.

We may need to apply a custom patch with additional logging to root-cause the problem. Do you think you would be able to apply it and rerun?

apryakhin commented 4 months ago

The best option would probably be for us to have a reproducer we can run on sapling2.

mariodirenzo commented 4 months ago

I'll try to reproduce the problem on sapling. We can resort to applying the logging patch in case my attempts are not successful.

mariodirenzo commented 4 months ago

I haven't been able to reproduce the issue on Sapling. I am unsure if this is system-related (though I can see it both on Lassen at LLNL and Leonardo at CINECA, which have different architectures) or simply because I am not running the calculation for long enough. At this point, I would apply a custom patch to the setups on Lassen that reproduce the problem more regularly and see if we can extract useful data out of the produced logs

elliottslaughter commented 4 months ago

@apryakhin What do we need to do for this? It's one of our last outstanding issues for the release.

apryakhin commented 4 months ago

@elliottslaughter Based on the data we have, I can't confidently reason about where the problem is, at least in Realm. The presence of the "poisoned precondition" upon reservation release does not indicate the problem; it merely states that the release of the reservation has been skipped (which must also mean that it wasn't acquired in the first place).

There are two things we can do: we either need a reproducer on a system I can access, in the form of a Realm, Legion, or HTR test, or a custom patch to collect more data for debugging.

I can prioritize working on this issue over the next several days to make sure we find a root cause.

elliottslaughter commented 4 months ago

I don't know anything about the Leonardo system @mariodirenzo mentioned but I can tell you that getting access to LLNL systems will be slow (if it's possible at all). I believe @mariodirenzo was suggesting going the logging route.

Having said that, I wonder if it's worth just running the code longer on Sapling (possibly with some interfering processes) to see if we can make it happen there?

apryakhin commented 4 months ago

Thanks, Elliott. I agree with your suggestion, but I believe @mariodirenzo has already attempted it without success.

I conducted further diagnostics on the logs and double-checked the data, leading to the following understanding:

We made approximately 6.7 million attempts to acquire roughly 16,000 different reservations over a certain period. As mentioned earlier, the "POSSIBLE HANG" message doesn't pinpoint the issue; it simply suggests that we skipped releasing the reservation, indicating a potential problem.

Here's an example of the invariant for a randomly selected reservation in the logs; it remains consistent across all reservations:

Number of reservation.acquire calls: 108883
Number of actual reservation acquires: 108612
Number of reservation.acquire dropped (poisoned) calls: 11
Number of reservation.release calls: 108894
Number of POSSIBLE HANG / reservation.release dropped (deferred/poisoned) calls: 271

ASSERT(reservation.acquire(108883) + reservation.acquire_dropped(11) == reservation.release(108894))
ASSERT(reservation.acquire(108883) - reservation.actual_acquire(108612) == reservation.release_dropped(271))

On a per-reservation basis, we're seeing a certain number of skipped releases, each indicated by a "POSSIBLE HANG" message. This count amounts to 271, so we should also observe 271 fewer actual reservation acquisitions; otherwise, it could lead to deadlock. Subtracting the number of actual acquisitions from the total acquire attempts (108883 - 108612 = 271) gives exactly that expected count of failed attempts, and it precisely matches the number of "POSSIBLE HANG" messages that report skipped reservation releases.

This holds true for each reservation in the logs provided here.

Unfortunately, this isn't a root cause and doesn't pinpoint the exact problem. However, it does somewhat increase confidence that we're not simply overlooking any unlock calls or mistakenly acquiring one reservation while performing a poisoned_release_skip on another.

I could be mistaken, so I'll need to discuss this with the team during today's meeting to ensure I haven't overlooked anything. My intuition suggests that we still might be encountering an issue related to poisoned acquire/release semantics. I haven't delved into the exact event IDs used for deferred acquire/release, so there's a possibility something could be amiss there, although that would require a more detailed investigation.
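
For reference, the per-reservation tallies above can be recomputed from the -level reservation logs with something along these lines (a rough sketch that only matches the message strings quoted in this thread; exact log formats may differ between Realm versions):

    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    // Per-reservation counters, keyed by the hex id appearing in the log lines.
    struct Counts {
      long actual_acquires = 0;   // "reservation acquire: rsrv=..."
      long poisoned_acquires = 0; // "... cannot be acquired due to poisoned precondition ..."
      long possible_hangs = 0;    // "poisoned deferred unlock skipped - POSSIBLE HANG - lock=..."
    };

    // Extract the token that follows a marker such as "rsrv=", "lock=" or "reservation:".
    static std::string id_after(const std::string &line, const std::string &marker)
    {
      size_t p = line.find(marker);
      if(p == std::string::npos) return "";
      p += marker.size();
      size_t e = line.find_first_of(" \t", p);
      return line.substr(p, e == std::string::npos ? std::string::npos : e - p);
    }

    int main(int argc, char **argv)
    {
      std::map<std::string, Counts> by_rsrv;
      for(int i = 1; i < argc; i++) {
        std::ifstream in(argv[i]);
        std::string line;
        while(std::getline(in, line)) {
          if(line.find("poisoned deferred unlock skipped") != std::string::npos)
            by_rsrv[id_after(line, "lock=")].possible_hangs++;
          else if(line.find("cannot be acquired due to poisoned precondition") != std::string::npos)
            by_rsrv[id_after(line, "reservation:")].poisoned_acquires++;
          else if(line.find("reservation acquire:") != std::string::npos)
            by_rsrv[id_after(line, "rsrv=")].actual_acquires++;
        }
      }
      for(const auto &kv : by_rsrv)
        std::cout << kv.first
                  << " actual_acquires=" << kv.second.actual_acquires
                  << " poisoned_acquires=" << kv.second.poisoned_acquires
                  << " possible_hangs=" << kv.second.possible_hangs << "\n";
      return 0;
    }

Running it over rsv_0.log and rsv_1.log together gives the combined per-reservation counts.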

elliottslaughter commented 4 months ago

Summarizing from our meeting today: we'll try to find someone with Lassen access to collaborate with @apryakhin so that we can reproduce this issue directly. @rohany volunteered to help out.

apryakhin commented 4 months ago

@muraj and @eddy16112 for visibility

seemamirch commented 3 months ago

@mariodirenzo - please provide instructions to reproduce this on Lassen

mariodirenzo commented 3 months ago

I've already given the required files and instructions to @apryakhin via @rohany's account on Lassen

apryakhin commented 2 months ago

This issue has been open for a while now, and we need to reach some conclusion. We were able to set up a reproducer on Sapling (single node). The symptoms we observed are that the application does not make any progress. Specifically, it is expected to print the average value of the solution it is computing at every iteration. We did live debugging with GDB with @lightsighter on Sapling and can confirm that, whatever is going on, the runtime is doing some work. In other words, there is no hang.

The “POSSIBLE HANG” message reported at the beginning is unfortunately misleading, as it may report either the actual problem or false positives. In this case, it was the latter. I confirmed that by simply counting the number of lock/unlock operations on each reservation employed by Realm.

Now, two questions need to be answered. First, are the symptoms we see on Sapling the same as those that have been observed on Lassen (as initially reported)? In other words, there is no runtime hang while the application isn’t making any progress. Second, if that’s the case, why isn’t the application making progress? It seems that it has entered some form of loop, so we need to understand what it is doing.

lightsighter commented 2 months ago

@mariodirenzo To clarify a bit, it's very clear that Legion is actively running tasks for the application. Forward progress is being made. At no point does the application ever go into a hanging state. It's unclear to both me and Artem what is being done, but real application work is actively being accomplished. It may not be finishing for some reason at the application level, but it's not because it is hung. We need you to diagnose what is going on at the application level.