DARMA-tasking / vt

DARMA/vt => Virtual Transport

Bug in communication JSON output #2279

Closed lifflander closed 4 weeks ago

lifflander commented 2 months ago

Describe the bug

While @pierrepebay and @cwschilly were working on displaying the communication graph in vt-tv, we noticed communication edges in the JSON LB data file output that link to object IDs which do not exist as tasks. Either the communication link should not exist or, more likely, a task should be output with zero time, if that is what is happening. This is probably caused by a bare handler link being output.

lifflander commented 1 month ago

@pierrepebay Please post JSON that is incorrect.

lifflander commented 1 month ago

@lifflander I will write a unit test that reproduces this.

lifflander commented 1 month ago

@cz4rs @pierrepebay

Here is a pointer to the incorrect JSON: https://github.com/DARMA-tasking/vt-tv/blob/master/tests/unit/lb_test_data/data.0.json

An entity with "home": 0, "id": 0 exists in the communications but is not present in the tasks. This data was generated using lb_iter on 4 ranks, so running that should reproduce the bug. I think this bug is easy to reproduce (probably with multiple benchmarks).

lifflander commented 1 month ago

I believe that this code:

https://github.com/DARMA-tasking/vt/blob/1570dcfbe2673a23bf49c270b31d8898ab11282b/src/vt/messaging/active.cc#L153-L164

is creating that entry, but it is not being output as a task, perhaps because it has zero work?

pierrepebay commented 1 month ago

@cz4rs Additionally, if this can help: when I encountered the bug, I went through the data (the files @lifflander pointed to) for the 4 ranks by hand and sketched this out: [image] It shows what is contained in the communications section of the JSON for each of the 4 ranks, and whether the endpoints of each communication are present in the tasks field for that rank (circled in green) or not (circled in red). It shows that object 0 does not exist in the tasks of any rank.
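The by-hand check described above could be automated roughly as follows. This is a minimal sketch, not part of vt or vt-tv: the function names (`task_ids`, `endpoint_ids`, `check_ranks`, `load_ranks`) are hypothetical, and the layout is an assumption based on the fields mentioned in this thread, namely a top-level `phases` list whose entries carry `tasks` (each with an `entity` holding an `id`) and `communications` (each with `from` and `to` endpoints holding an `id`). It also assumes the files are plain, uncompressed JSON, merges all phases together, and ignores the endpoint `type` field, which is a simplification.

```python
import json
from pathlib import Path


def task_ids(rank_data):
    """Collect the ids of every entity listed under "tasks" (all phases merged)."""
    return {
        task["entity"]["id"]
        for phase in rank_data.get("phases", [])
        for task in phase.get("tasks", [])
        if "id" in task.get("entity", {})
    }


def endpoint_ids(rank_data):
    """Collect the ids of every "from"/"to" endpoint under "communications"."""
    ids = set()
    for phase in rank_data.get("phases", []):
        for comm in phase.get("communications", []):
            for side in ("from", "to"):
                endpoint = comm.get(side, {})
                if "id" in endpoint:
                    ids.add(endpoint["id"])
    return ids


def check_ranks(per_rank):
    """per_rank maps rank -> parsed JSON for that rank.

    Returns (report, orphans):
      report[rank][endpoint_id] is True if the endpoint is a task on that rank
      (the green/red marking in the sketch); orphans is the sorted list of
      endpoint ids that are not a task on any rank.
    """
    tasks = {rank: task_ids(data) for rank, data in per_rank.items()}
    all_tasks = set().union(*tasks.values())
    report = {}
    orphans = set()
    for rank, data in per_rank.items():
        endpoints = endpoint_ids(data)
        report[rank] = {eid: eid in tasks[rank] for eid in endpoints}
        orphans |= endpoints - all_tasks
    return report, sorted(orphans)


def load_ranks(directory):
    """Load files named data.<rank>.json from a directory (assumed naming)."""
    per_rank = {}
    for path in Path(directory).glob("data.*.json"):
        rank = int(path.name.split(".")[1])
        with path.open() as f:
            per_rank[rank] = json.load(f)
    return per_rank
```

Calling `check_ranks(load_ranks("tests/unit/lb_test_data"))` on a local checkout would then list any endpoint ids that never appear as tasks. An endpoint that is absent on one rank but present on another is expected for cross-rank communication, so only the `orphans` list points at the problem discussed here.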

cz4rs commented 1 month ago

Output from lb_iter with a downsized problem (running on two ranks); bare handlers are marked explicitly: data.1.json, data.0.json

lifflander commented 1 month ago

So you added the bare_handler to this? Are you able to reproduce the bug?

cz4rs commented 1 month ago

> So you added the bare_handler to this? Are you able to reproduce the bug?

Yes, it reproduces consistently.