therault opened 2 years ago
If we agree on 1. (require process-wide unique IDs for cross-thread matching events), we should add a check in dbpreader.c that will issue a warning or a fatal error if it sees an ID re-used in a process for events of a type that is detected as being cross-thread matching.
This scenario can indeed happen every time we reuse an ID. The discussion was to move the complexity to the tool side, and match the event with the earliest completion event across all the other threads. This does indeed require a pre-analysis to build, for each thread, the list of out-of-sequence events, but we can build this as we go.
Enforcing 1. could work iff there is a way to generate this unique id. It also creates 2 classes of profiling events, some that can be delocalized and some that cannot. The burden is then on the event creator to make sure it uses the right class.
Regarding 1., this is fairly trivial to implement, albeit with some runtime synchronization cost, namely fetch-and-add atomics. I already do this in the LCI comm engine, for fine-grained communication event tracking between the communication and progress threads—it works quite well.
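For reference, such a process-wide ID generator is a one-liner with C11 atomics. This is a minimal sketch (the `next_event_id` / `new_event_id` names are illustrative, not part of the PaRSEC or LCI API):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Process-wide counter; each thread draws distinct IDs from it. */
static atomic_uint_fast64_t next_event_id = 1;

static inline uint64_t new_event_id(void)
{
    /* fetch_add returns the previous value, so every caller gets a
     * distinct ID; relaxed ordering suffices for uniqueness alone. */
    return (uint64_t)atomic_fetch_add_explicit(&next_event_id, 1,
                                               memory_order_relaxed);
}
```

The relaxed memory order keeps the cost to a single atomic RMW per event, which is the "runtime synchronization cost" mentioned above.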
I think the current phrasing is that IDs are "possibly unique". A better phrasing might be that IDs must be unique and not reused between the begin and end events for a particular event type.
it is not only about how to generate the unique ID, but also how to store it in the code in order to reuse it later on.
Right—in the LCI case, I added it to the structure that tracks event status/completion, so it could be reused from there. I expect on most of these asynchronous use cases there is some structure that contains information about the event, and the ID can be added there.
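To illustrate the storage question, a hedged sketch of the pattern: the structure that already tracks an asynchronous operation's completion also carries the profiling ID, so the end event can reuse the begin event's key. All names here are illustrative, not the PaRSEC or LCI API:

```c
#include <stdint.h>

/* Per-operation tracking structure that the runtime maintains anyway;
 * the profiling ID rides along with the completion state. */
typedef struct async_op_s {
    int      completed;       /* completion flag, already needed      */
    uint64_t prof_event_id;   /* unique key assigned at begin time    */
} async_op_t;

static void op_begin(async_op_t *op, uint64_t id)
{
    op->prof_event_id = id;   /* remember the key for the end event */
    op->completed = 0;
    /* ... emit the begin event with key = id ... */
}

static uint64_t op_end(async_op_t *op)
{
    op->completed = 1;
    /* ... emit the end event with the same key ... */
    return op->prof_event_id;
}
```

Whichever thread observes completion can then emit the end event with the stored key, even if it is not the thread that started the operation.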
For options 2/3 it wouldn't be too difficult to change #372 to sort events by time from all streams rather than per-stream, albeit with an additional time complexity cost of merging O(s) lists, where s is the number of streams. Alternatively, it could do a binary search over all O(s) streams and then a linear search over the s matches to find the minimum timestamp. The former suggestion has a time complexity of O(s n) for the merging, where n is the mean number of buffers with possibly-matching end events in each stream, and has an O(log s + log n) time complexity for finding the event, while the latter doesn't have any additional startup cost, but finding the correct matching event is O(s + s log n).
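The per-stream search reduces to: for each stream, find that stream's earliest matching end event after the begin timestamp, then take the minimum over streams. A minimal sketch under the assumption that each stream's end events are sorted by timestamp (a linear scan stands in for the binary search discussed above; types and names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t key; uint64_t time; } end_event_t;
typedef struct { const end_event_t *ev; size_t n; } stream_t;

/* Return the earliest end event, over all s streams, with a matching
 * key and a timestamp after t_begin; NULL if no stream has one. */
static const end_event_t *
match_end(const stream_t *streams, size_t s, uint64_t key, uint64_t t_begin)
{
    const end_event_t *best = NULL;
    for (size_t i = 0; i < s; i++) {
        for (size_t j = 0; j < streams[i].n; j++) {
            const end_event_t *e = &streams[i].ev[j];
            if (e->time > t_begin && e->key == key) {
                if (best == NULL || e->time < best->time)
                    best = e;
                break;  /* sorted: first hit is this stream's earliest */
            }
        }
    }
    return best;
}
```

Replacing each inner scan with a binary search on timestamp plus a short forward scan for the key gives the O(s + s log n) cost mentioned above.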
That's exactly what I had in mind, extending #372 (now merged) to gracefully handle these cases, and remove the burden from the user.
I'll note that I disagree with option 2, not because I think it'll be overly costly (though it might be), but because it changes the current semantics that cross-stream events are considered only if a match in the current stream is not found; I'd rather that cross-stream events be opt-in behavior.
In particular, I think that always doing cross-stream matching might break the PAPI profiling, since if I recall correctly it uses a static ID and relies on matching in the same stream for correctness.
It looks like we need to support both kinds of semantics then, and that only the user can decide if this type of event is cross-thread matching or per-thread matching.
The current API assumes per-thread matching unless this fails, but as we have seen in the counter-example above, 'unless it fails' does not work for cross-thread matching events: the per-thread match might not fail, yet still be wrong (the stream 0 allocation can match a stream 0 free, but it shouldn't).
So, a solution based on a user-defined flag, set at dictionary creation time or at event profiling time, would be a good compromise; and if the flag is set, we need to do cross-thread matching even for events that have an apparent match on the same thread.
Right—either we force the user to maintain ID uniqueness themselves for events that might match across streams (i.e., ensuring that an event doesn't find a spurious match in the current stream), or events/event types can declare which matching behavior is required, and the burden is shifted to the performance analysis/conversion tools.
In terms of implementation, #372 makes an optimizing assumption (per @bosilca) that most begin events are immediately followed by the matching end event, and so only end events that do not immediately follow a matching begin event are recorded as events that can possibly match cross-stream. This optimization cannot, I think, be made for events specifically marked as cross-stream events.
Right, we don’t have the automatic fallback from the optimized path, because there is no guarantee that the supposedly matching end event on the same stream is the correct one. Thus, this escalates quickly, and we are forced to do a global check for most events. In fact, I think we need to do a global check for all events that do not nicely embed (aka a start-end always inside another start-end).
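The "nicely embed" property per stream is a balanced-parentheses check: every end must close the most recently opened begin. A hedged sketch of such a detector (names and the fixed depth limit are illustrative assumptions, not an existing API):

```c
#include <stddef.h>
#include <stdint.h>

enum { MAX_DEPTH = 64 };

/* Tracks the keys of currently-open begin events on one stream. */
typedef struct {
    uint64_t open_keys[MAX_DEPTH];
    size_t   depth;
    int      well_nested;   /* stays 1 while events nest properly */
} nest_checker_t;

static void nest_begin(nest_checker_t *c, uint64_t key)
{
    if (c->depth < MAX_DEPTH)
        c->open_keys[c->depth] = key;
    c->depth++;
}

static void nest_end(nest_checker_t *c, uint64_t key)
{
    /* An end that does not close the innermost open begin means the
     * stream's events do not nicely embed. */
    if (c->depth == 0 || c->open_keys[--c->depth] != key)
        c->well_nested = 0;
}
```

Events on streams where `well_nested` stays 1 can keep the fast same-stream matching; the others would need the global check.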
Let’s go for the simplest solution: events should have unique keys.
Description
When debugging memory leaks, I used the returned pointer as the ID for events.
That pointer is temporally unique (at any given time, it represents a single allocation), but it is often re-used during the execution.
Now, consider a case where a thread can free the allocation of another thread. We can have the following scenario:
The inter-stream matching of the profiling will do the following:
This is because of the strategy 'find the matching end on the same stream first, and if not, then find any matching event that occurs after the current timestamp on another stream'.
That strategy is also flawed in this scenario: if we add a stream 2 that ends the allocation of ID 1 from stream 1, we will still match the end with stream 0, because we tried stream 0 first...
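A small self-contained sketch of why the same-stream-first strategy mismatches reused IDs. Assume (illustratively) that stream 1 frees the first allocation with key 1 at time 10, and stream 0 later frees a second allocation that reuses key 1 at time 40; a begin on stream 0 at time 1 should match the time-10 end, but same-stream-first returns the time-40 end:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t key; uint64_t time; } end_event_t;
typedef struct { const end_event_t *ev; size_t n; } stream_t;

/* The flawed strategy: try the begin event's own stream first, and
 * fall back to other streams only if no same-stream match exists. */
static const end_event_t *
same_stream_first(const stream_t *st, size_t s, size_t begin_stream,
                  uint64_t key, uint64_t t_begin)
{
    for (size_t j = 0; j < st[begin_stream].n; j++)
        if (st[begin_stream].ev[j].key == key &&
            st[begin_stream].ev[j].time > t_begin)
            return &st[begin_stream].ev[j];
    for (size_t i = 0; i < s; i++) {
        if (i == begin_stream) continue;
        for (size_t j = 0; j < st[i].n; j++)
            if (st[i].ev[j].key == key && st[i].ev[j].time > t_begin)
                return &st[i].ev[j];
    }
    return NULL;
}
```

The fallback never fires, so the earlier cross-stream end at time 10 is silently ignored and the begin is paired with a later, unrelated allocation's end.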
Describe the solution you'd like
To properly match end events when IDs are reused, the only correct solution is to BFS for the matching end on ALL threads simultaneously (advancing on the global ordering of events by timestamp). ALL events (including the ones that have their matching end on the same thread) would have to follow the same strategy. Even with the lookup acceleration from PR #372, this will probably be too costly.
We could mark the events as 'REUSE_ID', to tell the parser that those need BFS matching to circumvent that problem...
I think the best solution is to forbid this behavior: if cross-thread matching is necessary, the user cannot re-use IDs during the run, and matching IDs must be unique. The documentation should reflect this.
Describe alternatives you've considered
Additional context
Memory allocation tracking in the PaRSEC engine also suffers from that problem... For my use case in TTG, I can easily fix by providing a run-wise unique ID for each allocation.