darshan-hpc / darshan

Darshan I/O characterization tool

Heatmap time skew problem #955

Closed Nafi3 closed 10 months ago

Nafi3 commented 10 months ago

Hi! Following issue #941 and fix #945, I have encountered some logs that utils is still unable to parse.

carns commented 10 months ago

Thanks @Nafi3 ! Confirmed that I can reproduce as well.

I'd like to look at the raw data with darshan-parser, but it's crashing on the Lustre module data, which I've seen before on some of your other logs. I'm going to hack around that for now to get to the heatmap issue, but just noting that we need to follow up on that as well. I might use your log examples in another issue shortly so we can track that problem.

carns commented 10 months ago

Ok, the problem with the heatmap data is this: the heatmap skew parsing workaround that we added in 3.4.4 only normalizes the number of bins in each heatmap record if they are off by exactly 1. This was intentional; I had hoped that the Darshan shutdown would be synchronized at least enough that there would be minimal skew.

The first log in the zip file has some ranks with 8 more heatmap bins than some of the other ranks. Some ranks have up to 106 bins while others only have 98. The bin width is 0.1 seconds in this log, meaning that some ranks shut down the heatmap module 0.8 seconds later than others.
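
To make the arithmetic concrete, here is a rough Python sketch of the check described above. It is not the actual darshan-util code: the function names `heatmap_skew` and `classify_skew` and the sample list of per-rank bin counts are hypothetical, and only the numbers 98, 106, and 0.1 come from this log.

```python
def heatmap_skew(bin_counts, bin_width_s):
    """Return (extra_bins, skew_seconds) between the longest and shortest record."""
    extra_bins = max(bin_counts) - min(bin_counts)
    return extra_bins, extra_bins * bin_width_s


def classify_skew(bin_counts):
    """Hypothetical classification mirroring the rule described above:
    records are only normalized when bin counts differ by exactly one bin."""
    extra_bins = max(bin_counts) - min(bin_counts)
    if extra_bins == 0:
        return "consistent"      # nothing to repair
    if extra_bins == 1:
        return "normalize"       # small enough skew to repair
    return "too skewed"          # left alone by the workaround


# Illustrative per-rank bin counts; only the 98 and 106 extremes are from the log.
counts = [98, 106, 100]
extra, skew = heatmap_skew(counts, 0.1)
print(extra, round(skew, 3), classify_skew(counts))
```

With these inputs the difference is 8 bins, which at a 0.1 second bin width corresponds to 0.8 seconds of skew, well beyond the one-bin tolerance.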

I'm going to look at the code a little more and think about this; I'm not sure if it is a good idea to try to normalize these logs if the skew is arbitrarily big.

(For reference for anyone following this issue: the root cause of the skew when logs are generated has already been fixed in #942 and released in Darshan 3.4.2; the question here is whether we can repair data in logs that triggered this issue previously.)

carns commented 10 months ago

After some offline discussion, we've decided it's best not to try to normalize the logs that are this skewed.

The separate issue with the Lustre module data is being tracked in #956.