Support for collection of backtrace memory addresses

hammad45 commented 10 months ago

- Added support for collecting backtrace memory addresses using backtrace() and backtrace_symbols()
- Get address-to-line mappings using addr2line for the unique memory addresses that belong to the binary
- Modified Darshan logs to include the address-to-line mappings in the Darshan header and the complete stack of memory addresses in the DXT trace data
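
For readers unfamiliar with the glibc calls named above, here is a minimal sketch of capturing and printing a stack with backtrace() and backtrace_symbols(). It is not the code from this PR; the helper names capture_stack and print_stack and the MAX_FRAMES cap are hypothetical.

```c
#include <assert.h>
#include <execinfo.h>  /* backtrace(), backtrace_symbols(): glibc extensions */
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 64  /* hypothetical cap on the recorded stack depth */

/* Capture up to max_frames return addresses of the calling thread.
 * Returns the number of addresses stored in frames. */
static int capture_stack(void **frames, int max_frames)
{
    assert(frames != NULL && max_frames > 0);
    return backtrace(frames, max_frames);
}

/* Print one line per frame: index, raw address, symbolic description.
 * Returns 0 on success, -1 if backtrace_symbols() could not allocate. */
static int print_stack(FILE *out, void *const *frames, int depth)
{
    char **symbols = backtrace_symbols(frames, depth);
    if (symbols == NULL)
        return -1;
    for (int i = 0; i < depth; i++)
        fprintf(out, "#%d %p %s\n", i, frames[i], symbols[i]);
    free(symbols);  /* a single free() releases the array and its strings */
    return 0;
}
```

A caller would declare `void *frames[MAX_FRAMES]`, call capture_stack() at the point of interest, and store or print the result. The raw addresses are what a tool such as addr2line later maps to source lines.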

jakobluettgau commented 10 months ago

Hi Hammad, this looks really nice. I'll try to create some logs with this new mode for DXT in MPI and POSIX as well, but could you share one of your logs for testing too?

Also, since this appears to change the log format, it should bump the log format versions, for example DXT_*_VER for the affected modules in darshan-dxt-log-format.h.

I'll try to run some tests and get back with additional feedback.

jakobluettgau commented 9 months ago

It looks like this regresses for old Darshan logs. It should not be a big deal to support both formats, but as is, old logs error out in darshan-parser and darshan-dxt-parser, as well as in pydarshan:

Error: failed to read darshan log file header. Error: darshan_log_open failed to read darshan log file header: Success.
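
One way to support both formats could be to branch on the stored module version when decoding a record. The following is only a sketch of that idea: every name, version number, and struct layout in it is hypothetical and not taken from Darshan's headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical version numbers, NOT the real DXT_*_VER values. */
#define EXAMPLE_VER_NO_STACKS   1
#define EXAMPLE_VER_WITH_STACKS 2
#define EXAMPLE_NO_STACK        UINT64_MAX  /* marker: record has no stack */

/* Hypothetical old on-disk segment layout (no stack information). */
typedef struct {
    int64_t offset;
    int64_t length;
    double  start_time;
    double  end_time;
} example_seg_v1;

/* Hypothetical new layout: the old fields plus a stack identifier. */
typedef struct {
    int64_t  offset;
    int64_t  length;
    double   start_time;
    double   end_time;
    uint64_t stack_id;
} example_seg_v2;

/* Decode one segment written with format version ver into the newest
 * in-memory layout. Returns 0 on success, -1 on an unknown version or
 * a buffer that is too short for that version. */
static int example_decode_segment(int ver, const void *buf, size_t len,
                                  example_seg_v2 *out)
{
    assert(buf != NULL && out != NULL);
    if (ver == EXAMPLE_VER_WITH_STACKS) {
        if (len < sizeof *out)
            return -1;
        memcpy(out, buf, sizeof *out);
        return 0;
    }
    if (ver == EXAMPLE_VER_NO_STACKS) {
        example_seg_v1 old;
        if (len < sizeof old)
            return -1;
        memcpy(&old, buf, sizeof old);
        /* Up-convert: copy the shared fields, mark the stack as absent. */
        out->offset     = old.offset;
        out->length     = old.length;
        out->start_time = old.start_time;
        out->end_time   = old.end_time;
        out->stack_id   = EXAMPLE_NO_STACK;
        return 0;
    }
    return -1;
}
```

With this kind of dispatch, old logs decode into the new structure with an explicit "no stack" marker instead of failing at open time.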

jakobluettgau commented 9 months ago

I guess a small paragraph for the documentation might be helpful as well. Something along the lines of:

- The target application needs to be compiled with debugging symbols (-g); otherwise the line mappings are less meaningful and just show ??
- To collect backtrace information, a new environment variable has to be set: export DXT_ENABLE_STACK_TRACE=1
- Maybe a reference to the online man pages of backtrace and addr2line for interested users
- And maybe at some point, with more experience, an expectation of the added overhead when enabled
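
For the documentation, it may help to show what an opt-in check on that variable could look like. This is a sketch only: the thread gives the variable name DXT_ENABLE_STACK_TRACE but not how the PR interprets its value, so the rule used here (any non-empty value other than "0" enables collection) and the helper name stack_trace_enabled are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if stack trace collection is requested via the environment,
 * 0 otherwise. ASSUMPTION: any non-empty value other than "0" enables it;
 * the actual PR may treat the value differently. */
static int stack_trace_enabled(void)
{
    const char *val = getenv("DXT_ENABLE_STACK_TRACE");
    if (val == NULL || val[0] == '\0')
        return 0;
    return strcmp(val, "0") != 0;
}
```

Checking the variable once at module initialization and caching the result would keep the per-call cost of the disabled path to a single branch.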

Maybe some other noteworthy remarks from your experience when implementing this :)

shanedsnyder commented 9 months ago

Hi Hammad,

Thanks for submitting this PR!

Could you provide some detailed comments/discussion on how exactly the stack traces are collected with this code? I think it would take me some time to grok all the code changes, but it will be easier if I'm able to better understand how this process is intended to be carried out. From a relatively quick first scan, it seems:

- Processes independently capture stacktrace info as read/write calls come into DXT
- At DXT module shutdown time, information related to these stacktraces is extracted and written to per-process files
- At Darshan shutdown time, rank 0 serially reads each per-rank file, extracts/transforms the data, then writes the resulting output data into the Darshan header
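
To make the extraction step concrete, here is a sketch of one way it could work: deduplicate the captured addresses, then build a single addr2line command for the unique ones. This is an illustration, not the PR's code; the helper names are hypothetical, the executable path is not shell-quoted, and for position-independent executables the runtime addresses would generally need to be converted to offsets first.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* qsort comparator for addresses stored as uintptr_t. */
static int cmp_addr(const void *a, const void *b)
{
    uintptr_t x = *(const uintptr_t *)a, y = *(const uintptr_t *)b;
    return (x > y) - (x < y);
}

/* Sort addrs in place and drop duplicates.
 * Returns the number of unique addresses left at the front of the array. */
static size_t unique_addrs(uintptr_t *addrs, size_t n)
{
    if (n == 0)
        return 0;
    qsort(addrs, n, sizeof addrs[0], cmp_addr);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (addrs[i] != addrs[out - 1])
            addrs[out++] = addrs[i];
    return out;
}

/* Write "addr2line -e <exe> 0x... 0x..." into cmd.
 * Returns 0 on success, -1 if the buffer of size cap is too small.
 * NOTE: exe is inserted verbatim; a real implementation should quote it. */
static int build_addr2line_cmd(char *cmd, size_t cap, const char *exe,
                               const uintptr_t *addrs, size_t n)
{
    int w = snprintf(cmd, cap, "addr2line -e %s", exe);
    if (w < 0 || (size_t)w >= cap)
        return -1;
    size_t used = (size_t)w;
    for (size_t i = 0; i < n; i++) {
        w = snprintf(cmd + used, cap - used, " 0x%" PRIxPTR, addrs[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

Resolving only the unique addresses means one addr2line invocation per binary rather than one per traced I/O call, which matters because the same call sites tend to repeat many times in a trace.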

Any more elaborations there would be very welcome.

Without understanding the full changes yet, I do have a couple of higher level concerns:

- Ultimately storing this stack data in the Darshan log header is almost certainly not what we want to do
  - The header is a small, uncompressed region of the Darshan log file that stores compact metadata about the modules (i.e., their version, how much compressed data they wrote, etc.), so it's not really where we'd imagine storing big chunks of characterization data
  - If we can't store the stack traces alongside the trace segments captured by the DXT modules, I think I'd recommend we create an entirely new module (e.g., DXT_STACKS) that stores this info
- The shutdown process seems pretty inefficient. It looks like the DXT module on each process writes out its own file at module shutdown time, but then, as Darshan is shutting down and writing its log file, rank 0 has to read each of these per-rank files serially
  - Could we just use MPI collective operations at module shutdown time to reduce all of the stack data to rank 0? I'd guess that will be much more efficient than serializing all of this through the file system.

darshan-hpc / darshan

Support for collection of backtrace memory addresses #966