darshan-hpc / darshan

Darshan I/O characterization tool

Specifying process rank and world size for non-MPI distributed applications #1006

Open Technohacker opened 2 months ago

Technohacker commented 2 months ago

When Darshan operates in non-MPI mode, it assumes each traced process is a single-process "world". However, there are non-MPI distributed environments where this assumption doesn't hold. This is particularly problematic when merging such traces with darshan-merge.

I suppose a way to specify these values manually via environment variables could be implemented, for example DARSHAN_PROCESS_RANK_VAR=RANK to have Darshan read the process's rank from the env var $RANK.
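
Something along these lines, shown here as a rough Python sketch of the lookup semantics only (the real change would live in Darshan's C runtime, and DARSHAN_PROCESS_NPROCS_VAR is just a made-up companion variable for the world size):

```python
import os

def _lookup(indirect_name, default):
    # The DARSHAN_* variable names the env var that actually holds the value.
    var = os.environ.get(indirect_name)
    if var and var in os.environ:
        return int(os.environ[var])
    return default

def resolve_rank_and_size():
    rank = _lookup("DARSHAN_PROCESS_RANK_VAR", 0)    # e.g. points at "RANK"
    size = _lookup("DARSHAN_PROCESS_NPROCS_VAR", 1)  # e.g. points at "WORLD_SIZE"
    return rank, size
```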

For my particular situation, I'm trying to trace I/O accesses for a distributed PyTorch script. The distributed backend used is NCCL, so the script has to be traced with Darshan's non-MPI mode. Torch launches a separate process for each GPU, but Darshan treats each of those processes as independent of the others.
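
For what it's worth, torchrun already exports the rank and world size as environment variables in each process it spawns, so the information the proposed lookup would need is already present. A small illustrative snippet, assuming a torchrun launch with the NCCL backend:

```python
import os
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every spawned process;
# these are the values each per-process Darshan log would ideally record.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

dist.init_process_group(backend="nccl")  # one process per GPU, NCCL backend
print(f"process {rank} of {world_size} (local GPU {local_rank})")
```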

shanedsnyder commented 2 months ago

Part of the complexity with this issue is that, for multi-process MPI jobs, Darshan condenses all instrumentation data down into a single log file using MPI collective routines (for log file I/O, for reducing shared data, etc.). This all happens at runtime, right before the app exits.
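
Roughly speaking, that shutdown path follows the pattern below. This is just a toy mpi4py sketch of "reduce shared data, then one rank writes the single log", not Darshan's actual code:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Stand-in for counters each rank accumulated during the run.
local_bytes_written = 1024 * (comm.Get_rank() + 1)

# Reduce shared data across ranks, then have rank 0 write the single log,
# mirroring the collective shutdown flow described above.
total = comm.reduce(local_bytes_written, op=MPI.SUM, root=0)
if comm.Get_rank() == 0:
    with open("toy_log.txt", "w") as f:
        f.write(f"total_bytes_written={total}\n")
```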

For non-MPI distributed frameworks like NCCL, even if the runtime environment provided a rank env var like you mention, we still wouldn't have the corresponding MPI functionality to communicate and reduce instrumentation data across ranks, so it wouldn't help beyond persisting a different rank ID in each process's log file.

FWIW, darshan-merge was originally written to solve a pretty specific problem. There are ways you could configure Darshan to generate per-process logs (like what you have with NCCL) for MPI applications that terminate abruptly and don't go through Darshan's traditional shutdown procedure for MPI apps -- darshan-merge can take those individual per-rank log files and aggregate them in a way that makes the result look like a traditional Darshan log (i.e., one log for all MPI ranks). So, in some sense it's very much like what you're looking for, but, as you're seeing, it relies on the MPI ranks recorded in the log files to work properly. We don't really recommend using darshan-merge to combine Darshan log data for this reason; it wasn't intended for general aggregation like this.

We've actually been thinking about this sort of problem a lot recently, trying to come up with better analysis tools for multi-process frameworks exactly like the one you describe here. We'd like to solve this by redesigning our old analysis tools (or writing entirely new ones) to operate on multiple Darshan logs as input, rather than trying to find creative ways to combine independent Darshan logs after the fact. I wish I had something concrete to share now, but it is something we want to address in a future release. In the meantime, this sort of aggregation is something you'll have to do manually, either by modifying darshan-merge or by writing your own custom analysis tool.
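
As a starting point for that kind of custom tool, here's a rough sketch using PyDarshan and pandas that reads several per-process logs and stacks their POSIX counters into one table (the app_*.darshan glob is just a placeholder for however your logs are named):

```python
import glob

import darshan   # PyDarshan: pip install darshan
import pandas as pd

frames = []
for path in sorted(glob.glob("app_*.darshan")):  # placeholder log names
    report = darshan.DarshanReport(path, read_all=True)
    # POSIX counters for this process's log; other modules (STDIO, MPI-IO, ...)
    # could be handled the same way if present.
    counters = report.records["POSIX"].to_df()["counters"]
    counters["source_log"] = path
    frames.append(counters)

combined = pd.concat(frames, ignore_index=True)
print(combined.groupby("source_log")["POSIX_BYTES_WRITTEN"].sum())
```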