darshan-hpc / darshan

Darshan I/O characterization tool

MAINT: properly handle anonymized files on `/` filesystem in "data access by category" #686

Open shanedsnyder opened 2 years ago

shanedsnyder commented 2 years ago

Split off from https://github.com/darshan-hpc/darshan/pull/397, where there were some questions about strangely formatted log record names in some anonymized logs we look at frequently (the imbalanced-io example in the darshan-logs repo), e.g., a record named //2446001947.

A bit strange, but I don't think there are actually any issues that have to be resolved in the darshan-logutils library or in the pydarshan bindings. First, consider the list of mount points that Darshan captured at runtime for this job, from darshan-parser:

# mount entry:  /var/opt/cray/imps-image-binding/diags/squash_mounts/squashfs_pTbYoK_mount_point        squashfs
# mount entry:  /var/opt/cray/imps-image-binding/PE/squash_mounts/squashfs_4e4XkQ_mount_point   squashfs
# mount entry:  /var/opt/cray/imps-distribution/squash/mounts/global    squashfs
... <chopped for brevity> ...
# mount entry:  /home_cray      dvs
# mount entry:  /opt/gcc        squashfs
# mount entry:  /opt/R  squashfs
# mount entry:  /dev    devtmpfs
# mount entry:  /       overlay

What darshan-parser does is try to match each file record name it encounters in the log with the mount point it is associated with. You start at the top of the table (the longest mount points), work your way down to the bottom (the shortest mounts), and take the first one that matches the record name's path prefix. That is the mount point the file is associated with (these mount points are the different categories we want to report on in these plots).
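
As a rough illustration of the matching described above, here is a minimal Python sketch. It is not darshan-parser's actual code: `match_mount` is a hypothetical helper, the mount list is a subset of the table above, and the comparison is a plain string prefix check (whether the real implementation respects path component boundaries is not stated here).

```python
def match_mount(record_name, mounts):
    """Return the (mount_point, fs_type) pair whose mount point is the
    longest prefix of record_name, or None if nothing matches.

    Hypothetical helper for illustration only; `mounts` is a list of
    (mount_point, fs_type) tuples.
    """
    # Try the longest mount points first, so the first prefix hit is
    # also the most specific one.
    ordered = sorted(mounts, key=lambda m: len(m[0]), reverse=True)
    for mount_point, fs_type in ordered:
        if record_name.startswith(mount_point):
            return mount_point, fs_type
    return None


# Subset of the mount table shown above.
mounts = [
    ("/home_cray", "dvs"),
    ("/opt/gcc", "squashfs"),
    ("/opt/R", "squashfs"),
    ("/dev", "devtmpfs"),
    ("/", "overlay"),
]

print(match_mount("//2446001947", mounts))  # ('/', 'overlay')
```

With this ordering, the root mount acts as a catch-all for any absolute path that no longer mount point claims.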

So, you can see what darshan-parser thinks of these records here:

#<module>       <rank>  <record id>     <counter>       <value> <file name>     <mount pt>      <fs type>
POSIX   0       301664950237594445      POSIX_OPENS     1       //2446001947    /       overlay
POSIX   0       301664950237594445      POSIX_FILENOS   0       //2446001947    /       overlay
POSIX   0       301664950237594445      POSIX_DUPS      0       //2446001947    /       overlay
POSIX   0       301664950237594445      POSIX_READS     144     //2446001947    /       overlay

For these particular files, the very last mount point (root /) is the only one that matches the file record name, so they are associated with that mount. So, we need to make sure the "data access by category" table is properly able to match these records to the / mount, and that this mount point has a row in the final table.

For more context, as part of the log anonymization process, we always hash the part of the file record name after the mount prefix so that any identifying details of users are hidden (mount points are fine to capture in full, but we don't want to retain paths within mount points that could identify users). So, any record name that gets matched to the / mount will always have this form, and that's fine.

shanedsnyder commented 2 years ago

That said, there is a tiny bug in Darshan's anonymization code (I'll open another issue) that results in a double / prefix for anonymized files on the root filesystem /. So, these record names should simply be something like /2446001947. It could be that this // is causing problems in the code that matches mounts to file paths. If so, we should think about how to rectify that.

Generally speaking, our utilities should be resilient to file paths with repeated / symbols, as these are actually valid file paths. The darshan-runtime library makes sure to collapse repeated / into a single one, but as you can see, the anonymization code did not do this. Even if we fix the anonymization code, we can assume logs already exist that have this issue. Given that, we should probably add some code to the darshan-util library to condense repeated / so that different utilities don't each need special handling for this.
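
To make the proposed normalization concrete, here is a minimal Python sketch of condensing repeated / characters. The darshan-util library is written in C, so this only illustrates the intended behavior; `collapse_slashes` is a hypothetical name, not an existing Darshan function.

```python
import re


def collapse_slashes(path):
    """Collapse every run of two or more '/' characters into a single '/'.

    Hypothetical helper for illustration only.
    """
    return re.sub(r"/{2,}", "/", path)


print(collapse_slashes("//2446001947"))  # /2446001947
```

Note that Python's `os.path.normpath` would not help in this particular case, since on POSIX systems it preserves a leading double slash.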