This PR revamps Darshan's Lustre module to account for new Lustre features, including progressive file layouts (PFL), file-level redundancy (FLR), data-on-MDT (DOM), and self-extending layouts (SEL).
The main change is to Lustre file records, which can now describe multi-component Lustre layouts. To support this, Lustre module file records now have the following format:
base_record; # Darshan record ID + rank
num_comps; # number of Lustre components
comp_list; # list of fixed-length component descriptions
ost_list; # list of OST IDs associated with components above
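As a rough illustration of the format above, here is a minimal C sketch of one way such a record could be laid out in a single buffer. All struct and field names (`base_record`, `comp_desc`, `stripe_count`, and so on) are assumptions made for this example and are not taken from the Darshan source. The sketch assumes each component owns `stripe_count` consecutive entries in the trailing OST list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for Darshan's base record (record ID + rank). */
struct base_record {
    uint64_t id;
    int64_t rank;
};

/* Hypothetical fixed-length description of one layout component. */
struct comp_desc {
    int64_t stripe_size;
    int64_t stripe_count;   /* assumed: number of OST IDs this component owns */
    uint64_t ext_start;
    uint64_t ext_end;
};

/* Variable-length record: header, then comps[num_comps], then the OST IDs. */
struct lustre_record {
    struct base_record base;
    int64_t num_comps;
    struct comp_desc comps[];   /* C99 flexible array member */
};

/* Bytes needed for a record with the given component and OST counts. */
static inline size_t record_size(int64_t num_comps, int64_t num_osts)
{
    return sizeof(struct lustre_record)
         + (size_t)num_comps * sizeof(struct comp_desc)
         + (size_t)num_osts * sizeof(int64_t);
}

/* The OST list starts immediately after the last component description. */
static inline int64_t *record_ost_list(struct lustre_record *rec)
{
    return (int64_t *)(rec->comps + rec->num_comps);
}

/* OST IDs of component idx: skip the OSTs owned by earlier components. */
static inline int64_t *comp_ost_ids(struct lustre_record *rec, int64_t idx)
{
    assert(idx >= 0 && idx < rec->num_comps);
    int64_t *osts = record_ost_list(rec);
    for (int64_t i = 0; i < idx; i++)
        osts += rec->comps[i].stripe_count;
    return osts;
}
```

Because the OST list follows the component list directly, a reader only needs `num_comps` and the per-component stripe counts to locate any OST ID, with no stored pointers or offsets.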
For each component, Darshan collects the following parameters:
We use essentially the same Lustre API as in recent versions to query striping info (e.g., llapi_layout_get_by_xattr(), llapi_layout_stripe_size_get()), but the code is now aware that file layouts are composed of one or more components (i.e., it uses llapi_layout_comp_use()) and queries more info (e.g., llapi_layout_comp_extent_get()).
Some more detailed comments on these changes:
Lustre info is now gathered at close() time in the POSIX/STDIO modules. This is because some component info (i.e., the OST list) is only initialized by Lustre after the corresponding file extent is accessed, so we may miss these details if we query the layout at open() time.
To support variable-length records, the Lustre module first iterates over all layout components to determine the maximum record size needed and requests this amount of memory from Darshan. Each time the file is closed, Darshan fills this record buffer with details of the currently active components (which could be fewer than the maximum we initially calculated).
Special care is taken to allow indexing into variable-length component and OST lists in a single compact record buffer.
At shutdown time, any "holes" between Lustre records (e.g., when a captured record is smaller than the maximum record size we initially requested) are removed by re-serializing records into the output buffer. (This code is also used to drop shared records on non-zero ranks, which is a lot less convoluted than the previous code, which attempted to sort these variable-length Lustre records.)
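The shutdown-time re-serialization can be pictured with the following minimal C sketch. It is not the module's actual code: the `slot` bookkeeping struct and the `compact_records()` helper are hypothetical names introduced here. The sketch assumes each record was given a slot of its maximum size up front and that slots are ordered by offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bookkeeping for one record slot in the module buffer. */
struct slot {
    size_t offset;     /* where the slot starts in the buffer */
    size_t reserved;   /* maximum size requested up front */
    size_t used;       /* bytes actually filled at the last close() */
    int keep;          /* 0 to drop the record (e.g., shared, non-zero rank) */
};

/* Pack kept records to the front of buf, closing gaps between them.
 * Returns the packed size. Slots must be ordered by offset so that data
 * only ever moves toward the front of the buffer. */
static inline size_t compact_records(unsigned char *buf,
                                     const struct slot *slots, size_t nslots)
{
    size_t out = 0;
    for (size_t i = 0; i < nslots; i++) {
        assert(slots[i].used <= slots[i].reserved);
        if (!slots[i].keep)
            continue;
        /* memmove, not memcpy: source and destination may overlap */
        if (slots[i].offset != out)
            memmove(buf + out, buf + slots[i].offset, slots[i].used);
        out += slots[i].used;
    }
    return out;
}
```

A single forward pass handles both cases mentioned above: records that are smaller than their reserved slot, and records that are dropped entirely. No sorting of variable-length records is needed.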
There are a few counters in the old Lustre module that are no longer supported:
LUSTRE_OSTS/LUSTRE_MDTS: I don't see Lustre APIs to get these and would like to move on from relying on crufty ioctls like we currently use. Besides, these values apply to an entire Lustre mount (and likely don't change often, if at all, throughout deployment), so storing them for each file is overkill.
LUSTRE_STRIPE_OFFSET: Probably should have been removed a while back; it is not useful at all, particularly since we capture the entire OST list associated with each file.
PR is marked WIP until the following are resolved:
[x] backwards compatibility logic in darshan-util to up-convert old Lustre records to new format version
[x] fill in remainder of logutils functions
[ ] update DXT parsing code to account for new striping mechanics
[x] update Python bindings to account for new Lustre module format
[x] update darshan-util docs to describe new Lustre format