The /datasets/{id}/contents API involves several unexpectedly expensive steps:
- Finding the tarball (by MD5 value) within the ARCHIVE tree using a glob
- Fully discovering all tarballs within the controller directory
- Unpacking the tarball into a cache directory using tar
- Building a "map" of the contents of the unpacked tarball subtree
This PR includes mitigations for all but the tar unpack step:
- Use the server.tarball-path metadata instead of searching the disk
- Only discover the target tarball rather than the entire controller
- Skip the "map" and evaluate the actual target path within the cache
Finding a tarball within our 30 TB ARCHIVE tree can take many minutes, while identifying the controller directory from the tarball path takes a fraction of a second.
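The difference between the two lookups can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names are hypothetical, the glob matches on file name as a stand-in for the MD5-based search, and it assumes the controller directory is the immediate parent of the tarball.

```python
from pathlib import Path
from typing import Optional


def find_tarball_by_glob(archive_root: Path, name: str) -> Optional[Path]:
    """Slow path: recursively scan the whole tree for a matching file.

    Cost grows with the number of entries under archive_root.
    """
    return next(archive_root.glob(f"**/{name}"), None)


def controller_dir_from_path(tarball_path: str) -> Path:
    """Fast path: derive the controller directory from a recorded path.

    Assumes (hypothetically) that the tarball sits directly inside its
    controller directory, so no disk search is needed at all.
    """
    return Path(tarball_path).parent
```

The fast path does no file system traversal, so its cost is independent of the size of the tree.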
Depending on the number of tarballs within a controller (some have many), full controller discovery has been observed to take half a minute, while populating only the target tarball takes a fraction of a second.
Building the map for a large tarball tree can take minutes, whereas discovery of the actual relative file path within the cache runs at native (Python) file system speeds.
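As a rough sketch of what evaluating the target path directly might look like (again hypothetical, not the code in this PR): the helper names and the returned dictionary shape are assumptions, and the check that the path stays inside the cache is an added safeguard for the example rather than something described above.

```python
from pathlib import Path


def resolve_in_cache(cache_root: Path, relative: str) -> Path:
    """Resolve a requested relative path against the unpacked cache tree.

    Only the requested path is touched; no map of the full tree is built.
    """
    root = cache_root.resolve()
    target = (root / relative).resolve()
    # Assumed safeguard: reject paths that resolve outside the cache.
    if target != root and root not in target.parents:
        raise ValueError(f"{relative!r} is outside the cache")
    if not target.exists():
        raise FileNotFoundError(relative)
    return target


def describe(target: Path) -> dict:
    """Summarize one file or one directory level (hypothetical shape)."""
    if target.is_dir():
        children = sorted(p.name for p in target.iterdir())
        return {"type": "directory", "children": children}
    return {"type": "file", "size": target.stat().st_size}
```

Each request then costs one path resolution plus, for a directory, a single-level listing, regardless of how large the unpacked tree is.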
PBENCH-1321