distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0

Make `contents` API scale #3609

Closed by dbutenhof 9 months ago

dbutenhof commented 9 months ago

PBENCH-1321

The /datasets/{id}/contents API includes several unexpectedly expensive steps:

  1. Finding the tarball (by MD5 value) within the ARCHIVE tree using a glob
  2. Fully discovering all tarballs within the controller directory
  3. Unpacking the tarball into a cache directory using tar
  4. Building a "map" of the contents of the unpacked tarball subtree

This PR includes mitigations for all but the tar unpack step:

  1. Use the server.tarball-path metadata instead of searching the disk
  2. Only discover the target tarball rather than the entire controller
  3. Skip the "map" and evaluate the actual target path within the cache

Finding a tarball within our 30TB ARCHIVE tree can take many minutes, while identifying the controller directory from the tarball path takes a fraction of a second.
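As a rough illustration of that difference, here is a minimal sketch contrasting a glob search with deriving the controller directory from a stored path. The layout `<archive>/<controller>/<md5>/<name>.tar.xz` and both function names are assumptions made for the example, not the actual pbench code.

```python
from pathlib import Path
from typing import Optional


def find_by_glob(archive_root: Path, md5: str) -> Optional[Path]:
    """Slow path: scan every controller directory for the MD5 directory."""
    # Hypothetical layout: <archive_root>/<controller>/<md5>/<name>.tar.xz
    # The glob has to list each controller directory before it can match.
    return next(archive_root.glob(f"*/{md5}/*.tar.xz"), None)


def controller_from_metadata(tarball_path: str) -> Path:
    """Fast path: derive the controller directory from a stored tarball path."""
    # Pure path arithmetic on the saved value (standing in for the
    # server.tarball-path metadata); no directory scan is needed.
    return Path(tarball_path).parent.parent
```

The second function never touches the disk, so its cost does not depend on the size of the ARCHIVE tree.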

Depending on the number of tarballs within a controller (some have many), full controller discovery has been observed to take half a minute, while populating only the target tarball takes a fraction of a second.
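The idea of targeted discovery can be sketched as follows. This is only an illustration under the same assumed layout (`<controller>/<md5>/<name>.tar.xz`); `discover_all` and `discover_one` are hypothetical names, not the cache manager's real interface.

```python
from pathlib import Path
from typing import Dict


def discover_all(controller_dir: Path) -> Dict[str, Path]:
    """Full discovery: visit every tarball under the controller directory."""
    # The cost grows with the number of tarballs in the controller.
    return {p.name: p for p in controller_dir.glob("*/*.tar.xz")}


def discover_one(tarball_path: Path) -> Dict[str, Path]:
    """Targeted discovery: register only the requested tarball."""
    # A single existence check, independent of the controller's size.
    if not tarball_path.is_file():
        raise FileNotFoundError(tarball_path)
    return {tarball_path.name: tarball_path}
```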

Building the map for a large tarball tree can take minutes, whereas discovery of the actual relative file path within the cache runs at native (Python) file system speeds.
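A minimal sketch of the two approaches, assuming the tarball is already unpacked under a cache directory. The function names and the shape of the returned dictionaries are invented for illustration and do not reflect the actual API response.

```python
import os
from pathlib import Path


def build_map(root: Path) -> dict:
    """Expensive: walk the whole unpacked tree before answering anything."""
    tree = {}
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        tree[rel] = {"directories": sorted(dirnames), "files": sorted(filenames)}
    return tree


def describe_target(root: Path, relative: str) -> dict:
    """Cheap: look only at the requested path inside the cache."""
    base = root.resolve()
    target = (base / relative).resolve()
    # Reject paths that escape the cache directory (e.g. "../x").
    if target != base and base not in target.parents:
        raise ValueError(f"{relative!r} is outside the cache")
    if target.is_dir():
        entries = list(target.iterdir())
        return {
            "type": "directory",
            "directories": sorted(e.name for e in entries if e.is_dir()),
            "files": sorted(e.name for e in entries if e.is_file()),
        }
    if target.is_file():
        return {"type": "file", "name": target.name, "size": target.stat().st_size}
    raise FileNotFoundError(relative)
```

`describe_target` lists at most one directory per request, whereas `build_map` visits every directory in the tree regardless of which path was asked for.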

dbutenhof commented 9 months ago

Bad coverage means more test cases. And more test cases means ... more edge cases to fix. 😆