distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0

Operations cleanup #3589

Closed dbutenhof closed 9 months ago

dbutenhof commented 9 months ago

This addresses several issues encountered while monitoring the migration of tarballs from the passthrough server backup directories to the new production server.

First, I've seen PUT /upload problems more frequently than anticipated, and when transferring thousands of tarballs the error details are easily buried in the output: I've improved the way they're captured, and they're now reported together at the end. Also, having observed that many of the failures come back as NGINX HTML-format response messages, I decided to try using BeautifulSoup to scrape the text of the <title> tag, which seems to contain the real error message.
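A minimal sketch of the title-scraping idea, not the code in this PR. To stay dependency-free it swaps BeautifulSoup for the standard library's `html.parser`; with BeautifulSoup the equivalent lookup would be roughly `BeautifulSoup(body, "html.parser").title.string`. The names `TitleParser`, `error_message`, and `summarize`, and the sample NGINX page, are hypothetical and only illustrate the approach.

```python
from collections import Counter
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tags arrive lowercased; only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def error_message(body: str) -> str:
    """Return the <title> text if the body has one, else the stripped body."""
    parser = TitleParser()
    parser.feed(body)
    parser.close()
    return parser.title or body.strip()


def summarize(failures):
    """Count failures by message; `failures` is a list of (tarball, body) pairs."""
    return Counter(error_message(body) for _, body in failures)
```

Grouping by message is one way to keep a handful of distinct errors visible at the end of a run that uploads thousands of tarballs; the report format used by the actual migration utility may differ.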

Second, I ran into a set of tarballs from 2020 whose metadata.log files don't contain run.controller values. These, it turns out, fall into a hole in intake processing. When there's no metadata.log at all, we just ignore the problem and use a default "controller" of unknown, but when the file exists and that specific value is missing, we fail the upload entirely with a poorly worded error message. It makes more sense to treat a missing run.controller the same way as a missing metadata.log.
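The intended behavior can be sketched as follows. This is an illustration under assumptions, not the server's intake code: it assumes metadata.log is INI-style text readable by `configparser`, with `controller` as an option in a `[run]` section, and the function name `controller_name` is hypothetical. Treating an empty value like a missing one is also an assumption.

```python
import configparser
from typing import Optional

DEFAULT_CONTROLLER = "unknown"


def controller_name(metadata_text: Optional[str]) -> str:
    """Return run.controller, or the default when it cannot be found.

    `metadata_text` is the content of metadata.log, or None when the
    tarball has no metadata.log at all.
    """
    if metadata_text is None:
        # No metadata.log: use the default rather than failing.
        return DEFAULT_CONTROLLER
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(metadata_text)
    # `fallback` covers both a missing [run] section and a missing option,
    # so both cases behave like a missing file.
    value = config.get("run", "controller", fallback=DEFAULT_CONTROLLER)
    # Assumption: an empty value is treated the same as a missing one.
    return value or DEFAULT_CONTROLLER
```

A malformed file would still raise a `configparser.Error` from `read_string` here; how that case should be handled is outside the scope of this sketch.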

Third, I've seen indexing failures on large "batches" (trying to index thousands of datasets in one run of the indexer) blowing up with memory problems that don't reproduce. Although it's not obvious from glancing through the main indexer loop, it seems likely there's a memory leak somewhere that gradually builds up over a long run. Since I can't find it (and I'm on vacation, so I didn't look excessively hard), I took another approach I'd considered earlier anyway and rejiggered Sync.update to allow adding a SQL LIMIT to the query for READY datasets. This shouldn't have much impact on throughput, as the indexer is serial and restarts every minute unless a previous run is still busy, but it may keep the memory buildup below the danger threshold.
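The batching idea can be illustrated with a small standalone sketch. It uses the standard library's `sqlite3` rather than the server's actual database layer, and the `datasets` table, its columns, and the `ready_datasets` and `make_demo_db` functions are all hypothetical; only the optional LIMIT on the READY query reflects the change described above.

```python
import sqlite3
from typing import List, Optional


def ready_datasets(conn: sqlite3.Connection, limit: Optional[int] = None) -> List[str]:
    """Return names of READY datasets, oldest first, optionally capped by `limit`."""
    sql = "SELECT name FROM datasets WHERE state = ? ORDER BY id"
    params: list = ["READY"]
    if limit is not None:
        # Cap the batch so one indexer run handles a bounded number of datasets.
        sql += " LIMIT ?"
        params.append(limit)
    return [row[0] for row in conn.execute(sql, params)]


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with alternating READY and DONE rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO datasets (name, state) VALUES (?, ?)",
        [(f"ds{i}", "READY" if i % 2 == 0 else "DONE") for i in range(10)],
    )
    return conn
```

With a limit in place, each run picks up at most that many READY datasets and leaves the rest for the next run a minute later, so per-process memory growth is bounded by the batch size rather than by the size of the backlog.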

Only the migration utility changes have actually been tested "live", but the tests run.