distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0

An assortment of Pbench Ops fixes and fun #3612

Closed dbutenhof closed 5 months ago

dbutenhof commented 6 months ago

This fixes several issues observed during ops review:

  1. The /api/v1/endpoints API fails if the server is shut down
  2. tar unpack errors can produce enormous stderr output, which is captured in the Audit log; truncate it to 5 KB
  3. Change the pbench-audit utility to use dateutil.parser instead of click.DateTime() so we can include fractional seconds and timezone.
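As a rough illustration of item 2, here is a minimal sketch of capping captured stderr before it is stored. The function name, the marker text, and the reading of the limit as 5120 bytes are all assumptions for illustration, not the code in this PR.

```python
MAX_STDERR_BYTES = 5 * 1024  # assumed interpretation of the 5K limit


def truncate_output(text: str, limit: int = MAX_STDERR_BYTES) -> str:
    """Return text unchanged if it fits, else its head plus a short marker.

    Hypothetical helper: the marker is appended after the cut, so the
    result is slightly longer than `limit` bytes.
    """
    raw = text.encode("utf-8")
    if len(raw) <= limit:
        return text
    # Cut on the byte budget, then drop any partial multi-byte character.
    head = raw[:limit].decode("utf-8", errors="ignore")
    return f"{head}... [truncated {len(raw) - limit} bytes]"
```

Cutting on encoded bytes rather than characters keeps the stored size predictable even when tar reports non-ASCII file names.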

During the time when we broke PostgreSQL, we failed to create metadata for a number of datasets that were allowed to upload. (Whether we should allow this vs failing the upload is a separate issue.) We also want to repair the excessively large Audit attributes records. So I took a stab at some wondrous and magical SQL queries and hackery to begin a new pbench-repair utility. Right now, it repairs long audit attributes "intelligently" by trimming individual JSON key values, and it adds metadata to datasets which lack critical values. Currently, this includes server.tarball-path (which we need to enable TOC and visualization), dataset.metalog (capturing the tarball metadata.log file), and server.benchmark for visualization.
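The idea of trimming individual JSON key values, rather than chopping the serialized string, might look roughly like the sketch below. The function name and the 105-character limit are assumptions (105 is simply the artificially low value visible in the test log), and the real repair.py may work differently.

```python
import json

MAX_VALUE_LEN = 105  # hypothetical per-value limit, taken from the test log


def trim_attributes(attributes: dict, limit: int = MAX_VALUE_LEN) -> tuple[dict, list[str]]:
    """Trim over-long string values one key at a time.

    Returns the repaired dict and a list of notes describing each trim.
    """
    repaired = {}
    notes = []
    for key, value in attributes.items():
        if isinstance(value, str) and len(value) > limit:
            repaired[key] = value[:limit]
            notes.append(f"[{key}] truncated ({len(value)}) to {limit}")
        else:
            repaired[key] = value
    return repaired, notes


attrs = {"message": "x" * 107, "status": "FAILURE"}
fixed, notes = trim_attributes(attrs)
print(notes)
# The repaired attributes still serialize to valid JSON.
print(json.dumps(fixed))
```

Because only the values are shortened, the keys and the overall JSON structure survive, which a blunt truncation of the serialized text would not guarantee.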

There are other server namespace values (including expiration time) that could be repaired: I decided not to worry about that as we're not doing expiration anyway. (Though I might add it over the weekend, since it shouldn't be hard.) And there are probably other things we might want to repair in the future using this framework.
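For reference, the follow-up log in this thread reports `server.deletion set (730 days)`, so the expiration repair can be pictured as simple date arithmetic. This is only a sketch: the function name, the use of the upload time as the base, and the UTC handling are assumptions.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 730  # value shown in the repair log; treated here as a default


def deletion_time(uploaded: datetime, days: int = RETENTION_DAYS) -> str:
    """Return an ISO 8601 UTC timestamp `days` after the upload time."""
    if uploaded.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        uploaded = uploaded.replace(tzinfo=timezone.utc)
    return (uploaded.astimezone(timezone.utc) + timedelta(days=days)).isoformat()


# Hypothetical upload time, chosen for illustration only.
print(deletion_time(datetime(2024, 3, 12, 15, 20, 34, 380181, tzinfo=timezone.utc)))
```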

I tested this in a runlocal container, using psql to "break" datasets and repair them. I hacked the local repair.py with a low "max error" limit to force truncation of audit attributes:

pbench-repair --detail --errors --verify --progress 10
(22:52:08) Repairing audit
|| 60:FAILURE upload fio_rw_2018.02.01T22.40.57 [message] truncated (107) to 105
|| 116:SUCCESS apikey None [key] truncated (197) to 105
22 audit records had attributes too long
2 records were fixed
(22:52:08) Repairing metadata
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no metalog: setting from metadata.log
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.benchmark: setting 'pbench-user-benchmark'
2 server.tarball-path repairs, 0 failures
2 dataset.metalog repairs, 0 failures
2 server.benchmark repairs

dbutenhof commented 6 months ago

FYI:

I faked broken metadata by using psql to delete some server and metalog rows:

|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
|| Isolator directory /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa contains multiple tarballs: ['/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz', '/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_mock_2020.02.27T22.16.14.tar.xz']
(16:01:28) Found ['/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz', '/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_mock_2020.02.27T22.16.14.tar.xz'] for ID 08516cc7448035be2cc502f0517783fa
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 server.deletion set (730 days) to 2026-03-12T15:20:34.380181+00:00
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
(16:01:29) Found /srv/pbench/archive/fs-version-001/dhcp31-44.perf.lab.eng.bos.redhat.com/22a4bc5748b920c6ce271eb68f08d91c/fio_rw_2018.02.01T22.40.57.tar.xz for ID 22a4bc5748b920c6ce271eb68f08d91c
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-44.perf.lab.eng.bos.redhat.com/22a4bc5748b920c6ce271eb68f08d91c/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 server.deletion set (730 days) to 2026-03-12T15:20:33.301420+00:00
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
|| Isolated tarball /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz MD5 doesn't match isolator 45f0e2af41977b89e07bae4303dc9972
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 doesn't seem to have a tarball
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no metalog: setting from default
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 server.deletion set (730 days) to 2026-03-12T15:20:33.441340+00:00
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.benchmark: setting 'unknown'
|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
(16:01:29) Found /srv/pbench/archive/fs-version-001/rhel8-1/4b8da5832aa9c7c6a21dc74123b8968b/uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57.tar.xz for ID 4b8da5832aa9c7c6a21dc74123b8968b
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/rhel8-1/4b8da5832aa9c7c6a21dc74123b8968b/uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57.tar.xz
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no metalog: setting from metadata.log
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 server.deletion set (730 days) to 2026-03-12T15:20:33.609509+00:00
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no server.benchmark: setting 'uperf'
4 server.tarball-path repairs, 1 failures
4 server.deletion repairs, 0 failures
4 dataset.metalog repairs, 0 failures
4 server.benchmark repairs
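
The "MD5 doesn't match isolator" message suggests that each tarball lives in a directory named after its MD5 hash. Under that assumption, the check could be sketched as below; the function names are hypothetical and this is not taken from the repair utility itself.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_isolator(tarball: Path) -> bool:
    """True if the tarball's MD5 equals the name of its parent directory."""
    return md5_of(tarball) == tarball.parent.name
```

A tarball that fails this check, as the vmstat example does in the log above, would be skipped rather than recorded as the dataset's server.tarball-path.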