cloudigrade / cloudigrade

A tool for tracking and reporting RHEL usage in public clouds
https://cloudigra.de
GNU General Public License v3.0

Upgrade to UBI9 (carefully!!) #1314

Closed infinitewarp closed 2 years ago

infinitewarp commented 2 years ago

Summary

As a cloudigrade sysop, I want cloudigrade running the latest supported UBI because it has all the security patches! We are currently stuck on UBI8 due to performance regressions found in production when we upgraded to UBI9.

We need to run some kind of performance testing on cloudigrade with UBI9 before we merge, tag, etc.

The performance issues we saw in prod with UBI9 were also probably intertwined with Watchtower logging. At the time, we were using Watchtower with use_queues set to False (in all but the API pods) due to multiprocessing compatibility problems with Celery. We fixed that underlying multiprocessing issue by giving Celery new arguments that force each worker pod to run exactly one process. This allowed us to enable use_queues and generally (even outside of the UBI8/9 concern) improved synchronous performance.

Since we largely addressed the Celery+multiprocessing+Watchtower performance problem in UBI8, will we still have problems in UBI9? Regardless, WHY did 9 trigger such poor performance? Was the mess with Watchtower use_queues really the only performance problem in UBI9? Could there be other problems lurking? An RCA would be very nice to have, but I'm not counting on one.

Acceptance Criteria

Assumptions and Questions

infinitewarp commented 2 years ago

Has jq been updated yet with UBI9? 🤞

infinitewarp commented 2 years ago

Notes for followups:

infinitewarp commented 2 years ago

To answer the earlier question about jq, it looks like the answer is no. The 1.6 build for UBI9 is generally no better than the 1.6 build for UBI8. (See "replace catastrophically slow jq with a Python script".) Comparing the available package in different image tags…

CHECK_JQ="jq --version ; echo ; rpm -qlivP jq ; echo"
INSTALL_JQ="microdnf update -y >/dev/null 2>&1 && microdnf install -y jq >/dev/null 2>&1 ; echo ; ${CHECK_JQ}"

# latest UBI8 minimal currently available
docker run --privileged -it --entrypoint bash registry.access.redhat.com/ubi8/ubi-minimal:8.6-941 -c "${INSTALL_JQ}" > /tmp/ubi-8.6-941.txt

# latest UBI9 minimal currently available
docker run --privileged -it --entrypoint bash registry.access.redhat.com/ubi9/ubi-minimal:9.0.0-1644 -c "${INSTALL_JQ}" > /tmp/ubi-9.0.0-1644.txt

# when postigrade still had jq 1.5 (I couldn't quickly find a cloudigrade tag from around the same time)
docker run --privileged -it --entrypoint bash quay.io/cloudservices/postigrade:196a9ef -c "${CHECK_JQ}" > /tmp/postigrade-196a9ef.txt

# latest cloudigrade currently available
docker run --privileged -it --entrypoint bash quay.io/cloudservices/cloudigrade:aea8323 -c "${CHECK_JQ}" > /tmp/cloudigrade-aea8323.txt

All the recent tags have jq 1.6, but the RPMs have slightly different build info.

❯ grep Version /tmp/*.txt
/tmp/cloudigrade-aea8323.txt:Version     : 1.6
/tmp/postigrade-196a9ef.txt:Version     : 1.5
/tmp/ubi-8.6-941.txt:Version     : 1.6
/tmp/ubi-9.0.0-1644.txt:Version     : 1.6

Full outputs attached: cloudigrade-aea8323.txt postigrade-196a9ef.txt ubi-8.6-941.txt ubi-9.0.0-1644.txt
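
To see how the build info differs without opening each attachment, the relevant fields can be pulled out of the saved reports. This is a minimal sketch, assuming the reports contain the standard `rpm -qi` header fields (Version, Release, Build Date); `rpm_build_info` is a hypothetical helper name, not something from the cloudigrade repo.

```shell
# Hypothetical helper: print the Version, Release, and Build Date fields
# from one or more saved `rpm -qi` reports, prefixed with the file name.
rpm_build_info() {
    grep -H -E '^(Version|Release|Build Date) *:' "$@"
}

# Example (assumes the /tmp/*.txt files captured above exist):
#   rpm_build_info /tmp/*.txt
```

Running it over the four captured files would show the Release and Build Date lines next to each Version line.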

So, trying a crude little benchmark...

INSTALL_JQ="microdnf update -y >/dev/null 2>&1 && microdnf install -y jq >/dev/null 2>&1"
BENCH='START="$(date "+%s")"; for _ in $(seq 1 500); do echo "[]" | jq . > /dev/null; done; END="$(date "+%s")"; echo "$((END - START)) seconds"'

# latest UBI8 minimal currently available
docker run --privileged -it --entrypoint bash registry.access.redhat.com/ubi8/ubi-minimal:8.6-941 -c "${INSTALL_JQ} ; ${BENCH}" > /tmp/jq-time-ubi-8.6-941.txt

# latest UBI9 minimal currently available
docker run --privileged -it --entrypoint bash registry.access.redhat.com/ubi9/ubi-minimal:9.0.0-1644 -c "${INSTALL_JQ} ; ${BENCH}" > /tmp/jq-time-ubi-9.0.0-1644.txt

# when postigrade still had jq 1.5 (I couldn't quickly find a cloudigrade tag from around the same time)
docker run --privileged -it --entrypoint bash quay.io/cloudservices/postigrade:196a9ef -c "${BENCH}" > /tmp/jq-time-postigrade-196a9ef.txt

# latest cloudigrade currently available
docker run --privileged -it --entrypoint bash quay.io/cloudservices/cloudigrade:aea8323 -c "${BENCH}" > /tmp/jq-time-cloudigrade-aea8323.txt

This confirms that the old jq 1.5 is still an order of magnitude faster than any of the 1.6 versions here.

❯ for F in /tmp/jq-time-*.txt; do echo -n "$F "; cat $F; done
/tmp/jq-time-cloudigrade-aea8323.txt 15 seconds
/tmp/jq-time-postigrade-196a9ef.txt 1 seconds
/tmp/jq-time-ubi-8.6-941.txt 15 seconds
/tmp/jq-time-ubi-9.0.0-1644.txt 12 seconds
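
Because `date "+%s"` only has one-second resolution, the 1-second result for jq 1.5 is a rough figure. A finer-grained timer could look like the sketch below. It assumes GNU `date` (for the `%N` nanoseconds format), and `bench_ms` is a hypothetical helper, not part of the benchmark that produced the numbers above.

```shell
# Hypothetical helper: feed INPUT to a command RUNS times and print the
# mean wall-clock time per run in whole milliseconds.
# Assumes GNU date, whose %N format yields nanoseconds.
# The body runs in a subshell ( ... ) so its variables stay private.
bench_ms() (
    runs="$1"
    input="$2"
    shift 2
    start="$(date +%s%N)"
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Same shape as the original benchmark: pipe the input in, discard output.
        printf '%s\n' "$input" | "$@" > /dev/null 2>&1
        i=$((i + 1))
    done
    end="$(date +%s%N)"
    # Nanoseconds elapsed, averaged over the runs, converted to milliseconds.
    echo "$(( (end - start) / runs / 1000000 ))"
)

# Example (assumes jq is installed in the container):
#   bench_ms 500 '[]' jq .
```

Run inside each image, this would report the mean milliseconds per jq invocation rather than total seconds, which makes the faster builds easier to tell apart.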

infinitewarp commented 2 years ago

A slightly cleaner version of my test script now lives here: https://github.com/infinitewarp/cloudigrade-perf-test

I just ran that against an older cloudigrade commit and then over the latest master converted to ubi9, and it worked the same as before.

The full runs from last week's tests (on 2022-09-13) are captured at:

abellotti commented 2 years ago

@infinitewarp thank you for checking on ye-old jq. Also, 👍 on https://github.com/infinitewarp/cloudigrade-perf-test repo. We can re-run on minor release updates of ubi-9 to see if there are any incremental improvements and decide on pulling the switch when appropriate.