Database issues can lead to "stuck" images

andrewm-aero commented 3 years ago

Is this a request for help?: No

Is this a BUG REPORT or a FEATURE REQUEST? (choose one): BUG REPORT

Version of Anchore Engine and Anchore CLI if applicable: 0.9.1

What happened: If the database is inaccessible when an image finishes analysis, the image becomes "stuck": no analyzer will ever attempt to analyze it again, and it is impossible to delete the image or retry the analysis. I could not find any documented method for resolving this.

Timeline:

1. Image is submitted for analysis
2. Analyzer begins analysis
3. Database enters recovery mode
4. Analyzer finishes analysis
5. Analyzer retries to simplequeue are exhausted, image is not recorded as analyzed
6. Database leaves recovery mode
7. Image is now "stuck", it can never be analyzed

What did you expect to happen: I expected this situation to be detected and resolved automatically, or for there to be a manual "escape hatch" that lets an administrator force an image to restart analysis via the CLI or REST API.

Any relevant log output from /var/log/anchore: Sorry, I had hoped that I could force it to restart by deleting the analyzer pod in question, and did not think to save the logs beforehand, so I do not have the analyzer logs.

[service:simplequeue] 2021-09-16 16:42:23+0000 [-] "10.1.12.206" - - [16/Sep/2021:16:42:22 +0000] "GET /v1/queues/watcher_tasks?wait_max_seconds=30&visibility_timeout=0 HTTP/1.1" 500 138 "-" "python-requests/2.23.0"
[service:simplequeue] 2021-09-16 16:42:23+0000 [-] [Thread-8] [anchore_engine.services.simplequeue/handle_metrics()] [WARN] handler failed - exception: (psycopg2.OperationalError) FATAL:  the database system is in recovery mode
[service:simplequeue]

What docker images are you using: docker.io/anchore/anchore-engine:v0.9.4

How to reproduce the issue:

1. Submit an image for analysis
2. Wait for analyzer to begin analysis
3. Render database inaccessible
4. Wait for analyzer to finish analysis and exhaust retries
5. Restore database functionality
6. Attempt to delete the image, re-add the same image, or wait for analysis to finish. This will fail, do nothing, and wait until the heat death of the universe, respectively.

Anything else we need to know: Potential remediation: Generate a random "analysis ID" when an image is added. Allow deletion of images in the "analyzing" state. Have simplequeue disregard any completed analysis from an analyzer if the "analysis ID" does not match, or if the image is not known. This could allow recovery from this state by simply deleting and re-adding all stuck images.
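
To make the proposed remediation concrete, here is a minimal in-memory sketch of the "analysis ID" idea. It is not anchore-engine code: the `ImageCatalog` class, its method names, and the status strings are all hypothetical and only illustrate the intended behaviour (deletion allowed while analyzing, stale or unknown results disregarded).

```python
import uuid


class ImageCatalog:
    """Toy in-memory stand-in for the image catalog (hypothetical, not anchore-engine code)."""

    def __init__(self):
        # image digest -> {"status": str, "analysis_id": str}
        self._images = {}

    def add_image(self, digest):
        """Register an image for analysis and return a fresh random analysis ID."""
        analysis_id = uuid.uuid4().hex
        self._images[digest] = {"status": "analyzing", "analysis_id": analysis_id}
        return analysis_id

    def delete_image(self, digest):
        """Deletion is allowed in any state, including 'analyzing'."""
        self._images.pop(digest, None)

    def complete_analysis(self, digest, analysis_id):
        """Accept a result only if the image is known and the analysis ID matches."""
        record = self._images.get(digest)
        if record is None or record["analysis_id"] != analysis_id:
            return False  # stale or unknown result: disregard it
        record["status"] = "analyzed"
        return True

    def status(self, digest):
        """Return the image status, or None if the image is not known."""
        record = self._images.get(digest)
        return record["status"] if record else None


# Recovery path: delete the stuck image and re-add it.
catalog = ImageCatalog()
stale_id = catalog.add_image("sha256:abc")
catalog.delete_image("sha256:abc")
fresh_id = catalog.add_image("sha256:abc")
late_result_accepted = catalog.complete_analysis("sha256:abc", stale_id)
fresh_result_accepted = catalog.complete_analysis("sha256:abc", fresh_id)
```

With this scheme, a late result from the original analyzer run is rejected because its ID no longer matches, while the result from the re-added image is accepted.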

zhill commented 3 years ago

Hi @andrewm-aero, thanks for the detailed write-up. There is a timeout for images stuck in the 'analyzing' state that should trigger after 10 hours (yes, it's very long, to ensure no overlap with a legitimate but very slow analysis of very large (10+GB) images). Once that timeout kicks in, it should put the image back in the analysis queue.
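
As a rough illustration of that timeout behaviour (a sketch only, not the actual engine implementation), the logic amounts to something like the following. The record layout, the function names, and the `not_analyzed` status string are assumptions made for the example; only the 10-hour value comes from the description above.

```python
from datetime import datetime, timedelta, timezone

# 10 hours, as described above; everything else here is a hypothetical stand-in.
ANALYZING_TIMEOUT = timedelta(hours=10)


def find_stale_images(images, now, timeout=ANALYZING_TIMEOUT):
    """Return digests of images that have been 'analyzing' for longer than the timeout."""
    return [
        digest
        for digest, record in images.items()
        if record["status"] == "analyzing" and now - record["since"] > timeout
    ]


def requeue_stale(images, queue, now, timeout=ANALYZING_TIMEOUT):
    """Reset stale images and append them to the analysis queue."""
    for digest in find_stale_images(images, now, timeout):
        images[digest]["status"] = "not_analyzed"
        images[digest]["since"] = now
        queue.append(digest)
    return queue


# Example: one image stuck for 11 hours, one legitimately analyzing for 1 hour.
now = datetime(2021, 9, 17, 6, 0, tzinfo=timezone.utc)
images = {
    "sha256:stuck": {"status": "analyzing", "since": now - timedelta(hours=11)},
    "sha256:slow": {"status": "analyzing", "since": now - timedelta(hours=1)},
}
queue = requeue_stale(images, [], now)
```

Only the image that exceeded the timeout is re-queued; the slower but still legitimate analysis is left alone.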

andrewm-aero commented 3 years ago

Thank you for the response. That's all well and good, but is there no way to do this manually? As an admin, I can very clearly see when an analyzer failed, and that no other analyzers are trying. Even if I have to manually write a curl request, I'm sure you'd agree that'd be better than holding things up for 10 hours.

anchore / anchore-engine

Database issues can lead to "stuck" images #1221