A closer look at the logs reveals that three jobs appeared not to have made it onto the queue, although they actually did. Two of them happened to succeed before the observer cleaned up the dead worker. The one of the three that was still running was aborted at the time of cleanup.
Twice today, I saw a job aborted for no readily apparent reason, and the associated worker failed.
The first occurrence was with v2.3.1 and the second was after upgrading the dogfood cluster to v2.4.0-rc.2.
Here's the data I gathered from the second occurrence:
Worker logs reveal this:
An internal server error from the API server should have been logged on the API server's side as well, so I looked there next:
This has the appearance of the job failing to be placed on the queue, for reasons I cannot yet diagnose.
However, if I look at the logs for that job, they exist. The job ran. In fact, it executed hundreds of unit tests before it was forcefully terminated. This means the job did make it onto the queue after all... and this explains a lot...
After the worker failed (or so it thought) to create the new job, it exited non-zero. The observer correctly recorded this as a worker failure. Eventual worker cleanup then forcefully killed the running job, which had, in fact, been created.
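To make that failure mode concrete, here is a minimal Go sketch, not Brigade's actual worker code; the endpoint, payload, and function names are invented. It shows a client that treats any error response from a create-job request as "the job was never created" and exits non-zero, even though the server may have already enqueued the job before whatever failed:

```go
// Hypothetical illustration of the ambiguous-failure pattern described above.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

// createJob asks a (hypothetical) API server to create a job.
func createJob(apiServer, payload string) error {
	resp, err := http.Post(apiServer+"/v2/jobs", "application/json", bytes.NewBufferString(payload))
	if err != nil {
		// Transport-level failure: we genuinely don't know whether the
		// server ever saw the request.
		return fmt.Errorf("creating job: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		// An internal server error lands here. The server may have already
		// enqueued the job before some later step failed, but the client
		// cannot tell that from the status code alone.
		return fmt.Errorf("creating job: unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	if err := createJob("http://api-server.example.internal", `{"name":"unit-tests"}`); err != nil {
		// The worker treats the error as "job never created" and exits
		// non-zero; the observer then records a worker failure, and its
		// eventual cleanup aborts the job that was, in reality, running.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```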
There's more to investigate here, but the fundamental question is probably how we ended up with an error that makes it appear as though the job never made it onto the queue, even though it very clearly did.
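Purely as speculation about that question, here is one shape the bug could take, sketched in Go. This is not the API server's real code; `enqueue` and `recordStatus` are hypothetical placeholders. The point is that if anything fails after the enqueue but before the handler returns, the whole request is reported as an internal server error while the job is already on the queue:

```go
// Hypothetical create-job handler that enqueues the job and then fails.
package main

import (
	"errors"
	"net/http"
)

var errStoreUnavailable = errors.New("status store unavailable")

// enqueue and recordStatus stand in for whatever the real handler does;
// both are invented for illustration.
func enqueue(jobName string) error      { return nil }
func recordStatus(jobName string) error { return errStoreUnavailable }

func createJobHandler(w http.ResponseWriter, r *http.Request) {
	jobName := r.URL.Query().Get("name")

	// Step 1 succeeds: the job is now on the queue and will run.
	if err := enqueue(jobName); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Step 2 fails after the enqueue, so the whole request is reported as a
	// 500. Nothing un-enqueues the job, so the caller's view ("not created")
	// and reality ("running") diverge.
	if err := recordStatus(jobName); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.WriteHeader(http.StatusCreated)
}

func main() {
	http.HandleFunc("/v2/jobs", createJobHandler)
	_ = http.ListenAndServe(":8080", nil)
}
```

Whether anything like this is what actually happened still needs to be confirmed from the API server code and logs.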