ConservationMetrics / map-packer

A Nuxt app to allow users to generate and manage offline map requests to mapgl-tile-renderer

Debug Azure task worker replicas that are crashing while processing #30

Open rudokemper opened 4 months ago

rudokemper commented 4 months ago

We have observed a few times that a task worker replica crashes partway through processing a render request, leaving the request stuck in PROCESSING indefinitely. (It doesn't look like the message is ever returned to the queue, but I could be wrong.)

There is an existing issue to assign FAILED status to a map request after it has remained in PROCESSING for a reasonable timeframe, but we should also debug why this is happening in the first place. It does not seem to be related to hitting a container filesize or memory threshold, since the task worker has successfully processed much larger offline map requests than the ones that are currently crashing replicas.
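For reference, the FAILED-after-timeout idea could be sketched roughly as below. This is only an illustration, not how map-packer stores or updates requests: the record fields (`id`, `status`, `processing_since`), the `SUCCEEDED` status name, and the 6-hour cutoff are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical cutoff; the issue only says "a reasonable timeframe".
PROCESSING_TIMEOUT = timedelta(hours=6)


def find_stale_requests(requests, now, timeout=PROCESSING_TIMEOUT):
    """Return ids of requests that have sat in PROCESSING longer than `timeout`."""
    stale = []
    for req in requests:
        if req["status"] == "PROCESSING" and now - req["processing_since"] > timeout:
            stale.append(req["id"])
    return stale


# Hypothetical records; field names and the SUCCEEDED status are illustrative only.
requests = [
    {"id": 1, "status": "PROCESSING", "processing_since": datetime(2024, 6, 3, 10, 51)},
    {"id": 2, "status": "PROCESSING", "processing_since": datetime(2024, 6, 3, 17, 30)},
    {"id": 3, "status": "SUCCEEDED", "processing_since": datetime(2024, 6, 3, 9, 0)},
]

stale_ids = find_stale_requests(requests, now=datetime(2024, 6, 3, 18, 0))
print(stale_ids)  # [1]
```

Anything returned here would be a candidate for being marked FAILED. Note that a cutoff shorter than the time it takes for a message to be retried would wrongly fail requests that later complete.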

There is an existing map request left in this indefinite state in our (CMI) online deployment; we could use similar parameters to submit a new request, and follow the log of a replica as it processes the request.

@IamJeffG can you think of any postmortem ways to figure out what happened with a task worker replica when it crashes? From what I can tell, after the fact, the container app just spins up another replica and there is no trace left from any that were killed.

rudokemper commented 4 months ago

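I was wrong: the message is eventually returned to the queue. We have seen this before: some successful map requests show render results like the following, where the time between request and completion (over two hours) is far longer than the task duration itself (about 12 minutes):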

Requested on: 03 June 2024 at 10:51
Finished on: 03 June 2024 at 13:04
Task Duration: 0h 12m 22s

Since the map request is eventually re-published to the message queue, this is a less severe situation than I first thought, but I'd still like to debug why the first replica failed to complete the request. (We know that a replica picked it up in the first place, since the status was set to PROCESSING, which is the first thing mapgl-tile-renderer does when it receives a message from the Azure Storage Queue (ASQ).)
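One way to spot other requests that likely went through a failed first attempt is to compare the wall-clock time between request and completion with the reported task duration. A minimal sketch, assuming the timestamps are available as strings in the format shown above (the function name and format constant are made up for the example, and month-name parsing assumes an English locale):

```python
from datetime import datetime, timedelta

# Matches strings like "03 June 2024 at 10:51" (assumes an English locale for %B).
TIMESTAMP_FORMAT = "%d %B %Y at %H:%M"


def unaccounted_time(requested_on, finished_on, task_duration):
    """Return the time between request and finish that the final task run does not explain."""
    requested = datetime.strptime(requested_on, TIMESTAMP_FORMAT)
    finished = datetime.strptime(finished_on, TIMESTAMP_FORMAT)
    return (finished - requested) - task_duration


gap = unaccounted_time(
    "03 June 2024 at 10:51",
    "03 June 2024 at 13:04",
    timedelta(minutes=12, seconds=22),
)
print(gap)  # 2:00:38
```

A large gap does not prove a crash on its own, since it also includes any time the message spent waiting in the queue, but it is a cheap filter for requests worth checking against the logs.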

IamJeffG commented 4 months ago

I don't think we have a postmortem way to do this.

I have just set up Application Insights on our deployed Container App Environment that will (going forward) retain logs for 30 days.