celery / django-celery-results

Celery result backend with Django

TimeLimitExceeded failures are not recorded to results table #367

Open ericswpark opened 1 year ago

ericswpark commented 1 year ago

When a task fails because it exceeds the specified time limit, the FAILURE state is not saved to the results table.

I have to manually check the status of the task using a tool like Flower to see whether tasks have failed or not.

This may be related to issue #37, as I use update_state() to send back progress data to Django.

Here is the relevant code snippet:

https://github.com/shipperstack/shipper/blob/master/core/tasks.py#L101-L103
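For context, the snippet is roughly along these lines (a minimal sketch with illustrative names, not the actual code at that link):

```python
from celery import shared_task

@shared_task(bind=True)
def upload_build(self, build_id):
    """Sketch of a task that reports SFTP upload progress via update_state()."""

    def progress_callback(transferred, total):
        # Each call pushes a custom PROGRESS state into the result backend
        # (django-celery-results, in this case).
        self.update_state(
            state="PROGRESS",
            meta={"current": transferred, "total": total},
        )

    # The real task hands progress_callback to paramiko's SFTPClient.put();
    # a single call stands in for it here to keep the sketch self-contained.
    progress_callback(0, 100)
```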

auvipy commented 1 year ago

Can you check this PR? https://github.com/celery/django-celery-results/pull/366

ericswpark commented 1 year ago

@auvipy I'll test the PR once it makes it into a release.

ericswpark commented 1 year ago

@auvipy after testing with 2.5.0 I am still getting the same behavior, unfortunately.

Flower says the task has failed with a TimeLimitExceeded exception:

[screenshot: Flower task detail showing the TimeLimitExceeded failure]

However on the results page it still shows PROGRESS for the task state:

[screenshot: results page showing the task state as PROGRESS]

auvipy commented 1 year ago

I suggest digging deeper into the issue to find the root cause.

ericswpark commented 1 year ago

@auvipy I tried looking through the django-celery-results code, but I don't understand how it works. Do Celery workers/tasks send results to django-celery-results using some sort of callback, or does django-celery-results poll the workers/tasks for the current status/progress?

If it's the former, I'm thinking one of the following:

  1. The Celery task fails, but the worker does not report the failure to django-celery-results for some reason.
  2. The progress update function is called, then the task fails. Celery reports the failure to django-celery-results, which records it in the database. Then the earlier progress update finally finishes writing to the database, overwriting the FAILURE status. But I highly doubt a race condition is the issue here, as nearly all of the tasks fail to show a FAILURE status.
  3. Tasks launched with django-celery-beat are not properly reflected in the results table: only the task entry is created, and the SUCCESS/FAILURE status is never written.

Any ideas here would be appreciated!

ericswpark commented 1 year ago

I figured out that any invocation of the progress state update causes the task's state to get "stuck" in the results table:

https://github.com/shipperstack/shipper/blob/master/core/tasks.py#L101-L107

I think this is because paramiko (the SFTP library used within the task) invokes its progress callback asynchronously, so when the time limit is reached, the last state update lands after the FAILURE result has been recorded to the database.

I guess I could have the callback check whether the state is already FAILURE before updating, but I'm not sure that wouldn't introduce a race condition as well. Ideally a progress update should never overwrite a SUCCESS/FAILURE status.
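Something along these lines is what I have in mind (a sketch; reading the stored state back through AsyncResult is my assumption, and the check-then-update is still not atomic):

```python
from celery import states
from celery.result import AsyncResult

def progress_callback(task, transferred, total):
    # Read back whatever state is currently stored in the result backend.
    current = AsyncResult(task.request.id).state
    # Skip the update once the task is already in a ready state
    # (SUCCESS/FAILURE/REVOKED), so a late callback cannot clobber it.
    if current in states.READY_STATES:
        return
    task.update_state(
        state="PROGRESS",
        meta={"current": transferred, "total": total},
    )
```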

ericswpark commented 1 year ago

@auvipy would it be possible to have django-celery-results wait for all asynchronous callbacks before writing the FAILURE result to the database? On Flower, the FAILURE result stemming from a TimeLimitExceeded exception is shown regardless of whether the state update is enabled.

I've tried alternatives, like not setting the state and only setting the meta information, but that does not work because the state then gets set to null. And as mentioned before, checking the state in the async callback is also problematic, since there could always be a race condition.
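For reference, the meta-only variant I tried looked roughly like this (sketch):

```python
def progress_callback(task, transferred, total):
    # Attempted variant: pass only meta and hope the stored state is preserved.
    # update_state() defaults state to None and the backend stores whatever
    # state it is given, so the row ends up with a null state instead.
    task.update_state(meta={"current": transferred, "total": total})
```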

auvipy commented 1 year ago

The problem is that none of the Celery packages support async/await as of yet.