Closed saulbein closed 11 months ago
Thanks for opening your first issue here! Be sure to follow the issue template!
Is it possible that you upgrade to the latest released version of Airflow and test it @saulbein? We had a number of stability fixes in 2.4 (and 2.5.0 is ~already~ about to be released), and while I understand that upgrading in the hope of fixing an issue might not seem justified, it often happens that the issue itself, or similar/vaguely related issues, get fixed. We have reports that a number of stuck/queued-task problems have been fixed by some of the changes, and for an issue that cannot be easily reproduced, the best approach is to try the latest version. Doing so is rather straightforward, and if the issue happens relatively quickly for you, it might give very quick feedback without spending a lot of time on diagnosis.
And in any case - we only release fixes in the latest "minor" version - so upgrading is the only possible way to get a fix anyhow (the effort to upgrade is not lost - the upgrade has to be performed anyway).
Can you please help with that @saulbein ?
Alright, I'll see if we can upgrade this or next week. I don't see 2.5.0 in the releases though?
Right: it should be "it is about to be released" - we have just cancelled voting on RC1 and there is an RC2 coming today (it will take 72+ hours of testing/voting at least to make it into an official release).
But if you can upgrade to 2.4.3, you can do it now (and move to 2.5.0 once it is out and there are reports from the "bleeding edge" people showing that things are good). I keep repeating this: things like upgrades should be done more frequently rather than less - it makes the upgrade process far less painful overall.
2.5.0 is released now. Try it and see if you can reproduce the problem there @saulbein (and please post stack traces/logs etc. if you do).
Due to my company's Christmas development slowdown, I don't think I'll be able to get the upgrade to 2.5.0 done until some time early next year. I'll still check 2.4.3; maybe the issue was already fixed in that version.
We can wait, no worries.
Updated to 2.5.0 (unfortunately right before 2.5.1). I updated the issue description with the providers we currently use and the log itself. Now we don't really get tasks restarting anymore: anything that is running during shutdown gets killed and doesn't attempt to retry.
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
bump, still needs investigation - if I should provide more info, let me know
@eladkal -> I added needs-triage - I guess this is precisely the case where we should add it back?
Oh yeah... I thought I added it :facepalm: Now that the report is on the latest Airflow version, we need to triage/validate whether we have enough information to check this issue or whether we should ask the author for further info. So yes, we need to label it so this issue is in the queue waiting for triage.
@saulbein I'm curious if you deleted the DAG files or performed any synchronization during the restart.
ERROR - DAG '...' not found in serialized_dag table
It seems like the DAG files are missing from the dagbag. In this situation, the scheduler won't be able to schedule or queue the failed tasks.
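The error above boils down to a missing row for that `dag_id` in the metadata database's `serialized_dag` table. The lookup can be pictured with a small illustrative sketch (using `sqlite3` and a simplified two-column schema; this is not Airflow's actual schema or code):

```python
# Illustrative sketch of the "not found in serialized_dag table" check,
# using an in-memory SQLite database with a simplified schema.
import sqlite3

def dag_is_serialized(conn: sqlite3.Connection, dag_id: str) -> bool:
    """Return True if a serialized copy of the DAG exists in the table."""
    row = conn.execute(
        "SELECT 1 FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return row is not None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")
conn.execute("INSERT INTO serialized_dag VALUES ('dag_1', '{}')")

if not dag_is_serialized(conn, "dag_2"):
    # Mirrors the reported log line
    print("ERROR - DAG 'dag_2' not found in serialized_dag table")
```

The point of the sketch: if the serialized row is gone (e.g. the file disappeared long enough for the DAG to be cleaned up), the scheduler has nothing to schedule the task instance against.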
No deletes, just replacing the files that were modified. We even switched to git sync and still get the `... not found in serialized_dag table` errors when restarting the services (sadly we can't easily avoid restarting them for now).
Besides, shouldn't the scheduler be able to schedule them after they get parsed? Or does it fail once and then give up?
This error report has been stale for a while; I tried to follow the discussion while cleaning up the issue backlog. I assume that under the current release, 2.7.2, there have been many changes and we would need a refreshed error report.
One thing I am curious about from the last message is that DAG files are being replaced (during the restart?). I assume we should consider for this bug that "only" a restart of the Airflow system is made (assumption: including DB and worker?). Or do you just restart the scheduler? Is DAG parsing running within the scheduler process? If DAG parsing problems are also reported, this might be caused by poor DAG parsing performance as well: if parsing takes too long, DAGs that have not been parsed for a longer time are dropped from the database. That might be a side effect causing this message.
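As a pointer for checking the "DAGs dropped when parsing is slow" mechanism: the relevant knobs live in `airflow.cfg`. The option names below are taken from the Airflow 2.x configuration reference (note that `deactivate_stale_dags_interval` was renamed `parsing_cleanup_interval` in 2.6, so verify against the reference for your exact version):

```ini
[core]
# Fail a single DAG file import if it takes longer than this many seconds.
dagbag_import_timeout = 30

[scheduler]
# Number of processes the DAG file processor uses for parsing.
parsing_processes = 2
# How often (seconds) to scan the DAGs directory for new files.
dag_dir_list_interval = 300
# How often (seconds) DAGs whose files are no longer found get deactivated
# (named parsing_cleanup_interval from Airflow 2.6 onwards).
deactivate_stale_dags_interval = 60
```

If parsing regularly exceeds these budgets, DAGs can be deactivated even though their files still exist, which would produce exactly this kind of scheduling gap after a restart.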
Can you maybe provide more details about the tasks you execute, or add a dummy DAG to the error report that shows the error, and try to reproduce on Airflow 2.7.2?
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
Apache Airflow version
Airflow 2.5.0
What happened
We're having an issue because we sometimes need to restart Airflow services. Tasks that were running during the restart do not retry properly (with Airflow 2.3.4 some would retry successfully; now that is no longer the case). I have not managed to figure out the cause, but I can provide logs from the investigation (see below).
Discussion #27071 seems to have a similar issue as well.
What you think should happen instead
All tasks should have been marked for retry and retried.
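For context on what "marked for retry" means here: each Airflow task has `retries` and `retry_delay` parameters, and a task instance that dies should go to `up_for_retry` until its attempts are exhausted. A minimal sketch of that eligibility rule (illustrative only, not Airflow's actual scheduler code; the function name is hypothetical):

```python
from datetime import datetime, timedelta

def next_retry_time(try_number, retries, retry_delay, ended_at):
    """Illustrative sketch, not Airflow's code: a task that just finished
    attempt `try_number` is retried while try_number <= retries, and the
    next attempt becomes eligible `retry_delay` after the previous one ended."""
    if try_number > retries:
        return None  # attempts exhausted -> task should be marked failed
    return ended_at + retry_delay

# With retries=1 and a 15-minute delay (as in the DAGs reported here),
# a first attempt killed at 12:00 should be retried at 12:15:
print(next_retry_time(1, 1, timedelta(minutes=15), datetime(2023, 1, 1, 12, 0)))
```

The bug report is that tasks killed by the restart skip this path entirely and stay dead instead of becoming `up_for_retry`.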
How to reproduce
I have not managed to reproduce this issue locally; every attempt just retries the tasks as expected.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
Airflow version: release:2.5.0+fa2bec042995004f45b914dd1d66b466ccced410
Provider versions (there are more providers used by DAGs, but they should not be relevant):
apache-airflow-providers-amazon==7.0.0
apache-airflow-providers-celery==3.1.0
apache-airflow-providers-postgres==5.4.0
Deployment
Other Docker-based deployment
Deployment details
Setup with issues (runs on a k8s cluster):
Local test setup (same docker images used as the setup with issues):
Anything else
Notes:
dag_1 and dag_2 have retries set to 1 for all tasks and the retry delay is 15 mins.
Logs of when the issue happened (dag_1.task_a and dag_2.task_b both get interrupted by the deployment):
Are you willing to submit PR?
Code of Conduct