After some more digging, I've narrowed down what's causing this. It appears as though only a partial key is being used to index into the _callback_to_execute dictionary maintained by the DagFileProcessorManager.
Instead of the full path, including both the zip file name and the Python file within the zip, the dictionary is being indexed only by the path up to the zip file name.
E.g. indexing into the callback dictionary uses /files/dags/example_sla_dag_zipped.zip when the dictionary in reality has a key of /files/dags/example_sla_dag_zipped.zip/example_sla_dag_zipped.py.
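To make the mismatch concrete, here is a minimal sketch (not the actual manager code; the dictionary and paths below are stand-ins):

from collections import defaultdict

# Stand-in for DagFileProcessorManager._callback_to_execute
callback_to_execute = defaultdict(list)

# Callbacks get registered under the full path: the zip file plus the .py inside it
registered_key = "/files/dags/example_sla_dag_zipped.zip/example_sla_dag_zipped.py"
callback_to_execute[registered_key].append("SlaCallbackRequest(...)")

# ...but the later lookup only uses the path up to the zip file,
# so the registered callbacks are never found
lookup_key = "/files/dags/example_sla_dag_zipped.zip"
print(callback_to_execute.get(lookup_key))  # -> None, the callback is silently dropped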
Will continue looking into what's causing the mismatch.
A bit more digging, and I think this is the culprit (although my suggested fix may have unintended ramifications, so I'd like input from a more knowledgeable contributor). Obviously the intention of the block added in #30076 was to avoid adding the filepaths to the queue, but this unintentionally broke SLAs for packaged DAGs.
When this block is executed, the full filepath (including the zip filename) is not added to the paths, and is thus never used to index into the callback dictionary that the code above adds to.
I believe it should be something like this instead:
self.log.debug("Queuing SlaCallbackRequest for %s", request.dag_id)
self._callback_to_execute[request.full_filepath].append(request)
self._add_paths_to_queue([request.full_filepath], True) # This line is added
Stats.incr("dag_processing.sla_callback_count")
FWIW, I did add this line and things worked locally - SLA Misses were firing for DAGs within zip files
cc: @uranusjr @potiuk - guessing you might have an opinion here.
Looping in @argibbs (author of #30076 which added the change) for some potential thoughts.
Any movement/triaging done on this? We effectively reworked the way that we handle SLAs to entirely sidestep the sla_miss table and manually calculate SLA miss events, but we'd prefer a more native solution (especially since this was an undocumented regression).
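Roughly, "manually calculate SLA miss events" means something along the lines of the sketch below (illustrative only; the TaskRun structure and field names are made up, and this is not the actual workaround code or Airflow's models):

from datetime import datetime
from typing import List, NamedTuple, Optional

class TaskRun(NamedTuple):
    dag_id: str
    task_id: str
    expected_end: datetime          # schedule end plus the task's SLA
    actual_end: Optional[datetime]  # None if the task is still running

def find_sla_misses(runs: List[TaskRun], now: datetime) -> List[TaskRun]:
    """Return runs that finished after their deadline, or are still running past it."""
    misses = []
    for run in runs:
        finished_late = run.actual_end is not None and run.actual_end > run.expected_end
        still_running_late = run.actual_end is None and now > run.expected_end
        if finished_late or still_running_late:
            misses.append(run)
    return misses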
I think if @argibbs is not responding then it is up for grabs for anyone who would like to fix it. I will remove the needs-triage label, because apparently it's a real error, and hopefully someone will take a look and fix it.
But the fastest way to fix it is for someone from your team to attempt it and provide a PR, @tseruga - since you are interested and have a way to reproduce and test. Airflow is developed by > 2600 contributors, so this is an easy way to become one.
Hello, sorry, not deliberately ignoring people, just not checking my mail as often as I should.
(Insert real-life-getting-in-the-way comment here).
Haven't read the backlog yet so not sure what the problem is, but am very happy for someone else to fix the problem if they already have a good handle on it.
Otherwise will have a look when I get a chance. I will be on a brief break between jobs next month, so might have time then.
Thanks, Andrew
Ok,
Have managed to get to a real computer with a keyboard and everything and have a proper look.
First up, my apologies for breaking something, that was obviously not the intent.
So, comments:
The proposed self._add_paths_to_queue([request.full_filepath], True) is unfortunately problematic ... it would effectively undo the fix I made in the first place. You can read my PR and its associated issue for more details, but basically SLA callbacks can fire so often that the file queue never drains, and the Airflow scheduler effectively stops processing changes to dag files.
As I see it, there are two options:
i) Fix the path handling for zipped dags where the file queue is built (see prepare_file_path_queue, at approx line 1200). Quite what this involves I don't know, I've never used zipped dags, and I'd have to experiment locally as part of the fix.
ii) Go further with my SLA fix. I did originally have a more involved solution (again, if you look at my PR history, you can see what it was) that split the queues and effectively tracked which files had been processed. Then even if the SLA callbacks were spamming the queue, it wouldn't matter, because we'd ensure any dags which hadn't been refreshed after X seconds (by default I think it's 30 seconds, or maybe 60) would get priority and be refreshed. SLA callback spam would make the system less responsive to updates, but it wouldn't ignore them altogether. This change was originally abandoned because it was too big a change and people were understandably nervous, but maybe we could try again with as minimal an implementation as possible. If we did that, then the proposed fix by @tseruga would be fine.
That said, while I think the impl in (ii) would work, it does feel like one of those fix-the-symptoms-not-the-cause type things, and (i) is probably cleaner, for some value of clean.
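To illustrate the shape of option (ii), here is a toy sketch (hypothetical names, not the abandoned implementation or Airflow's actual queue logic): any file that has gone unparsed past a deadline jumps ahead of callback-driven re-queues, so SLA spam slows refreshes down but cannot starve them.

import time
from collections import deque

REFRESH_DEADLINE_SECS = 30.0  # files not parsed within this window jump the queue

class ToyFilePathQueue:
    """Toy model: SLA-driven re-queues cannot starve regular dag file refreshes."""

    def __init__(self, file_paths):
        self._callback_queue = deque()            # paths queued because of SLA callbacks
        self._refresh_queue = deque(file_paths)   # normal parsing rotation
        self._last_parsed = {path: 0.0 for path in file_paths}

    def enqueue_for_callback(self, path):
        self._callback_queue.append(path)

    def next_path(self):
        now = time.monotonic()
        # Files that have waited past the deadline take priority over callback spam.
        stale = [p for p, t in self._last_parsed.items() if now - t > REFRESH_DEADLINE_SECS]
        if stale:
            path = stale[0]
        elif self._callback_queue:
            path = self._callback_queue.popleft()
        elif self._refresh_queue:
            path = self._refresh_queue.popleft()
            self._refresh_queue.append(path)  # keep rotating through the known files
        else:
            return None
        self._last_parsed[path] = now
        return path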
As mentioned before, I will probably have an opportunity to look at this in October maybe. But no promises - if you can come up with a good solution before then, go for it. I'm happy to weigh in on PRs.
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version
2.6.3
What happened
First and foremost, I understand that the current posture is that the existing SLA mechanism is buggy and is being heavily refactored, but I still wanted to call out this regression in case it's resolvable by someone with a bit more knowledge of how this might've broken.
DAGs within Zip files (packaged DAGs, as defined here: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#packaging-dags) appear to never fire SLA Miss events in recent versions of Airflow (2.5.1+).
Our team recently upgraded from Airflow 2.5.1 to 2.6.3 and noticed that SLA misses were not being recorded.
What you think should happen instead
Airflow should treat packaged DAGs the same as non-zipped DAGs. SLA Miss records should be generated regardless of whether the DAG is in a zip file or not.
Given two DAGs with the exact same definition (aside from name) with one being zipped and the other not being zipped - we would expect both to fire off exactly the same number of SLA Miss events.
How to reproduce
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    'sla_test_zipped',  # or 'sla_test_unzipped'
    schedule_interval='*/2 * * * *',  # every 2 minutes
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    # Stand-in task: any task with an sla set will do
    BashOperator(
        task_id='sleep',
        bash_command='sleep 30',
        sla=timedelta(seconds=10),
    )
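For the zipped case, the same file then needs to be packaged at the root of a zip archive, e.g. along these lines (file names here are assumed to match the example above):

import zipfile

# Put the .py at the root of the archive so Airflow treats it as a packaged DAG,
# then drop the resulting zip into the dags folder alongside the unzipped copy.
with zipfile.ZipFile("sla_test_zipped.zip", "w") as zf:
    zf.write("sla_test_zipped.py", arcname="sla_test_zipped.py")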
Versions of Apache Airflow Providers
apache-airflow-providers-common-sql==1.3.4
apache-airflow-providers-ftp==3.3.1
apache-airflow-providers-http==4.2.0
apache-airflow-providers-imap==3.1.1
apache-airflow-providers-sqlite==3.3.1