tatiana opened 1 month ago
Note that this only happens if the DAG is not already parsed by the DAG processor. Once parsed, the `DatasetAlias` will have an entry in the database, and the DAG will run as expected.
Running this with a plain `Dataset` under the same condition also produces an unexpected result:

```python
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

with DAG("dataset_dag", ...):
    # task_id added for completeness; BashOperator requires one
    BashOperator(task_id="producer", bash_command=":", outlets=Dataset("some_dataset"))
```
The task will succeed, but no event is emitted.
```console
$ airflow dags test dataset_dag
...
[2024-09-26T09:50:38.635+0000] {manager.py:96} WARNING - DatasetModel Dataset(uri='some_dataset', extra=None) not found
...
```
Ultimately this is because `airflow dags test` is essentially out-of-band execution that does not involve the normal DAG parsing and scheduling. Dataset event emission is implemented entirely around the database, and thus does not work well in this mode.
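For context, the warning in the output above comes from a database lookup in the dataset manager; a paraphrased sketch of that logic (not the actual Airflow source, and the body is abbreviated) would look like:

```python
# Paraphrased sketch, not the actual Airflow source: emitting a dataset event
# requires an existing DatasetModel row, and only normal DAG parsing creates
# that row. Without it, the manager logs a warning and bails out.
from airflow.models.dataset import DatasetModel

def register_dataset_change(session, dataset, log):
    dataset_model = (
        session.query(DatasetModel)
        .filter(DatasetModel.uri == dataset.uri)
        .one_or_none()
    )
    if dataset_model is None:
        # The "DatasetModel ... not found" warning from manager.py shown above.
        log.warning("DatasetModel %s not found", dataset)
        return None  # no DatasetEvent row is created, so nothing triggers downstream
    ...  # otherwise, create a DatasetEvent bound to dataset_model
```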
I think we should decide how this entire function should be designed going forward. Since the situation is somewhat similar to `airflow dags backfill`, maybe we should make this also go through the scheduler? (Basically, make the CLI behave like triggering a manual run in the web UI.) This is a relatively large undertaking though. What should we do for 2.x?
I thought about this a bit and arrived at the conclusion that `test` should simply act entirely differently from other forms of execution: Airflow already provides a way to run a DAG outside of its schedule with manual runs (via the web UI or `airflow dags trigger`), and `test` (either from the CLI as in this issue, or `dag.test()` in a Python file) should do something different to be meaningful.
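For reference, the `dag.test()` form mentioned here runs all of a DAG's tasks in a single process; a minimal usage sketch (assuming `dag` is a DAG object defined earlier in the same file) is:

```python
# Minimal sketch: run the DAG once by executing the DAG file directly,
# e.g. `python dataset_dag.py`. Assumes `dag` was defined above in this file.
if __name__ == "__main__":
    dag.test()
```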
Judging by the word `test`, I'm thinking this operation should further differ from `trigger` in that the result should be completely ephemeral and idempotent (as long as the tasks themselves are). This means the test run should leave no trace in the database after execution, including run records, task logs, etc. (I believe this is already almost the case.) Equally, outlets defined in tasks should not leave a DatasetEvent entry in the database, and therefore should not trigger downstream DAGs.
Does this sound reasonable?
@uranusjr It seems sensible for `dags test` not to leave traces in the database after it is run. Another option could be to give users a configuration option and allow them to run a cleanup task as they please.
That said, I believe it is essential that people can use this command to somehow validate:

- `Dataset` being set as outlets/inlets of the desired tasks/DAGs?
- `DatasetAliases` being set as outlets/inlets of the desired tasks/DAGs?

These two areas are particularly important when we take into account dynamic DAG generation tools; one way to check the first point today is sketched after this paragraph.
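For illustration, the first kind of validation is already possible without touching the database, by asserting on the parsed DAG object; a minimal sketch (the DAG id `dataset_dag` and task id `producer` are assumptions carried over from the example above):

```python
# Hypothetical pytest-style check; "dataset_dag" and "producer" are assumed
# names from the example above, not fixed Airflow identifiers.
from airflow.models import DagBag

def test_producer_declares_outlet():
    dag = DagBag(include_examples=False).get_dag("dataset_dag")
    task = dag.get_task("producer")
    # Each outlet is a Dataset (or alias) object; check the declared URI.
    assert any(getattr(o, "uri", None) == "some_dataset" for o in task.outlets)
```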
Since the command is erroring, I understand that we cannot currently do this.
Currently, users of `dags test` can already:
If we want datasets/dataset aliases/assets to be first-class citizens in Airflow, we must also have a way of testing them efficiently.
Agreed on the next steps with @uranusjr:

- `test_mode` config (https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/use-test-config.html); a sketch of this setting follows the list.
- As a mid/long-term task, we must rethink how we want `dags test` to work. @ashb suggested that the command write to a "test" database. Before we close this ticket, we should have a proposed plan for this longer term as a GitHub ticket.
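For reference, the test-config how-to linked above is driven by the `unit_test_mode` core setting; a minimal sketch of enabling it (in `airflow.cfg`, or via the `AIRFLOW__CORE__UNIT_TEST_MODE` environment variable):

```ini
# airflow.cfg (illustrative): unit test mode replaces several settings with
# test-friendly defaults, per the how-to guide linked above.
[core]
unit_test_mode = True
```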
Apache Airflow version
2.10.0, 2.10.1, 2.10.2
If "Other Airflow 2 version" selected, which one?
No response
What happened?
Given the DAG:
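The exact DAG from the report is not reproduced here; based on the discussion above, a representative sketch (all names illustrative) would use a `DatasetAlias` outlet:

```python
# Representative sketch only; the reporter's exact DAG is not shown. Per the
# discussion above, the key ingredient is a DatasetAlias used as a task outlet.
from airflow import DAG
from airflow.datasets import DatasetAlias
from airflow.operators.bash import BashOperator

with DAG("dataset_alias_dag", ...):
    BashOperator(
        task_id="producer",
        bash_command=":",
        outlets=[DatasetAlias("some_alias")],
    )
```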
When I try to run:
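Presumably the same form as in the comments above, with the reporter's DAG id (placeholder shown):

```console
$ airflow dags test <dag_id>
```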
I get the error:
This DAG successfully executes when not being triggered via the `dags test` command.
What you think should happen instead?
I should be able to run `dags test` for this DAG without seeing this error message.
How to reproduce
Already described.
Operating System
Any
Versions of Apache Airflow Providers
No response
Deployment
Other
Deployment details
It is not happening during deployment (tested in Astronomer, and it worked fine). The issue happens when running the `airflow dags test` command locally.
Anything else?
No response
Are you willing to submit PR?
Code of Conduct