Open tatiana opened 1 month ago
cc @cmarteepants
Thanks, @tatiana. We recently had dag-factory suggested to us, so I've been investigating it this weekend. Like you say, if we can't have those details surfaced, it makes it difficult to troubleshoot anything and that'll be a hard sell to our support engineers.
In a perfect world, the generated DAG code would be presented in the Code view (or custom tab), but even having the YAML accessible in some form would be great.
It's a workaround and could get messy at scale, but I've got the YAML going into doc_md and presenting in the DAG Docs..
Thanks @rory-data, that makes sense for a workaround. What I'm hearing is that in a perfect world, you would see the serialized DAG in the code view versus the contents of the yaml file. Is that right?
@rory-data I can fully relate. @cmarteepants is helping us prioritise this feature; it's high on our board, and we'll keep this thread up-to-date as it is picked up/developed.
@cmarteepants I would think so from a consistent or predictable experience perspective for support engineers. I do understand that it's not catered for in Airflow (Ash mentioned in Slack that it could be in AIP-85), so it may be a case of having the YAML presented in a better place than DAG Docs and using tagging to indicate a given DAG has been generated. That's just my two cents though 😅
Perhaps we can have a strategy in the short-term and one in the long-term, WDYT @cmarteepants ?
@rory-data how has been the experience with doc_md so far? What were the main advantages and drawbacks of this approach..?
@cmarteepants, do you think doc_md
is a better or worse solution than templated fields? Are they complementary? I just realised that since 2.9, we have also introduced a cap to templated fields in Airflow (which can be configured; still, some users may face issues without realising it).
I like the idea of tagging DAGs created by dag-factory - I think we should certainly do this.
I admittedly only came up with the idea in the weekend just gone as part of our initial investigations into using DAG Factory, so it's not being used in any live processes.
However, I can see that a large YAML input could be potentially unwieldy in the DAG Docs field from a UX perspective with it being an expandable section high in the DAG UI. The way I've currently got it also means it would overwrite any user-supplied doc_md values, so I need to adjust it to append instead.
It needs polish, but FWIW, this is what I did for the example_dag_factory.py script in the examples directory.
"""
Airflow currently cannot serialise a DAG generated by DagFactory into the Code view of
the UI (this may be resolved in AIP-85).
As a workaround, the below change to the process of generating DAGs will render the YAML
into a markdown string and add it to the doc_md field so it is viewable in the UI DAG
Docs.
"""
import os
from pathlib import Path
from typing import Any, Dict
import yaml
from airflow import DAG ## by default, this is needed for the dagbag to parse this file
import dagfactory # type: ignore
DEFAULT_CONFIG_ROOT_DIR: str = "/usr/local/airflow/dags/configs"
CONFIG_ROOT_DIR: Path = Path(os.getenv("CONFIG_ROOT_DIR", DEFAULT_CONFIG_ROOT_DIR))
def serialise_config_md(key: str, value: Dict[str, Any]) -> str:
"""
Serialises a given key and its corresponding dictionary value into a formatted
YAML string enclosed in Markdown code block syntax.
Args:
key (str): The key to be serialized.
value (Dict[str, Any]): The dictionary value associated with the key.
Returns:
str: A string containing the YAML representation of the key-value pair,
formatted within a Markdown code block.
"""
yaml_string = yaml.dump({key: value}, default_flow_style=False, sort_keys=False)
return f"```yaml\n{yaml_string}\n```"
config_path: str = str(CONFIG_ROOT_DIR / "example_dag_factory.yml")
with open(config_path, encoding="utf8") as file:
config_data: Dict[str, Any] = yaml.safe_load(file)
for dag_id, dag_details in config_data.items():
# for each DAG in the config, add the YAML to the doc_md field
if dag_details.get("doc_md"):
# check if doc_md already exists and append to it
dag_details["doc_md"] += "\n" + serialise_config_md(dag_id, dag_details)
else:
# if doc_md does not exist, create it
dag_details["doc_md"] = serialise_config_md(dag_id, dag_details)
dag_factory = dagfactory.DagFactory(config=config_data)
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
@cmarteepants, do you think doc_md is a better or worse solution than templated fields? Are they complementary? I just realised that since 2.9, we have also introduced a cap to https://github.com/apache/airflow/pull/38094 in Airflow (which can be configured; still, some users may face issues without realising it).
Going off of memory, but cap for templated fields was introduced as we were seeing a lot of DB perf issues. One of the causes was because of the issue that you linked, but there were others. I don't love the idea of appending the yaml contents to dag docs either, but it's a fair short-term workaround and more robust solution compared to using templated fields.
It's a weird predicament: If we do add a view via a UI plugin, I'd rather we do it in react and in context of the new Airflow UI (and we will for AF3), not with FAB. I'm also not a fan of making a config for this and have DAG Factory handle appending this to docs if it's only a stop-gap, but I am open to it. @rory-data - is it reasonable to document your solution and have people add it to the dag generator and do this properly for Airflow 3? Or do you feel strongly that we should have a config to add the yaml to your dag docs?
@cmarteepants I'd be fine with that. As you (and Ash) have pointed out, there are constraints in Airflow 2 that make this difficult, so documenting it as an optional workaround for now at least gives folk a choice. Having a specific config built would make it more consistent and easier for users, but is it a tactical fix that would be obsoleted in AF3 so not worth the effort, or do you need to have something robust for those who'll stay on AF2?
If we have DAG Factory itself append the yaml contents to DAG docs, it would be a tactical fix that realistically would become obsolete in AF3. I don't know if we'd go as far as officially deprecating it though; it's more that I can't see anybody wanting this behaviour if we're properly showing you the rendered yaml as a first class feature. Great point about having something more robust than an example for those staying on AF2 for awhile.
Ok, I'm convinced. I can't imagine it being a lot of effort for us to do this. @tatiana only callout I have that we should only display the yaml contents that are relevant to that DAG (many DAGs can be defined in a single yaml). We should also include anything found in the default section
@rory-data we'll target the next release, which will be sometime near the end of December. In the meantime, are you ok with your current workaround?
@cmarteepants yes, we'll be good with that, thanks.
Context
Even though the YAML is the source of truth for the DAG topology and the operators/tasks configuration, we cannot see those in the Airflow UI. This can make it harder to troubleshoot and confirm the behaviours are as expected or that the expected version was deployed.
Acceptance criteria