CityofToronto / bdit_data-sources

Data sources used by the Big Data Innovation Team
https://github.com/orgs/CityofToronto/teams/bigdatainnovationteam
GNU General Public License v3.0
36 stars 8 forks source link

Standardizing airflow doc_md #769

Open chmnata opened 7 months ago

chmnata commented 7 months ago

We can now create Markdown-based DAG documentation that appears in the Airflow UI, we should standardize them across all of our dags. https://docs.astronomer.io/learn/custom-airflow-ui-docs-tutorial

For Dag level doc, proposing:

### [Frequency] [data source] [verb] 
One liner of what this DAG does.
If ETL: 
Extracted from:
Transformation/Data Quality Checks:
Loads to: [schema.table][database]

example:

### Daily HERE pull
This DAG runs daily to pull here data from traffic analytics' API to here.ta in 
the bigdata database using Taskflow. Slack notifications is raised when the airflow 
process fails.

Extracted from: HERE API
Transformation/Data Quality Checks: None
Loads to: here.ta (bigdata)

For task level doc:

### Name of the task
One liner of what this task does, if functions were used, list out functions

example:

### Task load_data_run 

This task uses BashOperator to load data by `curl` the download_url retrieved 
from the previous task (get_download_link), gunzip the file and directly pipes 
to here.ta_view_path` using psql. 
### Task send_request

This task sends a request to the HERE API and returns an access token to be used in the following tasks.
Function used:
get_access_token(pwd, client_secret, token_url)
chmnata commented 7 months ago

What do yall think? Feel free to modify and comment! @tahaislam @radumas @gabrielwol

tahaislam commented 7 months ago

That looks pretty good. We need to note that some DAGs loads more than just one dataset into different tables. Also, some DAGs are externally triggered, i.e., their frequency depends on the external trigger frequency. We can add this at the end of the description.

Another suggestion for DAGs documentation, I added the DAG description (in some DAGs) at the top of the file as the docstring of the file and then assigned it to the DAG markdown doc using doc_md=__doc__. This would also help us while working directly on the code. What do you think?

https://github.com/CityofToronto/bdit_data-sources/blob/d4d7cd84bb12bad6c72a3af026b70d95c178bfaf/dags/pull_counts.py#L4C1-L30C4

https://github.com/CityofToronto/bdit_data-sources/blob/d4d7cd84bb12bad6c72a3af026b70d95c178bfaf/dags/pull_counts.py#L66

radumas commented 7 months ago

would there be a way to ensure that the doc_md is somehow also synced with the README?

Could there be a way to load it from the README.md 🤔

gabrielwol commented 6 months ago

would there be a way to ensure that the doc_md is somehow also synced with the README?

Could there be a way to load it from the README.md 🤔

Figured out a solution using regex!

image

doc_md_path = os.path.join(repo_path, 'volumes/vds/readme.md')
contents = open(doc_md_path, 'r').read()
doc_md_regex = '(?<=### vds_pull_vdsvehicledata DAG \n)[\s\S]+(?=#{1,3} )'
DOC_MD = re.findall(doc_md_regex, contents)[0]
radumas commented 6 months ago

So that's something like "starts with '### vds_pull_vdsvehicledata DAG' and then everything until the next heading"?

gabrielwol commented 6 months ago

Exactly! Feels like a manageable amount of regex to figure out per DAG. Happy to set it up initially.

radumas commented 6 months ago

we could hide a key in html comments (like to turn an entry off for the table of contents <!-- omit in toc --> ) that we wrap around the doc_md portion of the README to standardize the code. e.g.:


<!-- #doc_md -->
### vds_pull_vdsvehicledata DAG

and so on ...

<!-- #doc_md -->
gabrielwol commented 6 months ago

Done! https://github.com/CityofToronto/bdit_data-sources/commit/bf62b703b9e2a589e0189abfa00fa66d06fb7745