datamade / how-to

📚 Doing all sorts of things, the DataMade way

Capture Airflow knowledge #88

Closed by hancush 3 years ago

hancush commented 4 years ago

Documentation request

We're using Airflow to run the Metro scrapers and subsequent ETL! Let's capture this tool as part of our stack, and document lessons learned and any snake pits.

hancush commented 3 years ago

This is high value documentation that I'd like to capture before November. I think a divide and conquer approach will be most expedient. Some questions I would find useful to answer:

  1. Why Airflow over other strategies? @hancush
  2. What are the basic concepts of Airflow? (Pointers to existing documentation would be great.) @fatima3558
  3. Did we change any settings from their defaults? Why? (Was it specific to the Metro dashboard, or would it be good to change the default for every instance?) @fatima3558
  4. How do we deploy Airflow? If it's not Heroku, what were the sticking points? @jeancochrane
  5. What is our strategy for running DAGs? (What operator do we use? How do we manage dependencies?) @hancush
  6. Debugging Airflow @jeancochrane

I can head up 1 and 5. @fatima3558, do you think you could pick up 2 and 3? And @jeancochrane, could you do 4?

Also, are there other learnings we want to capture?

jeancochrane commented 3 years ago

I would add "Debugging Airflow" to this list. Happy to pick this one up along with 4. I suspect @fatima3558 may not have the context for 3 so I'd be fine picking that one up, although if Fatima can tackle it I'd be glad to see someone else do it too.

fgomez828 commented 3 years ago

I can definitely pick up 2, and I can give 3 a try as well!

hancush commented 3 years ago

Cool! I've updated my original comment to add debugging and assign everyone to their respective sections.

fgomez828 commented 3 years ago

I've given no. 3 a shot, but I doubt what I've written in the Google Doc covers everything we changed from the default settings. I'm willing to research the other defaults we changed for the dashboard and fill out that section of the doc further.

hancush commented 3 years ago

I've drafted 1. and 5. in the doc, as well!

jeancochrane commented 3 years ago

Added 4 and 6! I think we're ready for a PR.

hancush commented 3 years ago

Converted the doc to Markdown and submitted a PR: https://github.com/datamade/how-to/pull/128. I think it would be good for the three of us to review the sections we didn't write before we submit this for final review.

hancush commented 3 years ago

I will open a separate issue for documenting learnings from the Metro dashboard reorg.