datamade / how-to

📚 Doing all sorts of things, the DataMade way
MIT License

document github actions for scraping #212

Closed fgregg closed 1 year ago

fgregg commented 3 years ago

I've been experimenting with GitHub Actions as a scraping platform and it's been really good.

example repos

twitter thread about it

i mentioned i was playing with this to @hancush, and she said it would be good to chat about it here.

@hancush, what would be a good next step?

hancush commented 3 years ago

Next step here is to ID a project with, e.g., a nightly scrape and develop a proof of concept for running that scrape with GitHub Actions. Some ideas include the Lugar scrape or ~CPS (if it still does a nightly sync)~.

hancush commented 3 years ago

Probably want to prioritize an app deployed on Heroku?

hancush commented 3 years ago

How to determine whether to use GitHub Actions or Heroku scheduler?

hancush commented 2 years ago

We've used GitHub Actions for this on a lot of projects, both DataMade and personal. At this point, we are considering it as a potential alternative to Airflow, as it gives so many of the same upsides without the app overhead.

In general, we like it! Some downsides include pricing for private repos (#270) and imprecise cron runs. But the upsides are an awesome interface well integrated with version control and super simple configuration.

hancush commented 2 years ago

@smcalilly recommends AWS Step Functions as a cheaper alternative to explore.

smcalilly commented 2 years ago

Just to write down what I said in R&D, I don't think Step Functions would be a good solution for integrating with GitHub Actions. They would be most useful if we have a long-running (15+ minutes, which is the Lambda timeout limit), multi-step task where we want to keep the data and code private. They're real nice for orchestrating tasks and creating a state machine with Lambdas, and you have AWS APIs at your fingertips (including CloudWatch for observability). You basically connect a series of Lambdas together as a state machine, and you can use YAML with the Serverless Framework to configure and provision any AWS resources you need.
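
For anyone who hasn't seen one, here's a rough sketch of what "connecting a series of Lambdas together as a state machine" can look like. It builds an Amazon States Language definition as a plain Python dict, with each step as a `Task` state pointing at the next. The step names (`Scrape`, `Transform`, `Load`), the ARNs, and the `build_state_machine` helper are all made up for illustration; this isn't taken from any DataMade project, and in practice the same definition would more likely live in YAML config.

```python
import json


def build_state_machine(steps):
    """Chain (state_name, lambda_arn) pairs into a linear state machine.

    Returns a dict in Amazon States Language shape: every step is a Task
    state, each one names its successor via "Next", and the last one
    terminates the execution with "End": True.
    """
    if not steps:
        raise ValueError("at least one step is required")

    states = {}
    for index, (name, arn) in enumerate(steps):
        state = {"Type": "Task", "Resource": arn}
        if index + 1 < len(steps):
            # Hand off to the following step.
            state["Next"] = steps[index + 1][0]
        else:
            # Final step: mark the end of the execution.
            state["End"] = True
        states[name] = state

    return {"StartAt": steps[0][0], "States": states}


# Hypothetical step names and ARNs, for illustration only.
STEPS = [
    ("Scrape", "arn:aws:lambda:us-east-1:123456789012:function:scrape"),
    ("Transform", "arn:aws:lambda:us-east-1:123456789012:function:transform"),
    ("Load", "arn:aws:lambda:us-east-1:123456789012:function:load"),
]

definition = build_state_machine(STEPS)
print(json.dumps(definition, indent=2))
```

Each step runs as its own Lambda invocation, so the 15-minute limit applies per step rather than to the whole task.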

fgregg commented 1 year ago

I think I'm ready to push for this approach. @hancush, is the next step to write a stack change proposal document?

hancush commented 1 year ago

@fgregg That's correct!

fgregg commented 1 year ago

here's the type of doc to write as reference: https://github.com/datamade/how-to/blob/56087d662a3081c8e6189393378eec978eed060c/django/wagtail/research/recommendation-of-adoption.md