climatescape / climatescape.org

Discover the organizations solving climate change
https://climatescape.org/

Epic: automated data scraping pipeline #40

Closed leventov closed 4 years ago

leventov commented 4 years ago

Goals:

  1. Rank the organizations to add/review based on some importance indicators: public/private, market cap, assets under management, headcount, Facebook followers, Twitter followers, etc. (whatever we can scrape automatically); see the rough scoring sketch after the plan below
  2. Automatically refresh the org entries stored in Airtable, in background

Plan:

  1. What is the architecture of the project? On what hosting does it run? (ask @bloudermilk). Describe this in Notion (@bloudermilk / @leventov)
  2. Determine what scheduling/execution facilities can be used there to set up automatic tasks, or if none, what third-party automation/serverless service we can employ (or, perhaps, that we should go all-in for a hosting or a cloud account). (@bloudermilk / @leventov)
  3. Research the Airtable API for hooks on entry addition, so that we can schedule a scraping job that produces an initial rank for the org and places it in priority order in the content queue. Also research how to update data and tables via the Airtable API. (@leventov)
  4. Based on the above info gathered, create a more detailed design for the automation pipeline. (@leventov)
  5. Implement the part of initial info scraping with one source, Twitter followers. (@leventov)
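
As a rough illustration of goal 1 (and a place to plug in the Twitter follower counts from step 5), a naive scoring function might look like the sketch below. The field names and weights are placeholders, not a settled formula:

```js
// Hypothetical "Climatescape rank" helper: combines whatever indicators we
// manage to scrape into a single score. Field names and weights are
// placeholders, not a settled formula.
function climatescapeRank(org) {
  const log = (x) => (x > 0 ? Math.log10(x) : 0)

  return (
    2 * log(org.twitterFollowers || 0) +
    1 * log(org.facebookFollowers || 0) +
    1 * log(org.headcount || 0) +
    3 * log(org.marketCapUsd || org.assetsUnderManagementUsd || 0) +
    (org.isPublic ? 1 : 0)
  )
}

// Example: a fresh submission where only Twitter data has been scraped so far
console.log(climatescapeRank({ twitterFollowers: 12000, isPublic: false }))
```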
bloudermilk commented 4 years ago

@leventov great start outlining this epic! I will put in some work documenting the current architecture and would love to chat with you about the best options for expanding upon that based on the requirements of this component.

> Research Airtable API for hooks

Airtable doesn't have hooks natively, but we have two options here:

There is also one prerequisite step in the pipeline that would enable the others, which is Crunchbase ODM integration. Right now we're doing this in bulk using a Python script I wrote, but we could also include it in this pipeline using their HTTP API. It would be pretty amazing actually, enabling us to take a user submission with only a name and domain and fetch everything else automatically!

leventov commented 4 years ago

So Netlify, which we use for CI and website hosting, also supports serverless functions. I will explore whether they have scheduling options (apart from merely reacting to activity on the website, such as somebody submitting a new org). If scheduling is possible, I think we should definitely use Netlify functions to automate data scraping.
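
For context, Netlify functions are Lambda-style handlers, so a submission-triggered function might look roughly like this sketch (the file name and payload shape are assumptions):

```js
// functions/org-submitted.js (file name is illustrative)
// Netlify functions are Lambda-style handlers; this one would be invoked when
// a new org is submitted and is where a scraping job could be triggered.
exports.handler = async (event) => {
  if (event.httpMethod !== "POST") {
    return { statusCode: 405, body: "Method Not Allowed" }
  }
  const { orgId, url } = JSON.parse(event.body || "{}")
  // TODO: kick off the scraping job for { orgId, url } here
  return { statusCode: 202, body: JSON.stringify({ queued: true, orgId, url }) }
}
```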

I will also research Zapier and other hook options.

bloudermilk commented 4 years ago

@leventov I'm looking at Netlify and Zapier pricing now. It looks like we'll be fine for Netlify with the free tier (125k req/mo for functions, 100hrs runtime) and Zapier could work (100 tasks/mo) if we use their scheduler on a daily basis. We actually use Zapier right now to deploy daily.

bloudermilk commented 4 years ago

@leventov I'm going to write up our architecture docs now, but wanted to share one thought I had...

I think there's a strong case for avoiding Airtable for storing the enrichment data, such as Twitter profiles. A few reasons for this:

I would be happy to pay $9/mo for Heroku's Hobby Postgres tier. I've used it on many projects and it's rock solid, fast, and zero-maintenance. It would give us a lot more flexibility to process all the enrichment data quickly without having to worry about the above limitations.

The one exception would be Crunchbase data, which is critical to our content workflow and already stored in Airtable.

bloudermilk commented 4 years ago

@leventov FYI I included the architecture writeup in our README:

https://github.com/climatescape/climatescape.org#architecture

leventov commented 4 years ago

After researching Netlify functions I actually lean toward not using them:

I'm thinking about using Apify:

When somebody submits an entry through the website, we can use a Netlify function to kick off an Apify scraping job via the Apify API.
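
For illustration, kicking off an actor run from that Netlify function would be a single call to Apify's run-actor endpoint; the actor ID, input shape, and env var names below are assumptions:

```js
// Netlify function helper: start an Apify actor run for a newly submitted org.
// The actor ID, input shape, and env var names are assumptions.
const fetch = require("node-fetch")

async function startScrapeRun(orgId, url) {
  const actorId = process.env.APIFY_ACTOR_ID
  const token = process.env.APIFY_TOKEN
  const res = await fetch(
    `https://api.apify.com/v2/acts/${actorId}/runs?token=${token}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ orgId, url }), // passed to the actor as its input
    }
  )
  if (!res.ok) throw new Error(`Apify run failed: ${res.status}`)
  return res.json() // includes the run ID, which we could store for later inspection
}
```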

Scheduling can be done via either Zapier or Apify itself.

Drawbacks of Apify:

I think these two are important drawbacks, but it doesn't seem to me that Netlify functions will fit the bill.

In terms of pricing, Apify's $49/mo plan should be sufficient, I think (or maybe even the free one). I don't think this is important; for now I can pay for it myself. In the future, if we set up Patreon or something like that for Climatescape, I think it won't be a problem.

bloudermilk commented 4 years ago

@leventov thanks for looking into Apify, it's super interesting. Here are my thoughts:

A simple NodeJS API fetcher + Postgres datastore running on Heroku makes the most sense to me, but that could be because that's the stack I'm most comfortable with personally. We could host it for a few bucks a month and it would be super flexible for other projects in the future.

Thoughts?

leventov commented 4 years ago

I'm concerned about IP detection/rate limiting and other anti-scraping protections on the websites. If we bump into some of them with a custom solution on top of bare Node.js, it might take significant effort or added complexity to overcome them. Apify helps hedge this risk.

I will research whether there are other services, like ScraperApi, that help deal with this complexity in more of a library than a framework fashion.

bloudermilk commented 4 years ago

@leventov the APIs we're interested in leveraging are designed to be heavily queried for use cases exactly like ours. For example, Twitter's users/show has a limit of 900 requests/15 minutes. Google's Geocoder, another endpoint we're interested in, has a limit of 50 requests/second. Crunchbase doesn't publish rate limits, but I'd guess they're generous; if not, we can use their CSV export like we do manually today.
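
For reference, a direct call to that users/show endpoint is only a few lines of Node; the env var name and error handling below are illustrative:

```js
// Direct call to Twitter's users/show (v1.1) to get a follower count.
// Env var name and error handling are illustrative.
const fetch = require("node-fetch")

async function getTwitterFollowers(screenName) {
  const res = await fetch(
    `https://api.twitter.com/1.1/users/show.json?screen_name=${encodeURIComponent(screenName)}`,
    { headers: { Authorization: `Bearer ${process.env.TWITTER_BEARER_TOKEN}` } }
  )
  if (res.status === 429) {
    throw new Error("Rate limited; retry after the 15-minute window resets")
  }
  if (!res.ok) throw new Error(`Twitter API error: ${res.status}`)
  const user = await res.json()
  return user.followers_count
}
```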

Some websites (e.g. LinkedIn) do have anti-scraping protection and I think one of these solutions would be great for them. But for anyone who has an open API, using it directly is the preferred method.

leventov commented 4 years ago

@bloudermilk I now agree with a Heroku app and a DB as the path forward, considering that the backend will likely gain some complexity over time with ML or other stuff. If we stumble upon scraping problems, I think we should be able to circumvent them with ScraperApi at the $29/mo pricing level.
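
If we do hit protections, that fallback would just mean routing the problematic request through ScraperApi's proxy endpoint; the exact query parameters in the sketch below are assumptions to double-check against their docs:

```js
// Possible fallback for sites with anti-scraping protection: route the request
// through ScraperApi's proxy endpoint instead of fetching directly. The query
// parameters here are assumptions to verify against their docs.
const fetch = require("node-fetch")

async function fetchViaScraperApi(targetUrl) {
  const proxied = `http://api.scraperapi.com/?api_key=${process.env.SCRAPERAPI_KEY}&url=${encodeURIComponent(targetUrl)}`
  const res = await fetch(proxied)
  if (!res.ok) throw new Error(`ScraperApi request failed: ${res.status}`)
  return res.text()
}
```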

Do you want to reuse your existing Heroku account, or should I create a new one?

bloudermilk commented 4 years ago

@leventov awesome. I'll set up a Heroku team and invite you to it!

leventov commented 4 years ago

I'm currently working on the following design:

bloudermilk commented 4 years ago

@leventov I think AMQP is overkill for this process. Given the nature of the job at hand, we'd be just as well served by a process that polls at some interval, with logic along the lines of the sketch below:
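
A minimal sketch of that polling approach (the interval, table, and column names are illustrative, not a settled schema):

```js
// Minimal polling worker: every few minutes, pick orgs whose enrichment data
// is missing or stale, scrape them, and append the results as new rows.
// The interval, table, and column names are illustrative.
const { Pool } = require("pg")
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

async function scrapeAndStore(org) {
  // placeholder: fetch the APIs we care about and INSERT new result rows
  console.log("would scrape", org.id, org.homepage)
}

async function pollOnce() {
  const { rows } = await pool.query(
    `SELECT id, homepage FROM organizations
     WHERE last_scraped_at IS NULL OR last_scraped_at < now() - interval '7 days'
     ORDER BY last_scraped_at NULLS FIRST
     LIMIT 10`
  )
  for (const org of rows) {
    await scrapeAndStore(org)
  }
}

setInterval(() => pollOnce().catch(console.error), 5 * 60 * 1000)
```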

Your suggestion to always store the results as new rows is a great call. It will be nice to have the historical record here and avoid UPDATEs.

I would like to see the key information we're actually looking for synthesized into another table, be it a VIEW or a normal table. We should be able to access this reduced information from Gatsby directly via some API (REST or GraphQL) or a CSV export. I don't think we need/want to push this back to Airtable.
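
One hedged way to build that reduced layer on top of append-only result rows is a Postgres VIEW that keeps only the latest scraped row per organization; the table and column names here are guesses at the eventual schema:

```js
// One way to build the "reduced" layer on top of append-only result rows:
// a Postgres VIEW that keeps only the latest scraped row per organization.
// Table and column names are guesses at the eventual schema.
const { Pool } = require("pg")
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

const createLatestView = `
  CREATE OR REPLACE VIEW org_enrichment_latest AS
  SELECT DISTINCT ON (org_id)
         org_id, twitter_followers, scraped_at
  FROM   twitter_scrape_results
  ORDER  BY org_id, scraped_at DESC;
`

pool.query(createLatestView).then(() => pool.end())
```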

I'd really like to avoid having to maintain an AMQP deployment for this. If I'm missing something and we must use queue logic for some reason, Postgres is more than capable of handling that kind of workflow with transactions. However, I suspect a simple polling mechanism will do the job just fine with much less complexity.

leventov commented 4 years ago

Heroku docs strongly push towards offloading any work dependent on potentially slow third-party requests to a background worker, and scraping is exactly this type of work.

The first target is not periodic scraping of all orgs. The first target is first-time scraping upon addition, so that we can prioritize org submissions for processing (or just future "care" in general) based on the "Climatescape rank". To do this, a Netlify hook will call something like `POST <heroku-url>/scrape {"orgId": ..., "url": ...}` on the Heroku web process, which will then post a job to the queue for background processing.
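
A hedged sketch of that Heroku web piece, with the queue client stubbed out (the route shape follows the POST above; everything else is illustrative):

```js
// Sketch of the Heroku web piece: a tiny Express app that accepts the Netlify
// hook's POST and hands the job to the background queue. enqueueScrape is a
// placeholder for whatever queue client we settle on.
const express = require("express")
const app = express()
app.use(express.json())

async function enqueueScrape(job) {
  console.log("would enqueue", job) // placeholder until the queue is wired up
}

app.post("/scrape", async (req, res) => {
  const { orgId, url } = req.body
  if (!orgId || !url) {
    return res.status(400).json({ error: "orgId and url are required" })
  }
  await enqueueScrape({ orgId, url })
  res.status(202).json({ queued: true })
})

app.listen(process.env.PORT || 3000)
```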

It doesn't seem to me that an AMQP deployment complicates matters much. In some respects it even helps to simplify things, because we don't have to deal with polling periods or priorities (e.g. when a one-time request arrives in the middle of a periodic batch scrape), and there is some observability out of the box (the number of outstanding messages in the queue). So AMQP may even be a net positive in terms of complexity.

leventov commented 4 years ago

If you look at the building blocks Apify provides (which we discussed earlier), they are also a request queue plus relational storage for results. So designing the scraping around a queue should be a good fit.

leventov commented 4 years ago

> I would like to see the key information we're actually looking for synthesized into another table, be it a VIEW or a normal table. We should be able to access this reduced information from Gatsby directly via some API (REST or GraphQL) or a CSV export. I don't think we need/want to push this back to Airtable.

We need to push one field back to Airtable: "Climatescape rank", so that the people who deal with the content can sort the Airtable by it.
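
Pushing that single field back could be a small job at the end of the pipeline, roughly like this sketch using the official airtable package (base, table, and field names are assumptions):

```js
// Small job at the end of the pipeline: write the computed rank back to
// Airtable with the official airtable package. Base, table, and field names
// are assumptions.
const Airtable = require("airtable")

const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY }).base(
  process.env.AIRTABLE_BASE_ID
)

async function pushRank(recordId, rank) {
  await base("Organizations").update(recordId, { "Climatescape Rank": rank })
}
```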

bloudermilk commented 4 years ago

@leventov thanks for the additional details.

To clarify, I'm also advocating for a background worker–just trying to keep things simple. Mainly I'm hesitant to introduce technology I'm not intimately comfortable with, since it's ultimately up to me to maintain this. But I also think picking the simplest tool that can get the job done is generally the best strategy when designing these kinds of systems.

I see where you're coming from with the queue model and agree this is the "right way" to structure jobs like this. I guess I was hoping we could knock out an MVP that worked w/o additional infrastructure, but I think you're right to want to include it. I know there are many queue libraries for JS that use a backing data store to manage jobs/retries/etc.

If we're going the queue route, I would strongly prefer to back it with a datastore I'm personally comfortable maintaining. That would be either Postgres or Redis, both of which make a fine choice for a queue with the kind of volume we're looking for. If we manage to keep it in Postgres that gives us one less service to depend on, onboard, and pay for.

I'd like to keep working on this design with you. Maybe we can catch up on a call this week?

leventov commented 4 years ago

Yeah, the storage behind the queue doesn't matter that much, and adding a new service into the mix is definitely not free (I've already started to feel this, having some difficulties setting up RabbitMQ in Docker Compose for local dev), so I'll try to use https://github.com/timgit/pg-boss instead (PostgreSQL-based queues). Not https://github.com/graphile/worker, because I don't see that it supports multiple queues, while https://github.com/timgit/pg-boss does.
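
A minimal pg-boss wiring with two named queues might look like the sketch below; the queue names and payload shape are placeholders, and newer pg-boss versions rename publish/subscribe to send/work:

```js
// Minimal pg-boss wiring with two named queues. Queue names and payload shape
// are placeholders; newer pg-boss versions rename publish/subscribe to send/work.
const PgBoss = require("pg-boss")

async function main() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()

  // Consumer side: one handler per queue
  await boss.subscribe("scrape-initial", async (job) => {
    console.log("initial scrape for", job.data.orgId)
  })
  await boss.subscribe("scrape-periodic", async (job) => {
    console.log("periodic refresh for", job.data.orgId)
  })

  // Producer side, e.g. called from the POST /scrape handler
  await boss.publish("scrape-initial", { orgId: "rec123", url: "https://example.com" })
}

main().catch(console.error)
```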

bloudermilk commented 4 years ago

@leventov pg-boss looks like a great choice–nice find!

leventov commented 4 years ago

The next goal I'm working on as part of this is figuring out how to make Netlify/Gatsby connect to the PostgreSQL database in Heroku and pull the scraped data during website generation.
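
A hedged sketch of what that could look like in gatsby-node.js, sourcing nodes from the Heroku Postgres at build time (the connection string, query, and node type are assumptions):

```js
// gatsby-node.js sketch: pull scraped data from the Heroku Postgres at build
// time and expose it to GraphQL as nodes. Connection string, query, and node
// type are assumptions.
const { Pool } = require("pg")

exports.sourceNodes = async ({ actions, createNodeId, createContentDigest }) => {
  const { createNode } = actions
  const pool = new Pool({ connectionString: process.env.SCRAPER_DATABASE_URL })
  const { rows } = await pool.query(
    "SELECT org_id, twitter_followers, climatescape_rank FROM org_enrichment_latest"
  )
  rows.forEach((row) => {
    createNode({
      ...row,
      id: createNodeId(`OrgEnrichment-${row.org_id}`),
      internal: {
        type: "OrgEnrichment",
        contentDigest: createContentDigest(row),
      },
    })
  })
  await pool.end()
}
```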

bloudermilk commented 4 years ago

@leventov can we talk about this before you get too deep? I think we should go with your original idea of pushing back to Airtable. See my comments in #87

bloudermilk commented 4 years ago

@leventov shipped our first version of this with a nice framework in place to add more to it. Closing this epic as we've broken it down into smaller issues #140 #141