climatescape / climatescape.org

Discover the organizations solving climate change
https://climatescape.org/

Epic: automated data scraping pipeline #40

Closed leventov closed 4 years ago

leventov commented 4 years ago

Goals:

  1. Rank the organizations to add/review based on some importance indicators: public/private, market cap, assets under management, headcount, Facebook followers, Twitter followers, etc. (whatever we can scrape automatically); see the rough scoring sketch after the plan below
  2. Automatically refresh the org entries stored in Airtable, in background

Plan:

  1. What is the architecture of the project? On what hosting does it run? (ask @bloudermilk). Describe this in Notion (@bloudermilk / @leventov)
  2. Determine what scheduling/execution facilities can be used there to set up automatic tasks, or if none, what third-party automation/serverless service we can employ (or, perhaps, that we should go all-in for a hosting or a cloud account). (@bloudermilk / @leventov)
  3. Research the Airtable API for hooks on entry addition, so that we can schedule a scraping job that produces an initial rank for the org and places it in priority order in the content queue. Also research how to update data and tables via the Airtable API. (@leventov)
  4. Based on the above info gathered, create a more detailed design for the automation pipeline. (@leventov)
  5. Implement the part of initial info scraping with one source, Twitter followers. (@leventov)
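
As a rough illustration of goal 1 (and a place to plug in the Twitter follower counts from step 5), a naive scoring function might look like the sketch below. The field names and weights are placeholders, not a settled formula:

```js
// Hypothetical "Climatescape rank" helper: combines whatever indicators we
// manage to scrape into a single score. Field names and weights are
// placeholders, not a settled formula.
function climatescapeRank(org) {
  const log = (x) => (x > 0 ? Math.log10(x) : 0)

  return (
    2 * log(org.twitterFollowers || 0) +
    1 * log(org.facebookFollowers || 0) +
    1 * log(org.headcount || 0) +
    3 * log(org.marketCapUsd || org.assetsUnderManagementUsd || 0) +
    (org.isPublic ? 1 : 0)
  )
}

// Example: a fresh submission where only Twitter data has been scraped so far
console.log(climatescapeRank({ twitterFollowers: 12000, isPublic: false }))
```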
bloudermilk commented 4 years ago

@leventov great start outlining this epic! I will put in some work documenting the current architecture and would love to chat with you about the best options for expanding upon that based on the requirements of this component.

> Research Airtable API for hooks

Airtable doesn't have hooks natively, but we have two options here:

There is also one prerequisite step in the pipeline that would enable the others, which is Crunchbase ODM integration. Right now we're doing this in bulk using a Python script I wrote, but we could also include it in this pipeline using their HTTP API. It would be pretty amazing actually, enabling us to take a user submission with only a name and domain and fetch everything else automatically!

leventov commented 4 years ago

So Netlify, which we use for CI and website hosting, also supports serverless functions. I will explore whether they have scheduling options (apart from merely reacting to activity on the website, such as somebody submitting a new org). If scheduling is possible, I think we should definitely use Netlify functions to automate data scraping.
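
For context, Netlify functions are Lambda-style handlers, so a submission-triggered function might look roughly like this sketch (the file name and payload shape are assumptions):

```js
// functions/org-submitted.js (file name is illustrative)
// Netlify functions are Lambda-style handlers; this one would be invoked when
// a new org is submitted and is where a scraping job could be triggered.
exports.handler = async (event) => {
  if (event.httpMethod !== "POST") {
    return { statusCode: 405, body: "Method Not Allowed" }
  }
  const { orgId, url } = JSON.parse(event.body || "{}")
  // TODO: kick off the scraping job for { orgId, url } here
  return { statusCode: 202, body: JSON.stringify({ queued: true, orgId, url }) }
}
```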

I will also research Zapier and other hook options.

bloudermilk commented 4 years ago

@leventov I'm looking at Netlify and Zapier pricing now. It looks like we'll be fine for Netlify with the free tier (125k req/mo for functions, 100hrs runtime) and Zapier could work (100 tasks/mo) if we use their scheduler on a daily basis. We actually use Zapier right now to deploy daily.

bloudermilk commented 4 years ago

@leventov I'm going to write up our architecture docs now, but wanted to share one thought I had...

I think there's a strong case for avoiding Airtable for storing the enrichment data, such as Twitter profiles. A few reasons for this:

I would be happy to pay $9/mo for Heroku's Hobby Postgres tier. I've used it on many projects and it's rock solid, fast, and zero-maintenance. It would give us a lot more flexibility to process all the enrichment data quickly without having to worry about the above limitations.

The one exception would be Crunchbase data, which is critical to our content workflow and already stored in Airtable.

bloudermilk commented 4 years ago

@leventov FYI I included the architecture writeup in our README:

https://github.com/climatescape/climatescape.org#architecture

leventov commented 4 years ago

After researching Netlify functions I actually lean toward not using them:

I'm thinking about using Apify:

When somebody submits an entry through the website, we can use a Netlify function to kick off an Apify scraping job via the Apify API.
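
For illustration, kicking off an actor run from that Netlify function would be a single call to Apify's run-actor endpoint; the actor ID, input shape, and env var names below are assumptions:

```js
// Netlify function helper: start an Apify actor run for a newly submitted org.
// The actor ID, input shape, and env var names are assumptions.
const fetch = require("node-fetch")

async function startScrapeRun(orgId, url) {
  const actorId = process.env.APIFY_ACTOR_ID
  const token = process.env.APIFY_TOKEN
  const res = await fetch(
    `https://api.apify.com/v2/acts/${actorId}/runs?token=${token}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ orgId, url }), // passed to the actor as its input
    }
  )
  if (!res.ok) throw new Error(`Apify run failed: ${res.status}`)
  return res.json() // includes the run ID, which we could store for later inspection
}
```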

Scheduling can be done via either Zapier or Apify itself.

Drawbacks of Apify:

I think these two are important drawbacks, but it doesn't seem to me that Netlify functions will fit the bill.

In terms of pricing, Apify's $49/mo plan should be sufficient, I think (or maybe even the free one). I don't think this is important; for now I can pay for it myself. In the future, if we set up Patreon or something like that for Climatescape, I think it won't be a problem.

bloudermilk commented 4 years ago

@leventov thanks for looking into Apify, it's super interesting. Here are my thoughts:

A simple NodeJS API fetcher + Postgres datastore running on Heroku makes the most sense to me, but that could be because that's the stack I'm most comfortable with personally. We could host it for a few bucks a month and it would be super flexible for other projects in the future.

Thoughts?

leventov commented 4 years ago

I'm concerned about IP detection/rate limiting and other anti-scraping protections on the websites. If we bump into some of them with a custom solution on top of bare Node.js, it might take significant effort or added complexity to overcome them. Apify helps hedge this risk.

I will research whether there are other services, like ScraperApi, that help deal with this complexity in more of a library than a framework fashion.

bloudermilk commented 4 years ago

@leventov the APIs we're interested in leveraging are designed to be heavily queried for use cases exactly like ours. For example, Twitter's users/show has a limit of 900 requests/15 minutes. Google's Geocoder, another endpoint we're interested in, has a limit of 50 requests/second. Crunchbase doesn't publish rate limits, but I'd guess they're generous; if not, we can use their CSV export like we do manually today.
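
For reference, a direct call to that users/show endpoint is only a few lines of Node; the env var name and error handling below are illustrative:

```js
// Direct call to Twitter's users/show (v1.1) to get a follower count.
// Env var name and error handling are illustrative.
const fetch = require("node-fetch")

async function getTwitterFollowers(screenName) {
  const res = await fetch(
    `https://api.twitter.com/1.1/users/show.json?screen_name=${encodeURIComponent(screenName)}`,
    { headers: { Authorization: `Bearer ${process.env.TWITTER_BEARER_TOKEN}` } }
  )
  if (res.status === 429) {
    throw new Error("Rate limited; retry after the 15-minute window resets")
  }
  if (!res.ok) throw new Error(`Twitter API error: ${res.status}`)
  const user = await res.json()
  return user.followers_count
}
```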

Some websites (e.g. LinkedIn) do have anti-scraping protection and I think one of these solutions would be great for them. But for anyone who has an open API, using it directly is the preferred method.

leventov commented 4 years ago

@bloudermilk I now agree with a Heroku app and a DB as the path forward, considering that the backend will likely gain some complexity over time with ML or other stuff. If we stumble upon scraping problems, I think we should be able to circumvent them with ScraperApi at the $29/mo pricing level.
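
If we do hit protections, that fallback would just mean routing the problematic request through ScraperApi's proxy endpoint; the exact query parameters in the sketch below are assumptions to double-check against their docs:

```js
// Possible fallback for sites with anti-scraping protection: route the request
// through ScraperApi's proxy endpoint instead of fetching directly. The query
// parameters here are assumptions to verify against their docs.
const fetch = require("node-fetch")

async function fetchViaScraperApi(targetUrl) {
  const proxied = `http://api.scraperapi.com/?api_key=${process.env.SCRAPERAPI_KEY}&url=${encodeURIComponent(targetUrl)}`
  const res = await fetch(proxied)
  if (!res.ok) throw new Error(`ScraperApi request failed: ${res.status}`)
  return res.text()
}
```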

Do you want to reuse your existing Heroku account, or should I create a new one?

bloudermilk commented 4 years ago

@leventov awesome. I'll set up a Heroku team and invite you to it!

leventov commented 4 years ago

I'm currently working on the following design:

bloudermilk commented 4 years ago

@leventov I think AMQP is overkill for this process. Given the nature of the job at hand, we'd be just as well served by a process that polls at some interval, with logic along the lines of the sketch below:
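
A minimal sketch of that polling approach (the interval, table, and column names are illustrative, not a settled schema):

```js
// Minimal polling worker: every few minutes, pick orgs whose enrichment data
// is missing or stale, scrape them, and append the results as new rows.
// The interval, table, and column names are illustrative.
const { Pool } = require("pg")
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

async function scrapeAndStore(org) {
  // placeholder: fetch the APIs we care about and INSERT new result rows
  console.log("would scrape", org.id, org.homepage)
}

async function pollOnce() {
  const { rows } = await pool.query(
    `SELECT id, homepage FROM organizations
     WHERE last_scraped_at IS NULL OR last_scraped_at < now() - interval '7 days'
     ORDER BY last_scraped_at NULLS FIRST
     LIMIT 10`
  )
  for (const org of rows) {
    await scrapeAndStore(org)
  }
}

setInterval(() => pollOnce().catch(console.error), 5 * 60 * 1000)
```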

Your suggestion to always store the results as new rows is a great call. It will be nice to have the historical record here and avoid UPDATEs.

I would like to see the key information we're actually looking for synthesized into another table, be it a VIEW or a normal table. We should be able to access this reduced information from Gatsby directly via some API (REST or GraphQL) or a CSV export. I don't think we need/want to push this back to Airtable.
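
One hedged way to build that reduced layer on top of append-only result rows is a Postgres VIEW that keeps only the latest scraped row per organization; the table and column names here are guesses at the eventual schema:

```js
// One way to build the "reduced" layer on top of append-only result rows:
// a Postgres VIEW that keeps only the latest scraped row per organization.
// Table and column names are guesses at the eventual schema.
const { Pool } = require("pg")
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

const createLatestView = `
  CREATE OR REPLACE VIEW org_enrichment_latest AS
  SELECT DISTINCT ON (org_id)
         org_id, twitter_followers, scraped_at
  FROM   twitter_scrape_results
  ORDER  BY org_id, scraped_at DESC;
`

pool.query(createLatestView).then(() => pool.end())
```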

I'd really like to avoid having to maintain an AMQP deployment for this. If I'm missing something and we must use queue logic for some reason, Postgres is more than capable of handling that kind of workflow with transactions. However, I suspect a simple polling mechanism will do the job just fine with much less complexity.

leventov commented 4 years ago

Heroku docs strongly push towards offloading any work dependent on potentially slow third-party requests to a background worker, and scraping is exactly this type of work.

The first target is not periodic scraping of all orgs. The first target is first-time scraping upon addition, so that we can prioritize org submissions for processing (or just future "care" in general) based on the "Climatescape rank". To do this, a Netlify hook will call something like `POST <heroku-url>/scrape {"orgId": ..., "url": ...}` on the Heroku web process, which will then post a job to the queue for background processing.
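
A hedged sketch of that Heroku web piece, with the queue client stubbed out (the route shape follows the POST above; everything else is illustrative):

```js
// Sketch of the Heroku web piece: a tiny Express app that accepts the Netlify
// hook's POST and hands the job to the background queue. enqueueScrape is a
// placeholder for whatever queue client we settle on.
const express = require("express")
const app = express()
app.use(express.json())

async function enqueueScrape(job) {
  console.log("would enqueue", job) // placeholder until the queue is wired up
}

app.post("/scrape", async (req, res) => {
  const { orgId, url } = req.body
  if (!orgId || !url) {
    return res.status(400).json({ error: "orgId and url are required" })
  }
  await enqueueScrape({ orgId, url })
  res.status(202).json({ queued: true })
})

app.listen(process.env.PORT || 3000)
```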

It doesn't seem to me that an AMQP deployment complicates matters much. In some respects it even helps to simplify things, because we don't have to deal with polling periods or priorities (e.g. when a one-time request arrives in the middle of a periodic batch scrape), and there is some observability out of the box (the number of outstanding messages in the queue). So AMQP may even be a net positive in terms of complexity.

leventov commented 4 years ago

If you look at the building blocks Apify provides (which we discussed earlier), they are also a request queue plus relational storage for results. So designing the scraping around a queue should be a good fit.

leventov commented 4 years ago

> I would like to see the key information we're actually looking for synthesized into another table, be it a VIEW or a normal table. We should be able to access this reduced information from Gatsby directly via some API (REST or GraphQL) or a CSV export. I don't think we need/want to push this back to Airtable.

We need to push one field back to Airtable: "Climatescape rank", so that the people who deal with the content can sort the Airtable by it.
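
Pushing that single field back could be a small job at the end of the pipeline, roughly like this sketch using the official airtable package (base, table, and field names are assumptions):

```js
// Small job at the end of the pipeline: write the computed rank back to
// Airtable with the official airtable package. Base, table, and field names
// are assumptions.
const Airtable = require("airtable")

const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY }).base(
  process.env.AIRTABLE_BASE_ID
)

async function pushRank(recordId, rank) {
  await base("Organizations").update(recordId, { "Climatescape Rank": rank })
}
```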

bloudermilk commented 4 years ago

@leventov thanks for the additional details.

To clarify, I'm also advocating for a background worker–just trying to keep things simple. Mainly I'm hesitant to introduce technology I'm not intimately comfortable with, since it's ultimately up to me to maintain this. But I also think picking the simplest tool that can get the job done is generally the best strategy when designing these kinds of systems.

I see where you're coming from with the queue model and agree this is the "right way" to structure jobs like this. I guess I was hoping we could knock out an MVP that worked w/o additional infrastructure, but I think you're right to want to include it. I know there are many queue libraries for JS that use a backing data store to manage jobs/retries/etc.

If we're going the queue route, I would strongly prefer to back it with a datastore I'm personally comfortable maintaining. That would be either Postgres or Redis, both of which make a fine choice for a queue with the kind of volume we're looking for. If we manage to keep it in Postgres that gives us one less service to depend on, onboard, and pay for.

I'd like to keep working on this design with you. Maybe we can catch up on a call this week?

leventov commented 4 years ago

Yeah, the storage behind the queue doesn't matter that much, and adding a new service into the mix is definitely not free (I've already started to feel this, having some difficulties setting up RabbitMQ in Docker Compose for local dev), so I'll try to use https://github.com/timgit/pg-boss instead (PostgreSQL-based queues). Not https://github.com/graphile/worker, because I don't see that it supports multiple queues, while https://github.com/timgit/pg-boss does.
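
A minimal pg-boss wiring with two named queues might look like the sketch below; the queue names and payload shape are placeholders, and newer pg-boss versions rename publish/subscribe to send/work:

```js
// Minimal pg-boss wiring with two named queues. Queue names and payload shape
// are placeholders; newer pg-boss versions rename publish/subscribe to send/work.
const PgBoss = require("pg-boss")

async function main() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()

  // Consumer side: one handler per queue
  await boss.subscribe("scrape-initial", async (job) => {
    console.log("initial scrape for", job.data.orgId)
  })
  await boss.subscribe("scrape-periodic", async (job) => {
    console.log("periodic refresh for", job.data.orgId)
  })

  // Producer side, e.g. called from the POST /scrape handler
  await boss.publish("scrape-initial", { orgId: "rec123", url: "https://example.com" })
}

main().catch(console.error)
```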

bloudermilk commented 4 years ago

@leventov pg-boss looks like a great choice–nice find!

leventov commented 4 years ago

The next goal I'm working on as part of this is figuring out how to make Netlify/Gatsby connect to the PostgreSQL database in Heroku and pull the scraped data during website generation.
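
A hedged sketch of what that could look like in gatsby-node.js, sourcing nodes from the Heroku Postgres at build time (the connection string, query, and node type are assumptions):

```js
// gatsby-node.js sketch: pull scraped data from the Heroku Postgres at build
// time and expose it to GraphQL as nodes. Connection string, query, and node
// type are assumptions.
const { Pool } = require("pg")

exports.sourceNodes = async ({ actions, createNodeId, createContentDigest }) => {
  const { createNode } = actions
  const pool = new Pool({ connectionString: process.env.SCRAPER_DATABASE_URL })
  const { rows } = await pool.query(
    "SELECT org_id, twitter_followers, climatescape_rank FROM org_enrichment_latest"
  )
  rows.forEach((row) => {
    createNode({
      ...row,
      id: createNodeId(`OrgEnrichment-${row.org_id}`),
      internal: {
        type: "OrgEnrichment",
        contentDigest: createContentDigest(row),
      },
    })
  })
  await pool.end()
}
```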

bloudermilk commented 4 years ago

@leventov can we talk about this before you get too deep? I think we should go with your original idea of pushing back to Airtable. See my comments in #87

bloudermilk commented 4 years ago

@leventov shipped our first version of this with a nice framework in place to add more to it. Closing this epic as we've broken it down into smaller issues #140 #141