leventov closed this issue 4 years ago
@leventov great start outlining this epic! I will put in some work documenting the current architecture and would love to chat with you about the best options for expanding upon that based on the requirements of this component.
Research Airtable API for hooks
Airtable doesn't have hooks natively, but we have two options here:
There is also one prerequisite step in the pipeline that would enable the others, which is Crunchbase ODM integration. Right now we're doing this in bulk using a Python script I wrote, but we could also include it in this pipeline using their HTTP API. It would be pretty amazing actually, enabling us to take a user submission with only a name and domain and fetch everything else automatically!
Netlify, which we use for CI and website hosting, also supports serverless functions. I will explore whether they have scheduling options (apart from merely reacting to activity on the website, such as somebody submitting a new org). If scheduling is possible, I think we should definitely use Netlify functions to automate data scraping.
I will also research Zapier and other hook options.
@leventov I'm looking at Netlify and Zapier pricing now. It looks like we'll be fine for Netlify with the free tier (125k req/mo for functions, 100hrs runtime) and Zapier could work (100 tasks/mo) if we use their scheduler on a daily basis. We actually use Zapier right now to deploy daily.
@leventov I'm going to write up our architecture docs now, but wanted to share one thought I had...
I think there's a strong case for avoiding Airtable for storing the enrichment data, such as Twitter profiles. A few reasons for this:
I would be happy to pay $9/mo for Heroku's Hobby Postgres tier. I've used it on many projects and it's rock solid, fast, and zero-maintenance. It would give us a lot more flexibility to process all the enrichment data quickly without having to worry about the above limitations.
The one exception would be Crunchbase data, which is critical to our content workflow and already stored in Airtable.
@leventov FYI I included the architecture writeup in our README:
https://github.com/climatescape/climatescape.org#architecture
After researching Netlify functions I actually lean toward not using them:
I'm thinking about using Apify:
When somebody submits an entry through the website, we can use a Netlify function to kick off an Apify scraping job via the Apify API.
Scheduling can be done via either Zapier or Apify itself.
Drawbacks of Apify:
I think these two are important drawbacks, but it still doesn't seem to me that Netlify functions will fit the bill.
In terms of pricing, Apify's $49/mo plan should be sufficient, I think (or maybe even the free one). I don't think this is a blocker; for now I can pay it myself. In the future, if we set up a Patreon or something like that for Climatescape, I think it won't be a problem.
@leventov thanks for looking into Apify, it's super interesting. Here are my thoughts:
A simple NodeJS API fetcher + Postgres datastore running on Heroku makes the most sense to me, but that could be because that's the stack I'm most comfortable with personally. We could host it for a few bucks a month and it would be super flexible for other projects in the future.
Thoughts?
I'm concerned with IP detection/rate limiting and other anti-scraping protections on the websites. If we bump into some of them with a custom solution on top of bare Node.js, it might take significant effort or added complexity to overcome them. Apify helps to hedge this risk.
I will research whether there are other services, like ScraperApi, which help deal with this complexity in more of a library than a framework fashion.
@leventov the APIs we're interested in leveraging are designed to be heavily queried for use cases exactly like ours. For example, Twitter's users/show has a limit of 900 requests/15 minutes. Google's Geocoder, another endpoint we're interested in, has a limit of 50 requests/second. Crunchbase doesn't publish rate limits but I'd guess they're generous; if not, we can use their CSV export like we do manually today.
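For a sense of scale, those published limits translate into a simple pacing rule. A minimal sketch (the helper name and shape are my own, not from any library):

```javascript
// Hypothetical sketch: pace API calls to stay under a published rate limit.
// 900 requests per 15-minute window averages out to one request per second,
// so enforcing a fixed minimum gap between calls is enough at our volume.
function makeThrottle(requestsPerWindow, windowMs) {
  const minGapMs = windowMs / requestsPerWindow; // 900 req / 15 min -> 1000 ms
  let nextAllowedAt = 0;
  // Returns how many ms the caller should sleep before issuing the next call.
  return function reserve(nowMs) {
    const waitMs = Math.max(0, nextAllowedAt - nowMs);
    nextAllowedAt = Math.max(nowMs, nextAllowedAt) + minGapMs;
    return waitMs;
  };
}

// Usage idea: const wait = makeThrottle(900, 15 * 60 * 1000);
// then sleep wait(Date.now()) ms before each users/show request.
```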
Some websites (e.g. LinkedIn) do have anti-scraping protection and I think one of these solutions would be great for them. But for any site that has an open API, using it directly is the preferred method.
@bloudermilk I now agree with a Heroku app and a DB as the path forward, considering that the backend will likely gain some complexity over time with ML or other stuff. If we stumble upon scraping problems, I think we should be able to circumvent them with ScraperApi at the $29/mo pricing level.
Do you want to reuse your existing Heroku account, or should I create a new one?
@leventov awesome. I'll set up a Heroku team and invite you to it!
I'm currently working on the following design:
@leventov I think AMQP is overkill for this process. Given the nature of the job at hand, we'd be just as well served by a process that polls at some interval, with logic like:
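Roughly this (a sketch with an in-memory array standing in for the Postgres table; all names are placeholders):

```javascript
// Hypothetical sketch of the polling approach: on each tick, grab the next
// org that hasn't been scraped recently, process it, then sleep. An
// in-memory array stands in for the Postgres orgs table here.
const STALE_AFTER_MS = 24 * 60 * 60 * 1000; // re-scrape at most once a day

function nextStaleOrg(orgs, nowMs) {
  // SQL equivalent: ORDER BY last_scraped_at NULLS FIRST LIMIT 1, filtered
  // to rows that are missing or older than the staleness window.
  const due = orgs.filter(
    o => o.lastScrapedAt == null || nowMs - o.lastScrapedAt > STALE_AFTER_MS
  );
  due.sort((a, b) => (a.lastScrapedAt ?? 0) - (b.lastScrapedAt ?? 0));
  return due[0] ?? null;
}

async function pollOnce(orgs, scrape, nowMs) {
  const org = nextStaleOrg(orgs, nowMs);
  if (!org) return false;          // nothing to do this tick
  await scrape(org);               // fetch Twitter/Crunchbase/etc. for this org
  org.lastScrapedAt = nowMs;       // in SQL: UPDATE orgs SET last_scraped_at = now()
  return true;
}
```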
Your suggestion to always store the results as new rows is a great call. It will be nice to have the historical record here and avoid `UPDATE`s.
I would like to see the key information we're actually looking for synthesized into another table, be it a `VIEW` or a normal table. We should be able to access this reduced information from Gatsby directly via some API (REST or GraphQL) or a CSV export. I don't think we need/want to push this back to Airtable.
I'd really like to avoid having to maintain an AMQP deployment for this. If I'm missing something and we must use queue logic for some reason, Postgres is more than capable of handling that kind of workflow with transactions. However, I suspect a simple polling mechanism will do the job just fine with much less complexity.
Heroku docs strongly push towards offloading any work dependent on potentially slow third-party requests to a background worker, and scraping is exactly this type of work.
The first target is not all-orgs periodic scraping. The first target is upon-addition, first-time scraping, in order to be able to prioritize org submissions for processing (or just future "care" in general) based on "Climatescape rank". To do this, a Netlify hook will call something like `POST <heroku-url>/scrape {"orgId": ..., "url": ...}` on the Heroku web dyno, which will then post a job to the queue for background processing.
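As a sketch, the web endpoint would mostly just validate the payload before enqueueing. Field names follow the example above; the returned job shape and error messages are assumptions:

```javascript
// Hypothetical sketch: validate the POST /scrape payload sent by the Netlify
// hook before the Heroku web dyno posts a job to the queue.
function buildScrapeJob(body) {
  if (!body || body.orgId == null) {
    return { ok: false, error: 'orgId is required' };
  }
  let parsed;
  try {
    parsed = new URL(String(body.url)); // throws on malformed URLs
  } catch (e) {
    return { ok: false, error: 'url is not a valid URL' };
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return { ok: false, error: 'url must be http(s)' };
  }
  // On success the caller hands job.data to the background queue.
  return { ok: true, data: { orgId: body.orgId, url: parsed.href } };
}
```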
It doesn't seem to me that an AMQP deployment complicates matters much. In some other respects, it even helps to simplify things: we don't have to deal with periods and priorities ourselves (e.g. when a one-time request arrives in the middle of a periodic batch scrape), and there is some observability out of the box (the number of outstanding messages in the queue). So AMQP may even be a net positive in terms of complexity.
If you look at what building blocks Apify provides (which we discussed earlier), it's also a request queue plus relational storage for results. So designing scraping around a queue should be a good fit.
> I would like to see the key information we're actually looking for synthesized into another table, be it a VIEW or normal table. We should be able to access this reduced information from Gatsby directly via some API (REST or GraphQL) or a CSV export. I don't think we need/want to push this back to Airtable.
We need to push one field back to Airtable: "Climatescape rank", so that the people who deal with the content can sort the Airtable by it.
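For reference, that one-field push-back could be batched like this (a sketch; the field name comes from above, the table and helper names are assumptions, and Airtable's records API accepts at most 10 records per update request):

```javascript
// Hypothetical sketch: build the request bodies for writing "Climatescape
// rank" back to Airtable. The records update API takes up to 10 records per
// call, so the updates are split into batches of 10.
function buildRankUpdates(ranks) {
  // ranks: Map of Airtable record id -> numeric rank
  const records = [...ranks].map(([id, rank]) => ({
    id,
    fields: { 'Climatescape rank': rank },
  }));
  const batches = [];
  for (let i = 0; i < records.length; i += 10) {
    batches.push({ records: records.slice(i, i + 10) });
  }
  // Each batch would be one PATCH to the Airtable table endpoint.
  return batches;
}
```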
@leventov thanks for the additional details.
To clarify, I'm also advocating for a background worker–just trying to keep things simple. Mainly I'm hesitant to introduce technology I'm not intimately comfortable with, since it's ultimately up to me to maintain this. But I also think picking the simplest tool that can get the job done is generally the best strategy when designing these kinds of systems.
I see where you're coming from with the queue model and agree this is the "right way" to structure jobs like this. I guess I was hoping we could knock out an MVP that worked w/o additional infrastructure, but I think you're right to want to include it. I know there are many queue libraries for JS that use a backing data store to manage jobs/retries/etc.
If we're going the queue route, I would strongly prefer to back it with a datastore I'm personally comfortable maintaining. That would be either Postgres or Redis, both of which make a fine choice for a queue with the kind of volume we're looking for. If we manage to keep it in Postgres that gives us one less service to depend on, onboard, and pay for.
I'd like to keep working on this design with you. Maybe we can catch up on a call this week?
Yeah, the storage behind the queue doesn't matter that much, and adding a new service into the mix is definitely not free (I already started to feel this, having some difficulties setting up RabbitMQ in Docker Compose for local dev), so I'll try to use https://github.com/timgit/pg-boss instead (PostgreSQL-based queues). Not https://github.com/graphile/worker, because I don't see that it supports multiple queues, while https://github.com/timgit/pg-boss does.
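To illustrate why multiple queues matter here, a minimal in-memory stand-in (pg-boss would back each named queue with Postgres instead; the queue names are hypothetical):

```javascript
// Hypothetical in-memory stand-in for the multi-queue setup: two named
// queues, consumed in priority order, so a one-off submission scrape never
// waits behind a periodic batch run.
function makeQueues(names) {
  const queues = new Map(names.map(n => [n, []]));
  return {
    publish(name, payload) {
      queues.get(name).push(payload);
    },
    // Drain earlier-listed (higher-priority) queues first.
    next() {
      for (const name of names) {
        const q = queues.get(name);
        if (q.length > 0) return { queue: name, payload: q.shift() };
      }
      return null;
    },
  };
}
```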
@leventov pg-boss looks like a great choice–nice find!
The next goal I'm working on as part of this is figuring out how to make Netlify/Gatsby connect to the PostgreSQL database in Heroku and pull the scraped data during website generation.
@leventov can we talk about this before you get too deep? I think we should go with your original idea of pushing back to Airtable. See my comments in #87
@leventov shipped our first version of this with a nice framework in place to add more to it. Closing this epic as we've broken it down into smaller issues #140 #141
Goals:
Plan: