Data4Democracy / internal-displacement

Studying news events and internal displacement.

Infrastructure Plan #86

Open WanderingStar opened 7 years ago

WanderingStar commented 7 years ago

Here's a sketch of an infrastructure plan:

Development

Scrapers run locally (on developer machine) in Docker for prototyping (internal-displacement repo)
- Write to local DB in Docker
- Can read scrape requests from the database, but most scrapes will be triggered manually (through notebooks or scripts)

Web app runs locally in Docker for prototyping (internal-displacement-web repo)
- Reads local DB in Docker
- Writes scrape requests to the database

IDETECT Preparation

Scraper and Web app Docker containers deployed to an AWS instance or similar cloud-hosted infrastructure
- Read and write to an Amazon RDS database
- Large batch of scrape requests input into the database, read from there and processed by the Scraper(s)

WanderingStar commented 7 years ago

Comments? Suggestions?

WanderingStar commented 7 years ago

The purpose of using Docker is to encourage folks to properly encapsulate any dependencies (textract has a bunch of non-Python dependencies, for example), so that when the time comes to deploy to a cloud instance we don't have to wrangle them.

georgerichardson commented 7 years ago

@WanderingStar This looks good to me. By "read scrape requests from the database" do you mean that new additions to the database could automatically trigger scraping?

WanderingStar commented 7 years ago

Yes, exactly. The idea is that we would have a table listing scraping requests and their statuses. The scraper would look for any requests with a NEW status and scrape them, changing the one it was working on to IN PROGRESS while it worked, and to COMPLETE, DUPLICATE, etc. when it's done.

This allows us to start more than one scraper if necessary during the initial data ingest and not have them step on each other's toes.

One possible downside to this scheme: if you submit a URL through the web interface and want to wait and see the results, the web server will have to repeatedly check (poll) to see whether the request has been completed. This is potentially inefficient if there are a lot of individual URL submissions, but I don't think that's an anticipated scenario right now.
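
A minimal sketch of that status-column idea, using the stdlib sqlite3 module just for illustration (the table and column names are made up, not our actual schema):

```python
import sqlite3

conn = sqlite3.connect("scrape_queue.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS scrape_request (
        url TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'NEW'  -- NEW / IN PROGRESS / COMPLETE / DUPLICATE
    )
""")
conn.commit()


def claim_next_request(conn):
    """Pick one NEW request and mark it IN PROGRESS before returning it."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT url FROM scrape_request WHERE status = 'NEW' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE scrape_request SET status = 'IN PROGRESS' WHERE url = ?",
            (row[0],),
        )
        return row[0]


def finish_request(conn, url, status="COMPLETE"):
    """Mark a claimed request COMPLETE (or DUPLICATE, etc.) when done."""
    with conn:
        conn.execute(
            "UPDATE scrape_request SET status = ? WHERE url = ?",
            (status, url),
        )
```

If several scrapers run at once, the claim step needs to be atomic (e.g. SELECT ... FOR UPDATE on Postgres) so that two workers can't grab the same URL.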

WanderingStar commented 7 years ago

Incidentally, this is a fairly common "work queue" pattern.

domingohui commented 7 years ago

Hi @WanderingStar, thanks for the infra plan suggestion! I don't have much experience with a workflow like this, so I'm just trying to understand it. Does the scraper have to periodically check with the DB for new scrape requests? I assume the scraper will be running all the time?

Also, regarding

One possible downside to this scheme

do you think we could abstract out a backend layer just to communicate with the DB, which could be used both by the scraper to update the tables and by the frontend web app to query the results? Regarding your efficiency concern specifically, the abstracted backend could just send a response to the frontend whenever a scrape request is finished by the scraper and updated in the DB.

This may sound like unnecessary extra work, but another benefit is that neither the scraper nor the web app would be affected if we decided to switch to another DB service. Also, the DB query API could be simplified (e.g. SELECT and UPDATE statements wouldn't have to live in the scraper/web app anymore), which makes it easier to maintain.
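
As a very rough sketch of what I mean (the names are made up, and a dict stands in for the real DB just to show the shape of the interface):

```python
class RequestStore:
    """The only code that knows how scrape requests are stored. The scraper
    and the web app both call these methods instead of writing SQL, so
    swapping SQLite for RDS (or anything else) only touches this class."""

    def __init__(self):
        self._statuses = {}  # url -> status

    def submit(self, url):
        """Web app: record a new scrape request."""
        self._statuses.setdefault(url, "NEW")

    def next_pending(self):
        """Scraper: claim the next NEW request, marking it IN PROGRESS."""
        for url, status in self._statuses.items():
            if status == "NEW":
                self._statuses[url] = "IN PROGRESS"
                return url
        return None

    def mark_complete(self, url):
        """Scraper: record that a request has been processed."""
        self._statuses[url] = "COMPLETE"

    def status_of(self, url):
        """Web app: check whether a submitted URL is done yet."""
        return self._statuses.get(url)
```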

What do you think? Please let me know if my wording is unclear.

WanderingStar commented 7 years ago

I strongly recommend using Data Access Objects to talk to the DB, rather than writing SELECT and UPDATE statements. I'm planning to use SQLAlchemy on the Python side. You can see what that usage would look like here: https://github.com/Data4Democracy/internal-displacement/blob/schema/internal_displacement/tests/test_model.py (note that this branch is still a work in progress and not merged yet).

So you create an Article object, and assign its list of Categories, and all of the DB interaction is handled for you.
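
As a self-contained illustration of that style (the actual table and column definitions live in the schema branch linked above; the names here are simplified stand-ins):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Table, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

# many-to-many link between articles and categories
article_category = Table(
    "article_category", Base.metadata,
    Column("article_id", Integer, ForeignKey("article.id")),
    Column("category_id", Integer, ForeignKey("category.id")),
)


class Category(Base):
    __tablename__ = "category"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Article(Base):
    __tablename__ = "article"
    id = Column(Integer, primary_key=True)
    url = Column(String)
    categories = relationship("Category", secondary=article_category)


engine = create_engine("sqlite://")  # throwaway in-memory DB for the example
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# create an Article, assign its Categories, and let SQLAlchemy handle the SQL
article = Article(url="http://example.com/some-story")
article.categories = [Category(name="disaster"), Category(name="conflict")]
session.add(article)
session.commit()
```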

WanderingStar commented 7 years ago

And to answer your question about the scraper, yes, we'd want to have a scraper periodically checking the DB for URLs that need scraping. But, of course, if you're running Python interactively, you can just tell the scraper to scrape a particular URL.
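
Roughly like this (the function names are placeholders, not the actual scraper API):

```python
import time


def scrape(url):
    """Placeholder for the real scraping logic."""
    print("scraping", url)


def pending_urls():
    """Placeholder: would ask the DB for requests still marked NEW."""
    return []


def run_forever(poll_seconds=60):
    """Daemon-style scraper: poll the DB on an interval."""
    while True:
        for url in pending_urls():
            scrape(url)
        time.sleep(poll_seconds)


# ...or, working interactively, just scrape one URL directly:
# scrape("http://example.com/some-article")
```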

domingohui commented 7 years ago

Thanks @WanderingStar! The reason I brought it up is that I see your PR is still using explicit SQL commands. But if you're already working on something like SQLAlchemy or some sort of DAO, then that's great!

WanderingStar commented 7 years ago

Gotcha. #77 was the "get the AWS RDS DB working just like the SQLite DB" PR. There will be a PR forthcoming off of the schema branch that uses SQLAlchemy. (see also https://github.com/Data4Democracy/internal-displacement/issues/73)

domingohui commented 7 years ago

Just out of curiosity, does this mean the sqlite3 dependency in SQLArticleInterface will eventually be replaced with SQLAlchemy?

WanderingStar commented 7 years ago

Sort of. SQLAlchemy uses the low-level drivers like sqlite3 and psycopg2 to talk to SQLite and PostgreSQL databases. You won't have to interact with anything but the model classes in your code.
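
For example, only the connection URL changes between the two (the host, credentials, and database name below are placeholders):

```python
from sqlalchemy import create_engine

# Only the connection URL changes; the model code stays the same.
sqlite_engine = create_engine("sqlite:///local.db")  # driven by sqlite3
postgres_engine = create_engine(
    "postgresql+psycopg2://user:password@rds-host:5432/idetect"  # driven by psycopg2
)
```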

domingohui commented 7 years ago

Sounds good. Just making sure because I didn't see the changes :)