Data4Democracy / internal-displacement

Studying news events and internal displacement.

Python process to check for new URLs and run the pipeline on them #127

Open WanderingStar opened 7 years ago

WanderingStar commented 7 years ago

We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of NEW. We need a process that runs on the back end, looks for such rows, and kicks off the scraping & interpretation pipeline.

Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
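A minimal sketch of that long-running poller, assuming an `article` table with `id`, `url`, and `status` columns (the table and status names are illustrative, not the project's actual schema), using SQLite here just to keep the example self-contained:

```python
import sqlite3
import time

POLL_INTERVAL = 60  # seconds between wake-ups; could be made configurable


def process_new_articles(conn, pipeline):
    """Find rows with status NEW and run the pipeline on each one.

    `pipeline` stands in for the scraping & interpretation code,
    which is assumed to already be loaded (dependencies, model, etc.).
    Returns the number of rows processed.
    """
    rows = conn.execute(
        "SELECT id, url FROM article WHERE status = 'NEW'"
    ).fetchall()
    for article_id, url in rows:
        pipeline(url)
        conn.execute(
            "UPDATE article SET status = 'PROCESSED' WHERE id = ?",
            (article_id,),
        )
    conn.commit()
    return len(rows)


def run_worker(conn, pipeline):
    """Long-running loop: sleep most of the time, wake up to poll the DB."""
    while True:
        process_new_articles(conn, pipeline)
        time.sleep(POLL_INTERVAL)
```

The expensive environment setup happens once, before `run_worker` starts, so each wake-up only pays the cost of a cheap `SELECT` when there is nothing to do.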

simonb83 commented 7 years ago

I love this idea. I also think it would be a good idea to split the pipeline into two separate parts: URL parsing, and classification / report extraction.

Perhaps we could then have two processes: one that looks for URLs with status "New" and executes the scraping code, and another that looks for URLs with status "Fetched" and executes the remaining classification / report extraction piece.