Open WanderingStar opened 7 years ago
I love this idea. I also think it would be a good idea to split the Pipeline into two separate parts: URL Parsing & Classification / Report Extraction.
Perhaps we could then have two processes: one that looks for URLs with status "New" and executes the scraping code, and another that looks for URLs with status "Fetched" and executes the remaining classification / report extraction piece.
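A minimal sketch of the proposed split, where each stage is keyed off the row's status and advances it to the next. The status values, field names, and handler functions here are illustrative assumptions, not from the actual codebase:

```python
def scrape(url):
    # placeholder for the scraping stage (hypothetical)
    return "<html>...</html>"

def interpret(content):
    # placeholder for classification / report extraction (hypothetical)
    return {"reports": []}

def process_row(row):
    """Advance a row one stage based on its current status."""
    if row["status"] == "New":
        row["content"] = scrape(row["url"])
        row["status"] = "Fetched"
    elif row["status"] == "Fetched":
        row["result"] = interpret(row["content"])
        row["status"] = "Processed"
    return row
```

Keeping the two stages decoupled this way means the scraper and the interpreter can run as independent processes, each just querying for its own input status.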
We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of "New". We need a process running on the back end that looks for such rows and kicks off the scraping & interpretation pipeline.
Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time `sleep`ing and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
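The long-running poller described above might look something like this. The interval, the query helper, and the pipeline entry point are all hypothetical names for illustration; `max_iterations` exists only so the loop can be exercised in a test rather than running forever:

```python
import time

POLL_INTERVAL_SECONDS = 60  # hypothetical default; should be configurable

def find_new_rows():
    # placeholder for a DB query like:
    #   SELECT * FROM articles WHERE status = 'New'
    return []

def run_pipeline(row):
    # placeholder for the scraping & interpretation pipeline
    pass

def poll_forever(interval=POLL_INTERVAL_SECONDS, max_iterations=None):
    """Load the expensive environment once, then wake up periodically
    to look for new rows. Returns the number of rows processed
    (only reachable when max_iterations is set, e.g. in tests)."""
    # expensive one-time setup (dependencies & model) would happen here
    processed = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        for row in find_new_rows():
            run_pipeline(row)
            processed += 1
        iterations += 1
        time.sleep(interval)
    return processed
```

Paying the model-loading cost once at startup, instead of per URL, is the main point of keeping this as a daemon rather than a cron-launched script.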