Data4Democracy / internal-displacement

Studying news events and internal displacement.

Extract Article features. Get and insert Reports to DB #101

Closed: domingohui closed this issue 7 years ago

domingohui commented 7 years ago

@simonb83 @georgerichardson Here's what I have so far. I'm not quite sure how date spans are processed for a Report. It seems they get converted into a string in the code we have so far, but I guess we will have to revise that when we update the code to use @WanderingStar's schema #100.

simonb83 commented 7 years ago

Hey, it's looking good. I think it would be worth taking a step back and thinking about the processing workflow for a second.

The main thing that concerns me is loading Interpreter(), as this requires loading a 2GB language model. I just want to make sure we are loading it at the right point, rather than having to reload it in various places.

In theory, I think the flow for a single URL is / should be:

  1. Receive URL
  2. Attempt to parse (and save content, title, domain etc. to database?)
  3. Load Interpreter
  4. Use Interpreter to extract:
    • Language
    • Reports
    • Locations etc.
  5. Update article attributes based on output from Interpreter

In my mind I think we should manage this workflow in pipeline.py and therefore only initialize Interpreter() in the pipeline rather than also in the Article class.
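To make the idea concrete, here's a rough sketch of that flow with Interpreter() initialized once in the pipeline and reused for every URL. The method names on Interpreter (extract_language, extract_reports, extract_locations) are assumptions for illustration, not the real API:

```python
class Interpreter:
    """Stand-in for the real Interpreter, which loads a ~2GB language model.
    The extract_* methods below are hypothetical placeholders."""
    def extract_language(self, content):
        return "en"
    def extract_reports(self, content):
        return []
    def extract_locations(self, content):
        return []

class Article:
    """Plain data model: parsed attributes only, no heavy dependencies."""
    def __init__(self, url, content="", title="", domain=""):
        self.url = url
        self.content = content
        self.title = title
        self.domain = domain
        self.language = None
        self.reports = []
        self.locations = []

def process_url(url, interpreter):
    # Steps 1-2: receive the URL and parse it (real code would fetch the
    # page and save content, title, domain etc. to the database here)
    article = Article(url, content="parsed text", title="t", domain="d")
    # Step 4: use the already-loaded Interpreter to extract features
    article.language = interpreter.extract_language(article.content)
    article.reports = interpreter.extract_reports(article.content)
    article.locations = interpreter.extract_locations(article.content)
    # Step 5: updated article attributes would now be persisted
    return article

# Step 3 happens once, up front: load the expensive Interpreter a single
# time in the pipeline, then reuse it across all URLs
interpreter = Interpreter()
articles = [process_url(u, interpreter) for u in ["http://example.com/a"]]
```

The key point is that Interpreter() lives outside process_url, so the 2GB model is loaded exactly once no matter how many URLs pass through the pipeline.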

Any thoughts?

domingohui commented 7 years ago

I was debating whether to use the interpreter in article.py or pipeline.py too. Now that I look at the updated schema, it makes sense to have it in the pipeline, since Article is just a data model.
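If Article is treated as a pure data model, one hypothetical way to express that (field names here are guesses based on the thread, not the actual schema) is a dataclass with no Interpreter dependency, so the heavy model loading stays confined to pipeline.py:

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    # Attributes mentioned in the thread; the real schema (#100) may differ
    url: str
    title: str = ""
    domain: str = ""
    content: str = ""
    language: str = ""
    reports: list = field(default_factory=list)
    locations: list = field(default_factory=list)
```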