Data4Democracy / internal-displacement

Studying news events and internal displacement.

Extract Article features. Get and insert Reports to DB #101

Closed: domingohui closed this issue 7 years ago

domingohui commented 7 years ago

@simonb83 @georgerichardson Here's what I have so far. I'm not quite sure how date spans are processed for a Report. It seems they get converted into a string in the code we have so far, but I guess we will have to revise that when we update the code to use @WanderingStar's schema #100.

simonb83 commented 7 years ago

Hey, it's looking good. I think it would be worth taking a step back and thinking about the processing workflow for a second.

The main thing that concerns me is loading Interpreter(), as this requires loading a 2GB language model. I just want to make sure we are loading it at the right point, rather than having to reload it in various places.

In theory, I think the flow for a single URL is / should be:

  1. Receive URL
  2. Attempt to parse (and save content, title, domain etc. to database?)
  3. Load Interpreter
  4. Use Interpreter to extract:
    • Language
    • Reports
    • Locations etc.
  5. Update article attributes based on output from Interpreter

In my mind I think we should manage this workflow in pipeline.py and therefore only initialize Interpreter() in the pipeline rather than also in the Article class.
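To make the idea concrete, here's a rough sketch of that flow with Interpreter() initialized once in the pipeline and reused for every URL. The method names on Interpreter (extract_language, extract_reports, extract_locations) are assumptions for illustration, not the real API:

```python
class Interpreter:
    """Stand-in for the real Interpreter, which loads a ~2GB language model.
    The extract_* methods below are hypothetical placeholders."""
    def extract_language(self, content):
        return "en"
    def extract_reports(self, content):
        return []
    def extract_locations(self, content):
        return []

class Article:
    """Plain data model: parsed attributes only, no heavy dependencies."""
    def __init__(self, url, content="", title="", domain=""):
        self.url = url
        self.content = content
        self.title = title
        self.domain = domain
        self.language = None
        self.reports = []
        self.locations = []

def process_url(url, interpreter):
    # Steps 1-2: receive the URL and parse it (real code would fetch the
    # page and save content, title, domain etc. to the database here)
    article = Article(url, content="parsed text", title="t", domain="d")
    # Step 4: use the already-loaded Interpreter to extract features
    article.language = interpreter.extract_language(article.content)
    article.reports = interpreter.extract_reports(article.content)
    article.locations = interpreter.extract_locations(article.content)
    # Step 5: updated article attributes would now be persisted
    return article

# Step 3 happens once, up front: load the expensive Interpreter a single
# time in the pipeline, then reuse it across all URLs
interpreter = Interpreter()
articles = [process_url(u, interpreter) for u in ["http://example.com/a"]]
```

The key point is that Interpreter() lives outside process_url, so the 2GB model is loaded exactly once no matter how many URLs pass through the pipeline.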

Any thoughts?

domingohui commented 7 years ago

I was debating whether to use the interpreter in article.py or pipeline.py too. Now that I look at the updated schema, it makes sense to have it in the pipeline, since Article is just a data model.
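If Article is treated as a pure data model, one hypothetical way to express that (field names here are guesses based on the thread, not the actual schema) is a dataclass with no Interpreter dependency, so the heavy model loading stays confined to pipeline.py:

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    # Attributes mentioned in the thread; the real schema (#100) may differ
    url: str
    title: str = ""
    domain: str = ""
    content: str = ""
    language: str = ""
    reports: list = field(default_factory=list)
    locations: list = field(default_factory=list)
```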