Closed domingohui closed 7 years ago
Hey, it's looking good. I think it would be good to take a step back and think about the processing workflow for a second.
The main thing that concerns me is loading Interpreter(), as this requires loading a 2GB language model. I just want to make sure we load it at the right point, rather than having to reload it in various places.
In theory, I think the flow for a single URL is / should be:
I think we should manage this workflow in pipeline.py and therefore only initialize Interpreter() in the pipeline, rather than also in the Article class.
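Roughly what I have in mind, as a sketch only. The Interpreter below is a stub standing in for the real class (it does not load any model), and interpret(), process_article(), run_pipeline() and the Article fields are hypothetical names I made up for illustration, not what is in the repo:

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


class Interpreter:
    """Stub for the real Interpreter; counts how often it is constructed."""

    instances_created = 0

    def __init__(self) -> None:
        # The real class would load the 2GB language model here,
        # which is why we only want this to happen once.
        Interpreter.instances_created += 1

    def interpret(self, text: str) -> List[str]:
        # Hypothetical method: the real API may look different.
        return text.split()


@dataclass
class Article:
    """Plain data model: holds content and never creates an Interpreter."""

    url: str
    text: str
    tokens: List[str] = field(default_factory=list)


def process_article(article: Article, interpreter: Interpreter) -> Article:
    # The already-loaded interpreter is passed in, not created here.
    article.tokens = interpreter.interpret(article.text)
    return article


def run_pipeline(pages: Iterable[Tuple[str, str]]) -> List[Article]:
    # Single initialization point for the whole run.
    interpreter = Interpreter()
    return [
        process_article(Article(url, text), interpreter)
        for url, text in pages
    ]
```

The point is just that the pipeline owns the one Interpreter instance and hands it to whatever needs it, so Article stays a simple container.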
Any thoughts?
I was debating whether to use the interpreter in article.py or pipeline.py too. Now that I look at the updated schema, it makes sense to have it in the pipeline, since Article is just a data model.
@simonb83 @georgerichardson Here's what I have so far. I'm not quite sure how date spans are processed for a Report. Based on the code we have so far, they seem to be converted into strings. But I guess we will have to revise that when we update the code to use @WanderingStar's schema #100.