DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

Change post object in order to avoid duplicate fetch #84

Open tmeshorer opened 7 years ago

tmeshorer commented 7 years ago

The goal is to avoid fetching a post a second time when it is already in Mongo. However, even when the post object is in Mongo, its HTML may or may not have been fetched.

Hence, the post has three states:

1) Not in Mongo.
2) In Mongo, but the HTML was not fetched. In this case the content field in the Post object contains the "summary".
3) In Mongo, with the HTML saved. In this case the content field in the Post object contains the HTML.

Hence, I suggest that we add a state field to the Post object, which takes one of the following values:

CREATED / WRANGLED / FETCHED
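A minimal sketch of what such a state field could look like, using only the standard library. The `PostState` enum, the simplified `Post` dataclass, and the mapping of each value to the three cases above are assumptions made for illustration; the real Post model in baleen is a Mongo document and would declare the field differently.

```python
from dataclasses import dataclass
from enum import Enum


class PostState(Enum):
    """Hypothetical state values, named after the proposal."""
    CREATED = "created"    # assumed: object exists but is not yet in Mongo
    WRANGLED = "wrangled"  # assumed: saved in Mongo, content holds the summary
    FETCHED = "fetched"    # assumed: saved in Mongo, content holds the HTML


@dataclass
class Post:
    """Simplified stand-in for the real Post document."""
    url: str
    content: str = ""
    state: PostState = PostState.CREATED
```

Storing the state explicitly means the code does not have to guess from the content field whether it holds a summary or the full HTML.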

Also add two methods:

is_in_mongo() and was_fetched()

And the following behavior:

1) ! is_in_mongo() -> wrangle the post and fetch the HTML.
2) is_in_mongo() && ! was_fetched() -> fetch the HTML and set the content to the HTML.
3) is_in_mongo() && was_fetched() -> get the post from Mongo and return it.
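The three cases can be sketched as follows. This is an illustration under stated assumptions, not baleen's actual code: `PostStore` is a hypothetical in-memory stand-in for the Mongo collection, `process` and its `fetch_html` callback are invented names, and `is_in_mongo` / `was_fetched` are written as free functions taking the store and URL rather than as methods on the real Post model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Post:
    """Simplified stand-in for the real Post document."""
    url: str
    content: str = ""
    fetched: bool = False


class PostStore:
    """Hypothetical in-memory replacement for the Mongo collection."""

    def __init__(self) -> None:
        self._posts: Dict[str, Post] = {}

    def get(self, url: str) -> Optional[Post]:
        return self._posts.get(url)

    def save(self, post: Post) -> None:
        self._posts[post.url] = post


def is_in_mongo(store: PostStore, url: str) -> bool:
    return store.get(url) is not None


def was_fetched(store: PostStore, url: str) -> bool:
    post = store.get(url)
    return post is not None and post.fetched


def process(store: PostStore, url: str, summary: str,
            fetch_html: Callable[[str], str]) -> Post:
    """Return the post for url, fetching its HTML at most once."""
    if not is_in_mongo(store, url):
        # Case 1: wrangle the post (content starts as the feed summary).
        store.save(Post(url=url, content=summary))
    post = store.get(url)
    if not was_fetched(store, url):
        # Cases 1 and 2: fetch the HTML and replace the summary with it.
        post.content = fetch_html(url)
        post.fetched = True
        store.save(post)
    # Case 3 falls through: the stored post is returned unchanged.
    return post
```

Calling `process` twice with the same URL invokes `fetch_html` only on the first call; the second call returns the stored post.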