dstl / baleen

Entity Extraction Text Processor
Apache License 2.0

Check for deleted documents #18

Closed jle123 closed 8 years ago

jle123 commented 8 years ago

Baleen will not re-ingest documents it has already seen if it is stopped and then restarted. If a file is deleted while Baleen is offline, Baleen will notice the change when it is next started and stop tracking that file. Re-adding the file will cause Baleen to ingest it again.

Requires 'removeDeletedDocs: true' for any applicable consumers in the pipeline file. Also requires a consumer 'PipelineComplete' to be added as the last consumer with arguments 'storeDocumentDetails: true' and 'updatePipelineStatus: true'.
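
As a rough illustration of the configuration described above, a pipeline file might look something like the following. Only the option names (`removeDeletedDocs`, `storeDocumentDetails`, `updatePipelineStatus`) and the `PipelineComplete` consumer come from this description; the reader, the folder path, the annotator and the choice of `Mongo` as the applicable consumer are assumptions made for the sake of the example.

```yaml
# Sketch only: the reader, folder path and annotator are placeholders.
collectionreader:
  class: FolderReader
  folders:
  - ./input            # hypothetical input folder

annotators:
- regex.Email          # placeholder annotator

consumers:
# Assumed here to be one of the "applicable consumers" mentioned above.
- class: Mongo
  removeDeletedDocs: true
# Must be the last consumer in the pipeline.
- class: PipelineComplete
  storeDocumentDetails: true
  updatePipelineStatus: true
```

The ordering matters: `PipelineComplete` is listed last so that it runs after every other consumer has handled the document.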

jbaker-dstl commented 8 years ago

Thanks for submitting this pull request. The functionality looks like it could be very useful. However, I'm hesitant to accept it as it currently stands for a few reasons.

Firstly, it introduces a dependency on Mongo into a number of components, including FolderReader (which is very widely used) and Elasticsearch (which has been used as an alternative to Mongo). I think for it to be really useful, we'd need to implement it in a generic way which could use Mongo but could also use other persistence mechanisms (similar to how the History works, perhaps?).

Secondly, it's perhaps something that should be more deeply linked to the Baleen core, such that it is part of the core functionality rather than requiring additional consumers to be added to pipelines? There are already places within the Baleen core that such functionality could hook into, similar to how the metrics work.

Finally, I think we'd want to make sure this was consistently implemented across all collection readers and consumers to avoid causing confusion.

@jle123, I don't know what your thoughts are on the above?

jle123 commented 8 years ago

I think you're right about allowing different persistence mechanisms to store information about already ingested documents. Perhaps where it's stored could be specified in the pipeline file, with Mongo as the default?

jbaker-dstl commented 8 years ago

I think that would be a sensible approach, as would trying to ensure it is applied in the same manner across all the collection readers.

Would you be happy for me to close this pull request, and perhaps you could open it as a feature request? I don't know who would be best placed to do this work or have the available time, but at least then it is documented somewhere and different approaches could be discussed.

jle123 commented 8 years ago

Yes, that would be OK. There's another feature in a new branch that I would like to add to Baleen, but it depends on this feature to some extent. I think we will implement your requested changes on this branch before pushing the new feature.