dstl / baleen

Entity Extraction Text Processor
Apache License 2.0

Add option to only store normalised entities #26

Closed ghost closed 8 years ago

ghost commented 8 years ago

This change adds a configuration option to the Elasticsearch consumer that restricts the entities stored in the database to those that have been normalised. By default, all entities are stored. Storage is restricted to normalised entities only if the configuration parameter is supplied and set to true.
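The behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this change: the `Entity` class, the `selectForStorage` method and the `onlyStoreNormalised` flag name are all hypothetical, and the sketch assumes an entity counts as normalised when its value differs from the text it covers in the document.

```java
import java.util.ArrayList;
import java.util.List;

public class NormalisedEntityFilter {

    /** Hypothetical stand-in for an extracted entity; not Baleen's actual type. */
    public static final class Entity {
        final String coveredText; // text exactly as it appears in the document
        final String value;       // value after any normalisation

        public Entity(String coveredText, String value) {
            this.coveredText = coveredText;
            this.value = value;
        }

        /** Assumption: an entity is normalised if its value was changed. */
        boolean isNormalised() {
            return !coveredText.equals(value);
        }
    }

    /**
     * Returns the entities that would be stored.
     * With the flag false (the default), every entity is kept.
     * With the flag true, only normalised entities are kept.
     */
    public static List<Entity> selectForStorage(List<Entity> entities,
                                                boolean onlyStoreNormalised) {
        if (!onlyStoreNormalised) {
            return new ArrayList<>(entities);
        }
        List<Entity> kept = new ArrayList<>();
        for (Entity e : entities) {
            if (e.isNormalised()) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```

With the flag left at false the filter is a no-op, which matches the stated default of storing all entities.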

This feature is useful when another persistent store, such as MongoDB, holds all the entities and Elasticsearch only needs the entity content for searching. In that case it is only necessary to store the unmodified full text and any normalised (modified) entities to provide the data needed for searching. Storing only the normalised entities reduces data duplication and the storage it consumes.