chenejac / VIVOTestMigration

VIVO-1671: Search index grows extremely large when many publications are related to organizations #1558

Closed chenejac closed 4 years ago

chenejac commented 5 years ago

Benjamin Gross (Migrated from VIVO-1671) said:

I'm running into an issue where the search index grows at an unwieldy rate and fills up my disk, even with modest numbers of triples in the database. I have finally tracked down the source of my woes. 

Imagine this situation (edit: I originally said this matches what VIVO produces, but I'm not sure that is the case for the triples below; it does apply whenever vivo:relatedBy is used for anything not explicitly defined with a faux property):

Publication1 vivo:relatedBy Organization

Publication2 vivo:relatedBy Organization

The search indexer populates the ALLTEXT fields with all the relevant data property values and object property labels for each entity. But apparently it goes one step further along the graph when adding labels for object properties. That is, given the pseudo-triples above, the ALLTEXT field for Publication1 will include its own label plus the labels of Organization and Publication2.
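To make the extra hop concrete, here is a minimal Python sketch of the behaviour described above. It is not VIVO's indexer code: the function name `alltext_labels`, the toy `TRIPLES` list, and the use of entity names as stand-ins for their labels are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: the two pseudo-triples from the post.
TRIPLES = [
    ("Publication1", "vivo:relatedBy", "Organization"),
    ("Publication2", "vivo:relatedBy", "Organization"),
]

def alltext_labels(entity, triples):
    """Collect labels as described in the post: the entity itself, its
    direct vivo:relatedBy neighbours, and everything else that is
    related to those neighbours (the extra hop)."""
    related_by = defaultdict(set)  # subject -> objects
    relates = defaultdict(set)     # object -> subjects (inverse direction)
    for s, p, o in triples:
        if p == "vivo:relatedBy":
            related_by[s].add(o)
            relates[o].add(s)

    labels = {entity}                    # the entity's own label
    for neighbour in related_by[entity]:
        labels.add(neighbour)            # first hop: the related node
        labels |= relates[neighbour]     # second hop: its other relations
    return sorted(labels)

print(alltext_labels("Publication1", TRIPLES))
# → ['Organization', 'Publication1', 'Publication2']
```

In this toy model, every publication sharing the same Organization ends up in every other publication's search document.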

Imagine a situation where Organization is related to thousands of publications. The search document for Publication1 will include the labels of thousands of publications. Now imagine Publication1 is related to multiple organizations, which are all related to thousands of publications. This is my situation. :(
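A rough back-of-the-envelope estimate of that growth, as a Python sketch. The function names and the example numbers are hypothetical, and the model assumes the organizations do not share publications, so it is an upper bound rather than a measurement of a real index.

```python
def labels_per_document(orgs_per_publication, publications_per_org):
    """Labels in one publication's search document: its own label, plus
    for each related organization the organization's label and the
    labels of that organization's other publications."""
    return 1 + orgs_per_publication * (1 + (publications_per_org - 1))

def total_labels_indexed(num_publications, orgs_per_publication, publications_per_org):
    """Labels stored across all publication documents (no deduplication)."""
    return num_publications * labels_per_document(
        orgs_per_publication, publications_per_org
    )

# The two-publication example from the post: 3 labels per document.
print(labels_per_document(1, 2))

# Hypothetical scale: 5,000 publications, each related to 3 organizations
# that each have 5,000 publications.
print(labels_per_document(3, 5000))
print(total_labels_indexed(5000, 3, 5000))
```

Under these assumptions the per-document size grows linearly with the number of publications per organization, so the index as a whole grows roughly quadratically.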

I get why this extra step in indexing makes sense in some situations, since VIVO uses context nodes to define important relationships (like authorship), but it certainly doesn't make sense in this situation and probably others like it. 

chenejac commented 5 years ago

Stefan Wolff said:

You could restrict the context node "extension_forContextNodes" in the file "searchIndexerConfigurationVivo.n3" to a specific type, e.g.:

:hasTypeRestriction "http://vivoweb.org/ontology/core#Relationship" ;

Or you could simply remove this context node configuration to prevent related labels from being indexed.

chenejac commented 5 years ago

Benjamin Gross said:

Thanks Stefan, I think this solves my problem. I will do a little more thinking about the implications of adding the restriction, then propose adding the line into the configuration that ships with VIVO.

chenejac commented 5 years ago

Andrew Woods said:

Any updates, Benjamin?

chenejac commented 5 years ago

Benjamin Gross said:

Pull request: https://github.com/vivo-project/VIVO/pull/117

chenejac commented 5 years ago

Andrew Woods said:

Pending resolution of a formatting comment in the pull request.

chenejac commented 5 years ago

Andrew Woods said:

Resolved with: https://github.com/vivo-project/VIVO/commit/ba2ee238ada199593236702b659dc645b8a180f0