DataONEorg / d1_cn_index_processor

The CN index processor component

Update EML Semantic Annotation indexing to include and expand property URIs #27

Closed amoeba closed 2 years ago

amoeba commented 3 years ago

EML Semantic Annotations are represented in EML using a structure like

<annotation>
  <propertyURI label="some label">http://example.com/some_uri</propertyURI>
  <valueURI label="some other label">http://example.com/some_other_uri</valueURI>
</annotation>

To provide search for the above metadata, we extract the character data from the valueURI element, parse it as an IRI, query the OntologyModelService for any parent classes of that IRI, and smush all these terms together in the sem_annotation field. We kept the indexing rules narrowly focused at first because we planned to use EML Semantic Annotations in a limited way. They're catching on within our teams and with external teams, and usage is now outstripping the implementation.

Over on https://github.com/NCEAS/metacatui/issues/1807, I'm breaking apart the popover widgets we show on dataset landing pages that contain EML Semantic Annotations into two separate popovers: one for the propertyURI and one for the valueURI. A key part of that widget is a link that searches for other datasets annotated with the term you're viewing. Because propertyURIs aren't being expanded and stored in the search index, searches for datasets annotated with a specific propertyURI don't work.

I propose we expand what we store in the sem_annotation field to cover both the valueURI and the propertyURI and, of course, any expanded terms (superclasses for valueURI and superproperties for propertyURI). I could see us developing a more structured indexing approach for EML Semantic Annotations, but I don't think we need it at this point, so I'm opting for the small change.
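To make the proposal concrete, here is a rough Python sketch of the expanded extraction. It is not the index processor's actual code. The TOY_ANCESTORS table is a hypothetical stand-in for the OntologyModelService lookup, the parent IRIs in it are made up, and the function names are illustrative only. It also assumes the annotation elements are not namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the OntologyModelService: maps an IRI to its
# direct parents (superclasses for a valueURI, superproperties for a
# propertyURI). The parent IRIs below are invented for illustration.
TOY_ANCESTORS = {
    "http://example.com/some_uri": ["http://example.com/parent_property"],
    "http://example.com/some_other_uri": ["http://example.com/parent_class"],
}


def expand(iri, ancestors=TOY_ANCESTORS):
    """Return the IRI followed by its transitive ancestors, without duplicates."""
    seen, queue = [], [iri]
    while queue:
        term = queue.pop(0)
        if term not in seen:
            seen.append(term)
            queue.extend(ancestors.get(term, []))
    return seen


def sem_annotation_terms(eml_fragment):
    """Collect the terms that would go into sem_annotation under the proposal."""
    root = ET.fromstring(eml_fragment)
    terms = []
    # iter() also matches the root element itself if it is an <annotation>.
    for annotation in root.iter("annotation"):
        # The proposed change: index propertyURI as well as valueURI.
        for tag in ("propertyURI", "valueURI"):
            element = annotation.find(tag)
            if element is not None and element.text:
                for term in expand(element.text.strip()):
                    if term not in terms:
                        terms.append(term)
    return terms


EXAMPLE = """<annotation>
  <propertyURI label="some label">http://example.com/some_uri</propertyURI>
  <valueURI label="some other label">http://example.com/some_other_uri</valueURI>
</annotation>"""

print(sem_annotation_terms(EXAMPLE))
```

With the toy table above, the resulting list holds the propertyURI, its made-up superproperty, the valueURI, and its made-up superclass, all in one flat field. That flat list is the "small change" option; a more structured approach would keep property and value terms in separate fields.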

This change will require re-indexing the ~200-300 EML docs that contain semantic annotations. The number might grow before re-indexing is complete.

amoeba commented 3 years ago

PR'd and merged onto develop.