DataONEorg / dataone-indexer

DataONE Indexer subsystem
Apache License 2.0

Revisit annotation indexing rules and consider covering sameAs, equivalentClass, equivalentProperty, others? #3

Open amoeba opened 2 years ago

amoeba commented 2 years ago

@mbjones asked on Slack whether we index URIs for terms that are sameAs'd. I answered no but maybe we actually do and I just haven't seen it happen because we don't use sameAs much in our ontologies.

In our use of semantic search so far, we haven't made much use of alignment axioms (sameAs, equivalentClass, equivalentProperty) and have so far only used subClassOf (to materialize parent classes). I think the MOSAiC ontology is the first ontology we've enabled in search that uses things like NamedIndividuals and sameAs in a way we care about and I think our search isn't quite as good for MOSAiC as it is for others (like ECSO).

I think we could make a few changes to improve that and improve things for the future.

Change one: Include the class of any value URIs

For example, for the MOSAiC datasets we have in the Arctic Data Center, we inserted dataset-level annotations like

[Screenshot, 2021-10-04: dataset-level annotations on a MOSAiC dataset]

Each of those annotations is to a NamedIndividual, rather than to a Class as we've been doing. How do we drive good searches for these? The only way right now is to search exactly for the term. But PS122/2, for example, is of class "_MOSAiC Specific Term" and "Campaign". I think it'd be nice if a person could search for PS122/2 directly but also by either of its classes. This same reasoning applies to the other two annotations in the screenshot above.

Change two: Include sameAs and equivalentClass/equivalentProperty relationships for terms

As another example, MOSAiC has a set of Research Location named individuals, like "Arctic Ocean". We didn't align this term, but if we did in the future (say to http://purl.obolibrary.org/obo/GAZ_00000323), it'd be best if a search for either MOSAiC's term or GAZ's term returned documents annotated either way. This would benefit any future alignment work we do and improve searches.

Summary

Both changes, in addition to our current indexing rules, would be additive. That is, any search that works now also works with these changes. The current logic is:

With the changes above, we'd get:

I think this can be done by adjusting the SPARQL query we already use and I don't expect we'll have any performance issues or Solr index growth issues.

amoeba commented 2 years ago

This is going well and I'm nearly done. The query we run against every valueURI in every semantic annotation is now:

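PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?annotation_value_uri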
WHERE
{
  {
    <$CONCEPT_URI> rdfs:subClassOf* ?annotation_value_uri .
  }
  UNION
  {
    <$CONCEPT_URI> rdf:type/rdfs:subClassOf* ?annotation_value_uri .
  }
  UNION
  {
    <$CONCEPT_URI> owl:sameAs/rdfs:subClassOf* ?annotation_value_uri .
  }
  UNION
  {
    <$CONCEPT_URI> owl:equivalentClass/rdfs:subClassOf* ?annotation_value_uri .
  }
}

This means we get some cool stuff we didn't have before. For example, if a person searches for datasets annotated with the NamedIndividual from ARCRC for "Snow Depth" (a key variable, shown in orange), they get all of this back, and a search for any of these terms returns the dataset with this annotation:

[Image: terms returned for the ARCRC "Snow Depth" NamedIndividual]

Before, we would've returned the dataset only if we searched directly for the NamedIndividual, because we weren't expanding NamedIndividuals at all.
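To make the four UNION branches of the query concrete, here is a rough Python sketch that mimics them over a tiny in-memory triple list. It is not the indexer's code: the `ex:`, `gaz:`, and `other:` terms are made up for illustration, and `build_index`, `subclass_closure`, and `expand` are hypothetical helper names.

```python
from collections import defaultdict

SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"
SAME_AS = "owl:sameAs"
EQUIVALENT_CLASS = "owl:equivalentClass"

# Hypothetical toy triples, not taken from any real ontology.
TRIPLES = [
    ("ex:PS122_2", TYPE, "ex:Campaign"),
    ("ex:Campaign", SUBCLASS_OF, "ex:ProjectTerm"),
    ("ex:ArcticOcean", SAME_AS, "gaz:ArcticOcean"),
    ("ex:SeaIce", EQUIVALENT_CLASS, "other:SeaIce"),
    ("other:SeaIce", SUBCLASS_OF, "other:Ice"),
]


def build_index(triples):
    """Map (subject, predicate) to the set of objects."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].add(obj)
    return index


def subclass_closure(index, start):
    """Mimic rdfs:subClassOf*: the start term plus all of its ancestors."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in index.get((node, SUBCLASS_OF), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand(index, concept_uri):
    """Union of the four branches of the query for one value URI."""
    # Branch 1: the concept itself and its parent classes.
    results = set(subclass_closure(index, concept_uri))
    # Branches 2-4: one hop over rdf:type, owl:sameAs, or
    # owl:equivalentClass, followed by the subclass closure of the target.
    for predicate in (TYPE, SAME_AS, EQUIVALENT_CLASS):
        for target in index.get((concept_uri, predicate), ()):
            results |= subclass_closure(index, target)
    return results


INDEX = build_index(TRIPLES)
print(sorted(expand(INDEX, "ex:PS122_2")))
```

In this toy data, the NamedIndividual `ex:PS122_2` expands to itself, its class, and that class's parent, which is the "search by either of its classes" behaviour described in Change one.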

The last thing I want to do is get n-degree sameAs working. For example, if A sameAs B and B sameAs C, we want searches for any of A, B, or C to return A, B, and C.
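One way to picture n-degree sameAs is as a connected-component lookup: treat each sameAs statement as an undirected edge and collect everything reachable from the starting term. The sketch below does that with a breadth-first search. The pair list and the `same_as_cluster` name are hypothetical, and this is only an illustration of the intended behaviour, not necessarily how the indexer implements it (in SPARQL, a property path along the lines of `(owl:sameAs|^owl:sameAs)*` could play a similar role).

```python
from collections import defaultdict, deque

# Hypothetical sameAs statements: A sameAs B, B sameAs C, X sameAs Y.
SAME_AS_PAIRS = [("A", "B"), ("B", "C"), ("X", "Y")]


def same_as_cluster(pairs, start):
    """Return every term reachable from start via sameAs in either direction."""
    neighbours = defaultdict(set)
    for left, right in pairs:
        # sameAs is symmetric, so record the edge both ways.
        neighbours[left].add(right)
        neighbours[right].add(left)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for other in neighbours.get(node, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


for term in ("A", "B", "C"):
    print(term, sorted(same_as_cluster(SAME_AS_PAIRS, term)))
```

Each of A, B, and C yields the same three-term cluster, while X and Y stay in their own cluster.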

mbjones commented 2 years ago

This is excellent, @amoeba. Thanks!

Did you and @mpsaloha discuss skos:exactMatch and skos:closeMatch as candidates for this type of alignment as well? We can always add them later, but if you're making changes and reindexing things, then it might make sense. Details from SKOS:

The property skos:closeMatch is used to link two concepts that are sufficiently similar that they can be used interchangeably in some information retrieval applications. In order to avoid the possibility of "compound errors" when combining mappings across more than two concept schemes, skos:closeMatch is not declared to be a transitive property.

The property skos:exactMatch is used to link two concepts, indicating a high degree of confidence that the concepts can be used interchangeably across a wide range of information retrieval applications. skos:exactMatch is a transitive property, and is a sub-property of skos:closeMatch.

amoeba commented 2 years ago

Thanks @mbjones, we hadn't. I'll run it by him, but now that you mention it, it seems like we oughta look at adding it in now. I don't think it'll have any practical impact at the moment, because I don't think the ontologies we do query expansion on use those terms, but it's probably better to put the change in now to save ourselves some time.

amoeba commented 2 years ago

Added support for skos:exactMatch and skos:closeMatch in https://github.com/DataONEorg/d1_cn_index_processor/commit/24abd6cb775e509b9ec00964460083d59ad36828.

For skos:exactMatch, we treat it as transitive and symmetric. For skos:closeMatch, we don't treat it as transitive but do treat it as symmetric. This was based on a suggestion from @mpsaloha; the logic makes sense and matches the information @mbjones quoted above.
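The difference between the two rules can be sketched in a few lines of Python. This is an illustration only, not the code from the linked commit: the term names and the `symmetric` and `expand_matches` helpers are hypothetical, and the choice to apply closeMatch only from the starting term (rather than from every exactMatch'd term too) is an assumption made to keep the example small.

```python
from collections import defaultdict

# Hypothetical mappings between made-up terms.
EXACT_MATCH = [("A", "B"), ("B", "C")]
CLOSE_MATCH = [("D", "A"), ("D", "E")]


def symmetric(pairs):
    """Build an adjacency map that records each pair in both directions."""
    neighbours = defaultdict(set)
    for left, right in pairs:
        neighbours[left].add(right)
        neighbours[right].add(left)
    return neighbours


def expand_matches(exact_pairs, close_pairs, start):
    exact = symmetric(exact_pairs)
    close = symmetric(close_pairs)

    # exactMatch: symmetric and transitive, so follow it to any depth.
    results = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for other in exact.get(node, ()):
            if other not in results:
                results.add(other)
                stack.append(other)

    # closeMatch: symmetric but not transitive, so take a single hop
    # from the starting term only (an assumption of this sketch).
    results |= close.get(start, set())
    return results


print(sorted(expand_matches(EXACT_MATCH, CLOSE_MATCH, "A")))
print(sorted(expand_matches(EXACT_MATCH, CLOSE_MATCH, "D")))
```

Starting from A, the exactMatch chain pulls in B and C and the single closeMatch hop pulls in D, but not E, because E is only reachable through a second closeMatch hop.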

mbjones commented 2 years ago

LGTM.