KnowledgeLinks / rdfframework

RdfFramework Python Package
MIT License

bf_InstanceOf vs bf_hasInstance and bf_itemOf vs bf_hasItem #15

Closed mstabile75 closed 6 years ago

mstabile75 commented 6 years ago

Unless there is a hard reason not to, it is better for indexing to use the "has" versions of these properties, since in an ideal index scenario a Work flows down to its Instances and then to their Items within one Elasticsearch document.

Jeremy, can you provide feedback on the challenges of switching these owl:inverseOf usages?

jermnelson commented 6 years ago

The biggest challenges of reversing the relationships between BF Works, Instances, and Items for Elasticsearch have more to do with the incoming data from LOC marc2bibframe, which links the most granular entity class to its corresponding more abstract entity (i.e. BF Item itemOf BF Instance), and with changing the current RML mappings for LOC BF to Lean BF. Also, I wonder whether, in a multiple-institution situation (i.e. the Alliance BIBCAT Goldrush project) where some BF Works may have hundreds of Instances and corresponding Items, storing them all in a single document may have performance implications versus smaller ES documents under a different indexing strategy. This may not matter in the short run, but we should do some stress testing on sample ES docs with hundreds of Instances and Items.
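As a rough starting point for that kind of stress test, the sketch below builds a synthetic nested Work document and reports its serialized size. The field names (`uri`, `bf:hasInstance`, `bf:hasItem`), the `ex:` URIs, and both helper functions are illustrative assumptions, not the actual rdfframework document layout. Real documents would carry many more properties per entity.

```python
import json


def make_work_doc(n_instances, items_per_instance):
    """Build a synthetic nested Work document (hypothetical layout)."""
    return {
        "uri": "ex:work1",
        "bf:hasInstance": [
            {
                "uri": f"ex:instance{i}",
                "bf:hasItem": [
                    {"uri": f"ex:item{i}-{j}"}
                    for j in range(items_per_instance)
                ],
            }
            for i in range(n_instances)
        ],
    }


def doc_size_bytes(doc):
    """Return the size of the document once serialized as UTF-8 JSON."""
    return len(json.dumps(doc).encode("utf-8"))


if __name__ == "__main__":
    # Compare a small Work with one holding hundreds of Instances.
    for n in (3, 300):
        size = doc_size_bytes(make_work_doc(n, items_per_instance=2))
        print(f"{n} instances -> {size} bytes")
```

Serialized size is only a proxy. Indexing and query latency against a real Elasticsearch cluster would still need to be measured separately.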

mstabile75 commented 6 years ago

Fixed in commit 5323221: es_json can now handle owl:inverseOf relationships. The handling must be configured in the definitions file, like this:

bf:hasInstance kds:rangeDef [ kds:appliesToClass kdr:AllClasses ; kds:esLookup owl:inverseOf ] .

bf:instanceOf kds:rangeDef [ kds:appliesToClass kdr:AllClasses ; kds:esIndexType es:Ignored ] .

The es:Ignored and kds:esLookup settings are used in conjunction to avoid recursive nesting of the inverse properties.
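To illustrate why the two settings work together, here is a minimal sketch of the idea, not the es_json implementation. It assumes triples are plain `(subject, predicate, object)` tuples; `INVERSE_LOOKUP`, `IGNORED`, and `build_es_doc` are hypothetical stand-ins for what the definitions above express, and the `bf:hasItem`/`bf:itemOf` pair is assumed to be configured the same way. The "has" property is filled by looking up the inverse triples, while the inverse property itself is skipped so the parent is never nested back inside its child.

```python
from collections import defaultdict

# Hypothetical stand-in for kds:esLookup owl:inverseOf:
# a "has" property is resolved by following its inverse.
INVERSE_LOOKUP = {
    "bf:hasInstance": "bf:instanceOf",
    "bf:hasItem": "bf:itemOf",
}
# Hypothetical stand-in for kds:esIndexType es:Ignored.
IGNORED = {"bf:instanceOf", "bf:itemOf"}


def build_es_doc(subject, triples):
    """Nest a subject's data, resolving 'has' properties via inverses."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))
        by_object[(p, o)].append(s)

    def nest(node):
        doc = {"uri": node}
        for p, o in by_subject[node]:
            if p in IGNORED:
                # Skipping the inverse avoids child -> parent -> child loops.
                continue
            doc.setdefault(p, []).append(o)
        for has_prop, inverse in INVERSE_LOOKUP.items():
            children = by_object.get((inverse, node), [])
            if children:
                doc[has_prop] = [nest(child) for child in sorted(children)]
        return doc

    return nest(subject)


if __name__ == "__main__":
    sample = [
        ("ex:instance1", "bf:instanceOf", "ex:work1"),
        ("ex:item1", "bf:itemOf", "ex:instance1"),
        ("ex:work1", "bf:title", "Example"),
    ]
    print(build_es_doc("ex:work1", sample))
```

The incoming data can therefore keep the `itemOf`/`instanceOf` direction produced upstream, and only the index view is inverted.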

jermnelson commented 6 years ago

Where would these triples reside? I'm worried we're diverging too much from the RML spec by introducing our own vocabulary triples to the RML map.

mstabile75 commented 6 years ago

These currently reside in the bibcat/rdfw-definitions/bc_core_links.ttl file. They are not connected to RML in any way. I can't use RML for the Elasticsearch conversion, since the conversion process needs to be tightly woven with the core RDF vocabularies. The Elasticsearch conversion makes assumptions based on the core vocabularies. When those assumptions fail, as in this case, an override option can be added to the active_defs triplestore. Envision the Elasticsearch index as an 'as close as possible' representation of the data in the triplestore; interaction between the two should be transparent outside of the core system. The RML processor conversions should be a translation between external and core (i.e. knowledge links bibcat) datasets.

Where RML and Elasticsearch will intersect is caching. Example:

Caching process:

  1. Query the triplestore for an item and its associated data.
  2. Load the data into an RdfDataset.
  3. Embedded in the dataset is the mapping and conversion to an rdfframework Elasticsearch document.
  4. Run any RML processors against the dataset for caching purposes and add those text dumps to the Elasticsearch document. With the json_qry options in the RML, we should not need to requery the triplestore at this point.
  5. Post the document to Elasticsearch.
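The caching steps could be wired together roughly as follows. This is a sketch under stated assumptions, not rdfframework code: `cache_item` and every callable passed to it are hypothetical, a plain list stands in for the RdfDataset, and the `cached` key for the RML text dumps is an invented name. The point it shows is that the triplestore is queried once and every later step works from the in-memory dataset.

```python
def cache_item(item_uri, query_triplestore, to_es_doc, rml_processors, post_to_es):
    """Run the five caching steps for one item (all callables are hypothetical)."""
    # 1. Query the triplestore for the item and its associated data.
    triples = query_triplestore(item_uri)
    # 2. Load the data into a dataset (a list stands in for RdfDataset).
    dataset = list(triples)
    # 3. Convert the dataset to an Elasticsearch document.
    doc = to_es_doc(item_uri, dataset)
    # 4. Run each RML processor against the in-memory dataset,
    #    so the triplestore is not queried again.
    doc["cached"] = {name: proc(dataset) for name, proc in rml_processors.items()}
    # 5. Post the document to Elasticsearch.
    post_to_es(item_uri, doc)
    return doc


if __name__ == "__main__":
    # Stub collaborators so the sketch runs without a triplestore or cluster.
    def fake_query(uri):
        return [(uri, "bf:title", "Example")]

    def fake_convert(uri, dataset):
        return {"uri": uri, "triple_count": len(dataset)}

    def fake_post(uri, doc):
        print("would post", uri, doc)

    processors = {"titles": lambda ds: ", ".join(o for _, _, o in ds)}
    cache_item("ex:item1", fake_query, fake_convert, processors, fake_post)
```

Passing the collaborators in as arguments keeps the sketch testable without a live triplestore or Elasticsearch instance.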