cygri / vocidex

Search over RDF schemas and OWL ontologies
MIT License

Use LOV statistics for ranking #1

Open cygri opened 10 years ago

cygri commented 10 years ago

LOV contains some statistics on vocabulary usage that should be excellent input for improving the ranking of search results.

cygri commented 10 years ago

There are three parts to this:

  1. How to get the numbers from the LOV dump
  2. How to get them stored in the index
  3. How to configure the index or queries to make use of the information

This comment focuses on the second point.

How to include LOV numbers in the indexed documents

The key is the LOVWrapper class, a wrapper around a VocabularyTermExtractor. The VocabularyTermExtractor iterates over the class and property descriptions extracted from an RDFS/OWL Model, and the LOVWrapper modifies these descriptions with LOV-specific information. For example, it adds a “vocabulary” field to the JSON with information about the vocabulary that defines the term. Scoring information could be added in the same place. The best way to do that is probably:

  1. Add a new Describer (similar to TermDescriber) that adds scores for a given class/property. Perhaps call it TermLOVScoreDescriber or something similar.
  2. Instantiate that Describer in the LOVWrapper constructor, and invoke it in modifyDocument().
  3. To instantiate the Describer, you will need to pass the SPARQLRunner from LOVExtractor to LOVWrapper so that the Describer has access to the full LOV dataset including the scoring information in named graphs.
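
The three steps above could be sketched roughly as follows. This is only an illustration of the wiring, not the actual vocidex code: the real Describer, LOVWrapper and SPARQLRunner signatures are not shown in this thread, so the sketch uses hypothetical stand-ins (LovStatsSource in place of SPARQLRunner, DocumentDescriber in place of Describer, a plain Map in place of the JSON document). The "uri" and "lovOccurrences" field names and the sample count are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Entry point declared first so that `java LovScoreSketch.java` runs the demo.
class LovScoreSketch {
    public static void main(String[] args) {
        // Stand-in for a SPARQL lookup against the LOV dump.
        // The count below is an invented number, not a real LOV statistic.
        Map<String, Long> counts = Map.of("http://xmlns.com/foaf/0.1/Person", 1200L);
        LovStatsSource stats = uri -> Optional.ofNullable(counts.get(uri));

        LovWrapperSketch wrapper = new LovWrapperSketch(stats);

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("uri", "http://xmlns.com/foaf/0.1/Person"); // assumed field name
        wrapper.modifyDocument(doc);
        System.out.println(doc);
    }
}

/** Hypothetical stand-in for the part of SPARQLRunner the describer needs. */
interface LovStatsSource {
    Optional<Long> occurrences(String termUri);
}

/** Hypothetical simplification of a Describer: adds fields to a document. */
interface DocumentDescriber {
    void describe(String termUri, Map<String, Object> document);
}

/** Step 1: a describer that adds LOV usage numbers for a class/property. */
class TermLOVScoreDescriber implements DocumentDescriber {
    private final LovStatsSource stats;

    TermLOVScoreDescriber(LovStatsSource stats) {
        this.stats = stats;
    }

    @Override
    public void describe(String termUri, Map<String, Object> document) {
        // Leave the document untouched when LOV has no numbers for the term.
        stats.occurrences(termUri)
             .ifPresent(n -> document.put("lovOccurrences", n)); // assumed field name
    }
}

/** Steps 2 and 3: the wrapper receives the data source and invokes the describer. */
class LovWrapperSketch {
    private final DocumentDescriber scoreDescriber;

    LovWrapperSketch(LovStatsSource stats) {
        // Step 3: the data source is passed in from the caller (LOVExtractor in the issue).
        this.scoreDescriber = new TermLOVScoreDescriber(stats);
    }

    void modifyDocument(Map<String, Object> document) {
        // Step 2: invoke the describer for each term document.
        Object uri = document.get("uri");
        if (uri instanceof String) {
            scoreDescriber.describe((String) uri, document);
        }
    }
}
```

Keeping the score lookup behind a small interface means the describer can be exercised with a fixed table, as in the demo, before the real SPARQL queries against the LOV named graphs are written.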