DivePlus / diveplus

Placeholder repo
Apache License 2.0

Document relevance / Result limited to 500 hits #11

Open c-martinez opened 7 years ago

c-martinez commented 7 years ago

Liliana: Together with Carlos and Sabrina, we are evaluating the retrieval of entities and found that, because a search only gives back 500 results, it is important to add a way of filtering per entity type in the query box. Otherwise, all 500 results can be allocated to a single entity type, leaving the other entity types with 0 hits. Example: search for Amsterdam, and you get 500 events and 0 hits in the other types. A user may instead want to search for Amsterdam as a location (entity type) from the query box, in which case the results are more evenly distributed across the other entity types.
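To make the effect concrete, here is a minimal Python sketch of the truncation problem and of the requested per-type filter. It is only an illustration: the `search` function, the `entity_type` parameter and the toy data are hypothetical and do not reflect how the actual search backend works.

```python
from collections import Counter

LIMIT = 500  # result cap mentioned in this issue


def search(hits, term, entity_type=None, limit=LIMIT):
    """Return at most `limit` hits whose label contains `term`.

    `hits` is a list of (label, entity_type) tuples standing in for the
    real index; `entity_type` is the hypothetical per-type filter.
    """
    matches = [
        (label, etype)
        for label, etype in hits
        if term.lower() in label.lower()
        and (entity_type is None or etype == entity_type)
    ]
    # The cap is applied after filtering, so a type filter frees up room.
    return matches[:limit]


# Toy data: 600 events mention Amsterdam, plus two other entities.
data = [(f"Event {i} in Amsterdam", "event") for i in range(600)]
data += [("Amsterdam", "location"), ("Amsterdam Museum", "organisation")]

unfiltered = Counter(etype for _, etype in search(data, "amsterdam"))
filtered = Counter(etype for _, etype in search(data, "amsterdam", "location"))
```

Without the filter, the events fill the whole cap and the location never shows up; with the filter, the location is returned.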

c-martinez commented 7 years ago

There is a trade-off between the number of results and the size of the data returned; the search query is currently limited to 500 results.

We (together with @biktorrr) have discussed this and concluded that we need an external document relevance ranking mechanism (e.g. TF-IDF). This is not directly possible in SPARQL, but it may be an option to do this using ElasticSearch via the SPARQL SERVICE directive, as done in this example.
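For reference, a minimal sketch of what TF-IDF ranking does, in plain Python. This is not what ElasticSearch or Solr compute internally (they use their own scoring variants); the `tfidf_rank` function and the whitespace tokenisation are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_rank(query, docs):
    """Rank document ids by the summed TF-IDF of the query terms.

    `docs` maps a document id to its text. Tokenisation is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for words in tokens.values() for term in set(words))

    scores = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if term in counts:
                tf = counts[term] / len(words)      # term frequency
                idf = math.log(n_docs / df[term])   # inverse document frequency
                score += tf * idf
        scores[doc_id] = score
    # Highest score first.
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "amsterdam canal tour amsterdam",
    "b": "amsterdam museum",
    "c": "rotterdam harbour",
}
ranking = tfidf_rank("amsterdam canal", docs)
```

With a ranking like this, the 500-result cap would cut off the least relevant documents rather than an arbitrary subset.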

c-martinez commented 7 years ago

Also relevant to this discussion: https://docs.google.com/document/d/1GSd9ym_50jCX_nSpQ4HWD_-WGb8EKqAPw-llIwwdoF8/edit#

c-martinez commented 7 years ago

Possibly also an interesting reference for this issue: http://dl.acm.org/citation.cfm?id=2591350

c-martinez commented 7 years ago

Also of interest: https://blog.blazegraph.com/?tag=solr (using Solr and Blazegraph, instead of ES and ClioPatria)

c-martinez commented 7 years ago

Proposed solution:

With this setup, a query like this can be run:

SELECT * WHERE
{
  SERVICE <http://blazegraph-server:9999/blazegraph/namespace/kb/sparql>
  {
    ?kwDoc <http://www.bigdata.com/rdf/fts#search> "x" .
    ?kwDoc <http://www.bigdata.com/rdf/fts#endpoint> "http://solr-server:8983/solr/db/select" .
    ?kwDoc <http://www.bigdata.com/rdf/fts#params> "fl=id,score,name" .
    ?kwDoc <http://www.bigdata.com/rdf/fts#scoreField> "score" .
    ?kwDoc <http://www.bigdata.com/rdf/fts#score> ?score .
    ?kwDoc <http://www.bigdata.com/rdf/fts#snippetField> "name" .
    ?kwDoc <http://www.bigdata.com/rdf/fts#snippet> ?name .
  }
}

This will return the names, IDs and scores of the matching documents.
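A small Python sketch of how the query above could be built for an arbitrary search term. The `build_fts_query` helper is hypothetical and only produces the query string; it does not send it anywhere. The server URLs are the ones from the query above.

```python
FTS = "http://www.bigdata.com/rdf/fts#"


def build_fts_query(term, blazegraph_url, solr_url, fields="id,score,name"):
    """Return the federated SPARQL query from this comment for `term`."""
    # Escape backslashes and double quotes so the term stays inside
    # the SPARQL string literal.
    safe = term.replace("\\", "\\\\").replace('"', '\\"')
    return f"""SELECT * WHERE
{{
  SERVICE <{blazegraph_url}>
  {{
    ?kwDoc <{FTS}search> "{safe}" .
    ?kwDoc <{FTS}endpoint> "{solr_url}" .
    ?kwDoc <{FTS}params> "fl={fields}" .
    ?kwDoc <{FTS}scoreField> "score" .
    ?kwDoc <{FTS}score> ?score .
    ?kwDoc <{FTS}snippetField> "name" .
    ?kwDoc <{FTS}snippet> ?name .
  }}
}}"""


query = build_fts_query(
    "Amsterdam",
    "http://blazegraph-server:9999/blazegraph/namespace/kb/sparql",
    "http://solr-server:8983/solr/db/select",
)
```

The resulting string can then be submitted to whichever SPARQL endpoint issues the federated query.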