ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

Text Index. All text queries fail on tiny test data. #1404

Open aindlq opened 1 month ago

aindlq commented 1 month ago

I've been trying to better understand how the current text index works using very tiny test data, but all queries fail when the data is too simple.

query:

SELECT ?subject ?text WHERE {
  ?text <http://qlever.cs.uni-freiburg.de/builtin-functions/contains-entity> ?subject .
  ?text <http://qlever.cs.uni-freiburg.de/builtin-functions/contains-word> "madon*" .
  ?subject a <http://www.cidoc-crm.org/cidoc-crm/E53_Place> .
}

error message:

Assertion nofBytes > 0 failed. Please report this to the developers. In file "/app/src/index/IndexImpl.Text.cpp " at line 895

test.wordsfile.tsv:

<http://example.com/thuringen>  1   1   1
madonna 0   1   1
suffragio   0   1   1
heiligen    0   1   1
franziskus  0   1   1
klara   0   1   1
ludwig  0   1   1
frankreich  0   1   1
elisabeth   0   1   1

test.docsfile.tsv:

1   Madonna del Suffragio mit den heiligen Franziskus und Klara, Ludwig von Frankreich und Elisabeth
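For anyone who wants to regenerate these two files, here is a minimal Python sketch that reproduces the wordsfile rows above from the docsfile line. It assumes the four tab-separated columns are word, entity flag, record id, and score, and that words shorter than four characters are left out (which matches the listing above, where "del", "mit", "den", "und" and "von" are absent). The function names and the `min_len` parameter are made up for illustration and are not part of QLever.

```python
import re


def wordsfile_rows(record_id, text, entities=(), min_len=4, score=1):
    """Build (word, is_entity, record_id, score) rows for one text record.

    Assumption: entities come first with is_entity=1, followed by the
    lowercased words of the text that have at least min_len characters.
    """
    rows = [(entity, 1, record_id, score) for entity in entities]
    for token in re.findall(r"\w+", text.lower()):
        if len(token) >= min_len:
            rows.append((token, 0, record_id, score))
    return rows


def to_tsv(rows):
    """Join the rows into tab-separated lines."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)


doc = ("Madonna del Suffragio mit den heiligen Franziskus und Klara, "
       "Ludwig von Frankreich und Elisabeth")
rows = wordsfile_rows(1, doc, entities=["<http://example.com/thuringen>"])
print(to_tsv(rows))
```

The printed lines can be saved as test.wordsfile.tsv; the docsfile is simply the record id, a tab, and the original text.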

test.nt:

<http://example.com/thuringen> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E53_Place> .

azaroth42 commented 1 month ago

I get this error also:

PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX lux: <https://lux.collections.yale.edu/ns/>
PREFIX ql: <http://qlever.cs.uni-freiburg.de/builtin-functions/>
SELECT DISTINCT ?what ?txt ?atxt WHERE {
  ?what a crm:E22_Human-Made_Object ; lux:primaryName ?txt ; lux:agentOfProduction ?artist .
  ?artist lux:primaryName ?atxt .
  ?tt ql:contains-entity ?atxt ; ql:contains-word "van gogh" .
  ?t ql:contains-entity ?txt ; ql:contains-word "nuit" .
}

Error: Assertion nofBytes > 0 failed. Please report this to the developers. In file "/app/src/index/IndexImpl.Text.cpp " at line 848

But if I change "van" to "vincent", it works as expected.

Update: For me, any token with three or fewer characters causes the exception, so "de nuit" is bad but "night nuit" is fine.
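As a stopgap, a search string can be checked against this observation before it is sent. The Python sketch below only encodes the "three or fewer characters" pattern described above; it is not QLever's actual rule (the "madon*" query in the first comment fails with a longer prefix), and the helper name and `max_len` threshold are hypothetical.

```python
def short_tokens(search_string, max_len=3):
    """Return the whitespace-separated tokens with max_len or fewer characters.

    A trailing '*' (prefix search) is not counted towards the length.
    """
    return [
        token
        for token in search_string.split()
        if len(token.rstrip("*")) <= max_len
    ]


for query in ["van gogh", "de nuit", "night nuit"]:
    flagged = short_tokens(query)
    status = "likely to fail" if flagged else "ok"
    print(f"{query!r}: {status} {flagged}")
```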

joka921 commented 1 month ago

Thanks for reporting all the issues with the Text Index.

It is one of the features that hasn't been under very active development in recent years. It is good that you show interest in this feature so we can prioritize it. I have a plan for a complete rewrite of the text index, which should mitigate most of the current limitations, but I think the support for Named Graphs (which people are also asking for) has some priority.