Text Index. All text queries fail on a tiny test data.

aindlq commented 1 month ago

I've been trying to better understand how the current text index works by using a very small test dataset, but all queries fail when the data is too simple.

query:

SELECT ?subject ?text WHERE {
  ?text <http://qlever.cs.uni-freiburg.de/builtin-functions/contains-entity> ?subject .
  ?text <http://qlever.cs.uni-freiburg.de/builtin-functions/contains-word> "madon*" .
  ?subject a <http://www.cidoc-crm.org/cidoc-crm/E53_Place> .
}

error message:

Assertion nofBytes > 0 failed. Please report this to the developers. In file "/app/src/index/IndexImpl.Text.cpp " at line 895

test.wordsfile.tsv:

<http://example.com/thuringen>  1   1   1
madonna 0   1   1
suffragio   0   1   1
heiligen    0   1   1
franziskus  0   1   1
klara   0   1   1
ludwig  0   1   1
frankreich  0   1   1
elisabeth   0   1   1

test.docsfile.tsv:

1   Madonna del Suffragio mit den heiligen Franziskus und Klara, Ludwig von Frankreich und Elisabeth

test.nt:

<http://example.com/thuringen> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E53_Place> .
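For anyone trying to reproduce this, the wordsfile rows above can be regenerated from the docsfile line with a small script. This is only a sketch: it assumes the four tab-separated columns are word, entity flag (1 for an entity, 0 for a plain word), record id, and score, as the sample suggests. The function name `wordsfile_rows` and the stop-word list are hypothetical, chosen so that the output matches the sample.

```python
import re

# Hypothetical stop-word list, picked only so the output matches the sample above.
STOP_WORDS = {"del", "mit", "den", "und", "von"}


def wordsfile_rows(record_id, text, entities=(), score=1):
    """Build tab-separated wordsfile lines for one text record.

    Assumed column order: word, entity flag, record id, score.
    """
    # Entities come first, flagged with 1.
    rows = [f"{entity}\t1\t{record_id}\t{score}" for entity in entities]
    # Plain words are lowercased, stop words are dropped, and each is flagged with 0.
    for token in re.findall(r"\w+", text.lower()):
        if token not in STOP_WORDS:
            rows.append(f"{token}\t0\t{record_id}\t{score}")
    return rows


if __name__ == "__main__":
    doc = ("Madonna del Suffragio mit den heiligen Franziskus und Klara, "
           "Ludwig von Frankreich und Elisabeth")
    for row in wordsfile_rows(1, doc, ["<http://example.com/thuringen>"]):
        print(row)
```

Writing the printed lines to test.wordsfile.tsv should give the same nine rows as the file shown above.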

azaroth42 commented 1 month ago

I get this error also:

PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX lux: <https://lux.collections.yale.edu/ns/>
SELECT DISTINCT ?what ?txt ?atxt WHERE {
  ?what a crm:E22_Human-Made_Object ; lux:primaryName ?txt ; lux:agentOfProduction ?artist .
  ?artist lux:primaryName ?atxt .
  ?tt ql:contains-entity ?atxt ; ql:contains-word "van gogh" .
  ?t ql:contains-entity ?txt ; ql:contains-word "nuit" .
}

Error: Assertion nofBytes > 0 failed. Please report this to the developers. In file "/app/src/index/IndexImpl.Text.cpp " at line 848

But if I change "van" to "vincent", it works as expected.

Update: For me, any token with three or fewer characters causes the exception, so "de nuit" fails but "night nuit" is fine.
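Until this is fixed, a search string can be checked against that pattern before the query is sent. The sketch below only encodes the observation from this comment; it is not confirmed QLever behavior. `short_tokens` and `max_len` are hypothetical names, and ignoring a trailing `*` when measuring length is an assumption.

```python
def short_tokens(search, max_len=3):
    """Return the whitespace-separated tokens of a contains-word string
    whose length is max_len or less. A trailing '*' (prefix search) is
    not counted towards the length; that choice is an assumption."""
    return [tok for tok in search.split() if len(tok.rstrip("*")) <= max_len]


if __name__ == "__main__":
    for search in ("van gogh", "de nuit", "night nuit"):
        print(search, "->", short_tokens(search))
```

An empty result means the string contains none of the short tokens that triggered the assertion in my tests.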

joka921 commented 1 month ago

Thanks for reporting all the issues with the Text Index.

It is one of the features that hasn't been under very active development in recent years. It is good that you are showing interest in this feature so that we can prioritize it. I have a plan for a complete rewrite of the text index, which should remove most of the current limitations, but I think the support for Named Graphs (which people are also asking for) takes priority for now.

ad-freiburg / qlever

Text Index. All text queries fail on a tiny test data. #1404