ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0
354 stars, 42 forks

Text Index. Issue with umlauts - ä, ö, ü #1399

Closed aindlq closed 2 weeks ago

aindlq commented 1 month ago

The following query fails:

SELECT * WHERE {
  ?text <http://qlever.cs.uni-freiburg.de/builtin-functions/contains-entity> ?subject;
    <http://qlever.cs.uni-freiburg.de/builtin-functions/contains-word> "thür*".
  ?subject a <http://www.cidoc-crm.org/cidoc-crm/E53_Place>
}

with:

Assertion ctre::match<"[$?][\\w]+">(_name) failed. Please report this to the developers. In file "/app/src/parser/data/Variable.cpp " at line 19

test.nt

<http://example.com/thuringen> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E53_Place> .

test.wordsfile.tsv

thüringen       0       1   1
<http://example.com/thuringen>       0       1   1

test.docsfile.tsv

1   Madonna del Suffragio mit den heiligen Franziskus und Klara, Ludwig von Frankreich und Elisabeth von Thüringen

Qup42 commented 1 month ago

A regex is used to check the validity of the variable's name. The underlying problem here is that the regex only matches ASCII characters for \w. Variable names can consist of more characters, including (but not limited to) umlauts. The regex here is stricter than the SPARQL grammar.

?text ql:contains-word "thür*" creates a new variable ?ql_matchingword_text_thür behind the scenes. That name contains ü, so it fails the regex check and triggers the assertion.
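
To illustrate the failure mode, here is a minimal sketch of an ASCII-only check that behaves like the pattern `[$?][\w]+` when `\w` means `[A-Za-z0-9_]`. It is a hand-written stand-in, not QLever's code and not ctre; the function names are hypothetical. The umlaut is written as the UTF-8 bytes `\xC3\xBC` so the example does not depend on the source file encoding.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: true for the bytes that an ASCII-only `\w` accepts.
bool isAsciiWordByte(unsigned char c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
         (c >= '0' && c <= '9') || c == '_';
}

// Hypothetical stand-in for matching `[$?][\w]+` against the whole name,
// assuming `\w` only covers ASCII letters, digits, and '_'.
bool matchesAsciiOnlyVariablePattern(std::string_view name) {
  if (name.size() < 2 || (name[0] != '?' && name[0] != '$')) {
    return false;
  }
  for (char c : name.substr(1)) {
    if (!isAsciiWordByte(static_cast<unsigned char>(c))) {
      return false;
    }
  }
  return true;
}

// Usage sketch: the plain ASCII name passes, the name with "ü" does not,
// because both UTF-8 bytes of "ü" (0xC3, 0xBC) lie outside the ASCII range.
inline void demoAsciiOnlyCheck() {
  assert(matchesAsciiOnlyVariablePattern("?ql_matchingword_text_thur"));
  assert(!matchesAsciiOnlyVariablePattern("?ql_matchingword_text_th\xC3\xBCr"));
}
```

Any check of this shape rejects every multi-byte UTF-8 character, which is why the generated variable name trips the assertion even though the user never wrote that variable.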

aindlq commented 1 month ago

Why not use an index number for ?ql_matchingword_text, so ?ql_matchingword_text_1, ?ql_matchingword_text_2, etc.?

RobinTF commented 3 weeks ago

Here's what a compliant regular expression would look like:

https://godbolt.org/z/595399oPn

I'm not sure if this would immediately fix the issue though. ctre seems to be somewhat picky when trying to match Unicode code points (I only got it working using std::u8string_view in the example).

hannahbast commented 3 weeks ago

@RobinTF It's the REGEX that checks whether the variable name is valid and ?ql_matchingword_text_thür is not a valid variable name. We should find another name for this automatic variable. @aindlq suggests ?ql_matchingword_text_1 etc. I would prefer to have a syntax that lets the user choose the variable name because that would be more in the spirit of SPARQL. Any ideas on that?

RobinTF commented 3 weeks ago

Regardless of the solution to this issue, ?ql_matchingword_text_thür really is a perfectly valid variable name according to the SPARQL grammar. See https://www.w3.org/TR/sparql11-query/#rVARNAME for the exact Unicode ranges allowed in variable names. I wrote the new regex according to this exact specification.
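
For readers who want to see the grammar rule spelled out, the following is a rough sketch of a check that follows the VARNAME, PN_CHARS_U, and PN_CHARS_BASE productions of the SPARQL 1.1 grammar. It is not the regex from the godbolt link and not QLever's implementation: it decodes UTF-8 by hand instead of using ctre, all function names are hypothetical, and the decoder is deliberately minimal (it does not reject overlong encodings).

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Decode one UTF-8 code point at `pos` and advance `pos` past it.
// Returns std::nullopt for truncated or malformed sequences.
// Minimal sketch: overlong encodings are NOT rejected.
std::optional<char32_t> decodeUtf8(std::string_view s, std::size_t& pos) {
  unsigned char b0 = static_cast<unsigned char>(s[pos]);
  std::size_t len = 0;
  char32_t cp = 0;
  if (b0 < 0x80) { len = 1; cp = b0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
  else return std::nullopt;
  if (pos + len > s.size()) return std::nullopt;
  for (std::size_t i = 1; i < len; ++i) {
    unsigned char b = static_cast<unsigned char>(s[pos + i]);
    if ((b & 0xC0) != 0x80) return std::nullopt;
    cp = (cp << 6) | (b & 0x3F);
  }
  pos += len;
  return cp;
}

struct Range { char32_t lo, hi; };

// PN_CHARS_BASE ranges from the SPARQL 1.1 grammar.
constexpr Range kPnCharsBase[] = {
    {U'A', U'Z'},       {U'a', U'z'},       {0x00C0, 0x00D6},
    {0x00D8, 0x00F6},   {0x00F8, 0x02FF},   {0x0370, 0x037D},
    {0x037F, 0x1FFF},   {0x200C, 0x200D},   {0x2070, 0x218F},
    {0x2C00, 0x2FEF},   {0x3001, 0xD7FF},   {0xF900, 0xFDCF},
    {0xFDF0, 0xFFFD},   {0x10000, 0xEFFFF}};

// First character of VARNAME: PN_CHARS_U | [0-9].
bool isVarnameStart(char32_t c) {
  if (c == U'_' || (c >= U'0' && c <= U'9')) return true;
  for (const Range& r : kPnCharsBase) {
    if (c >= r.lo && c <= r.hi) return true;
  }
  return false;
}

// Later characters additionally allow U+00B7, U+0300-U+036F, U+203F-U+2040.
bool isVarnameContinue(char32_t c) {
  return isVarnameStart(c) || c == 0x00B7 || (c >= 0x0300 && c <= 0x036F) ||
         (c >= 0x203F && c <= 0x2040);
}

// Hypothetical check for VAR1 / VAR2: '?' or '$' followed by a VARNAME.
bool isValidSparqlVariable(std::string_view s) {
  if (s.size() < 2 || (s[0] != '?' && s[0] != '$')) return false;
  std::size_t pos = 1;
  bool first = true;
  while (pos < s.size()) {
    std::optional<char32_t> cp = decodeUtf8(s, pos);
    if (!cp) return false;
    if (!(first ? isVarnameStart(*cp) : isVarnameContinue(*cp))) return false;
    first = false;
  }
  return true;
}

// Usage sketch: "\xC3\xBC" is the UTF-8 encoding of "ü" (U+00FC).
inline void demoGrammarCheck() {
  assert(isValidSparqlVariable("?ql_matchingword_text_th\xC3\xBCr"));
  assert(!isValidSparqlVariable("?a-b"));
}
```

Under this check the automatically generated name with the umlaut is accepted, while characters outside the listed ranges, such as `-` or the multiplication sign U+00D7, are still rejected.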

hannahbast commented 3 weeks ago

@RobinTF Thanks a lot for pointing that out. I wasn't aware of that. Then we should indeed fix the REGEX.

The WDQS indeed accepts such variable names. Here are two example queries: https://w.wiki/Atfv (German umlaut) and https://w.wiki/AuUM (Devanagari)