Closed aindlq closed 2 weeks ago
A regex is used to check the validity of variable names. The underlying problem is that the regex only matches ASCII characters for \w. Variable names can consist of more characters, including (but not limited to) umlauts. The regex here is stricter than the SPARQL grammar.
?text ql:contains-word "thür*"
creates a new variable, ?ql_matchingword_text_thür, behind the scenes. This triggers the root cause and leads to the error.
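To illustrate the ASCII-only behaviour of \w described above, here is a minimal Python sketch. It is not QLever's actual check (which is C++); the pattern `\?\w+` is an assumption chosen only to show the effect, and Python's Unicode-aware \w is not identical to the SPARQL grammar either.

```python
import re

name = "?ql_matchingword_text_thür"

# Hypothetical ASCII-only check: with re.ASCII, \w means [A-Za-z0-9_].
ascii_only = re.compile(r"\?\w+", re.ASCII)

# Same pattern with Python's default Unicode semantics for \w.
unicode_aware = re.compile(r"\?\w+")

print(ascii_only.fullmatch(name) is None)         # True: "ü" is rejected
print(unicode_aware.fullmatch(name) is not None)  # True: "ü" is accepted
```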
Why not use an index number for ?ql_matchingword_text, so ?ql_matchingword_text_1, ?ql_matchingword_text_2, etc.?
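A rough sketch of what index-based naming could look like, written in Python purely for illustration. The function name and its parameters are hypothetical and not part of QLever; it assumes the word patterns for one text variable are distinct.

```python
def indexed_matching_word_variables(text_var, word_patterns):
    """Map each word pattern to an ASCII-only, index-based variable name.

    text_var is the text variable name without the leading "?".
    """
    prefix = f"?ql_matchingword_{text_var}_"
    return {
        pattern: f"{prefix}{index}"
        for index, pattern in enumerate(word_patterns, start=1)
    }


names = indexed_matching_word_variables("text", ["thür*", "frei*"])
print(names["thür*"])  # ?ql_matchingword_text_1
print(names["frei*"])  # ?ql_matchingword_text_2
```

The generated names no longer depend on the characters in the search word, so they would pass even an ASCII-only check, at the cost of being less self-describing.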
Here's what a compliant regex would look like:
https://godbolt.org/z/595399oPn
I'm not sure if this would immediately fix the issue though; ctre seems to be somewhat picky when trying to match Unicode code points (I only got it working using std::u8string_view in the example).
@RobinTF It's the REGEX that checks whether the variable name is valid, and ?ql_matchingword_text_thür is not a valid variable name. We should find another name for this automatic variable. @aindlq suggests ?ql_matchingword_text_1 etc. I would prefer to have a syntax that lets the user choose the variable name, because that would be more in the spirit of SPARQL. Any ideas on that?
Regardless of the solution to this issue, ?ql_matchingword_text_thür really is a perfectly valid variable name according to the SPARQL grammar.
See https://www.w3.org/TR/sparql11-query/#rVARNAME for the exact Unicode ranges allowed in variable names.
I wrote the new RegEx according to this exact specification.
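For readers who want to see the ranges spelled out, here is a Python sketch of a variable-name check built from the VARNAME, PN_CHARS_U and PN_CHARS_BASE productions of the SPARQL 1.1 grammar. It is not the regex from the godbolt link and not QLever code; the names `VAR` and `is_valid_variable` are made up for this example.

```python
import re

# PN_CHARS_BASE from the SPARQL 1.1 grammar, as character-class ranges.
PN_CHARS_BASE = (
    r"A-Za-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)
PN_CHARS_U = PN_CHARS_BASE + "_"

# VARNAME ::= ( PN_CHARS_U | [0-9] )
#             ( PN_CHARS_U | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040] )*
FIRST = rf"[{PN_CHARS_U}0-9]"
REST = rf"[{PN_CHARS_U}0-9\u00B7\u0300-\u036F\u203F-\u2040]"

# VAR1 ::= '?' VARNAME, VAR2 ::= '$' VARNAME
VAR = re.compile(rf"[?$]{FIRST}{REST}*")


def is_valid_variable(name: str) -> bool:
    """Return True if name is a variable according to the grammar above."""
    return VAR.fullmatch(name) is not None


print(is_valid_variable("?ql_matchingword_text_thür"))  # True
print(is_valid_variable("?नाम"))                         # True (Devanagari)
print(is_valid_variable("?a-b"))                         # False: "-" is not allowed
```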
@RobinTF Thanks a lot for pointing that out. I wasn't aware of that. Then we should indeed fix the REGEX.
The WDQS indeed accepts such variable names, here are two example queries: https://w.wiki/Atfv (German umlaut) and https://w.wiki/AuUM (Devanagari)
The following query fails:
with:
test.nt
test.wordsfile.tsv
test.docsfile.tsv