Custom completion queries

jbuerklin commented 4 years ago

Changes Backend settings such that whole queries for autocompletions can be entered from start to end, instead of only defining some blocks that are then filled into predefined SELECT { ... } GROUP BY statements.
This branch needs a python manage.py migrate if you're switching here from master.

Note though, that this will render your qleverui.sqlite3 unusable in the master branch.
You should therefore create a backup of your qleverui.sqlite3 before migrating.
You could also start fresh with an empty database and only import the example settings.

Importing the example settings
In order to get a quick first impression, I advise you to import our example settings by logging in to /admin/, clicking on Backends / Examples / Prefixes and importing the respective *-sample.csv file. The example files already implement the new Backend settings.

What has changed?
When editing a backend, the three settings

Suggest subjects clause
Suggest predicates clause
Suggest objects clause

now accept whole SPARQL queries as input. These queries will be executed when retrieving completions.
In order to make this work, we needed to introduce some kind of template syntax that would make it possible to factor in the user's current query context for each query.
To explain this syntax, we'll have a look at the Suggest objects clause as it is used in the sample settings:

%PREFIXES%
SELECT ?qleverui_entity (SAMPLE(?qleverui_name) as ?qleverui_name) (SAMPLE(?qleverui_altname) as ?qleverui_altname) (SAMPLE(?qleverui_count) as ?qleverui_count) WHERE {
# IF !CURRENT_WORD_EMPTY #
  {
  {
# ENDIF #
    {
      SELECT ?qleverui_entity (COUNT(?qleverui_entity) AS ?qleverui_count) WHERE {
        %CONNECTED_LINES%
        %CURRENT_SUBJECT% %CURRENT_PREDICATE% ?qleverui_entity .
      }
      GROUP BY ?qleverui_entity
# IF !CURRENT_WORD_EMPTY #
      HAVING regex(?qleverui_entity, '^%<CURRENT_WORD%')
# ENDIF #
    }
    OPTIONAL {
      ?qleverui_entity @en@<http://www.w3.org/2000/01/rdf-schema#label> ?qleverui_name .
    }
    OPTIONAL {
      ?qleverui_entity @en@<http://www.w3.org/2004/02/skos/core#altLabel> ?qleverui_altname .
    }
  }
# IF !CURRENT_WORD_EMPTY #
  UNION
  {
    {
      SELECT ?qleverui_entity (COUNT(?qleverui_entity) AS ?qleverui_count) WHERE {
        %CONNECTED_LINES%
        %CURRENT_SUBJECT% %CURRENT_PREDICATE% ?qleverui_entity .
      }
      GROUP BY ?qleverui_entity
    }
    ?qleverui_entity @en@<http://www.w3.org/2000/01/rdf-schema#label> ?qleverui_name .
    FILTER regex(?qleverui_name, '^"%CURRENT_WORD%')
  }
  }
  UNION
  {
    {
      SELECT ?qleverui_entity (COUNT(?qleverui_entity) AS ?qleverui_count) WHERE {
        %CONNECTED_LINES%
        %CURRENT_SUBJECT% %CURRENT_PREDICATE% ?qleverui_entity .
      }
      GROUP BY ?qleverui_entity
    }
    ?qleverui_entity @en@<http://www.w3.org/2004/02/skos/core#altLabel> ?qleverui_altname .
    FILTER regex(?qleverui_altname, '^"%CURRENT_WORD%')
    OPTIONAL {
      ?qleverui_entity @en@<http://www.w3.org/2000/01/rdf-schema#label> ?qleverui_name .
    }
  }
}
# ENDIF #
GROUP BY ?qleverui_entity
ORDER BY DESC(?qleverui_count)

1. %CURRENT_SUBJECT%, %CURRENT_PREDICATE% and %CURRENT_WORD%
The current line of the query the user is typing will be split into these placeholders.
Examples:
current line: ?c wdt:P31 coun[cursor]
%CURRENT_SUBJECT% = ?c %CURRENT_PREDICATE% = wdt:P31 %CURRENT_WORD% = coun

current line: ?c inst[cursor]
%CURRENT_SUBJECT% = ?c %CURRENT_PREDICATE% = inst %CURRENT_WORD% = inst

current line: ?c[cursor]
%CURRENT_SUBJECT% = ?c %CURRENT_PREDICATE% = [not defined] %CURRENT_WORD% = ?c
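
A minimal Python sketch of the split shown in the three examples above, assuming plain whitespace tokenization. The function name split_current_line is hypothetical, and the real QLever UI code may handle literals containing spaces and other edge cases differently.

```python
def split_current_line(line_up_to_cursor):
    """Split the text before the cursor into the three placeholders.

    Assumption: tokens are separated by whitespace, and a trailing space
    means that the user has not started typing a new word yet.
    """
    tokens = line_up_to_cursor.split()
    started_new_word = bool(tokens) and not line_up_to_cursor[-1].isspace()
    return {
        "CURRENT_SUBJECT": tokens[0] if tokens else None,
        # The predicate is the second token, even if it is still being typed.
        "CURRENT_PREDICATE": tokens[1] if len(tokens) > 1 else None,
        # The word is whatever token the cursor is currently attached to.
        "CURRENT_WORD": tokens[-1] if started_new_word else "",
    }
```

With this sketch, "?c inst" yields the same value for the predicate and the word, and "?c" leaves the predicate undefined (None), matching the examples.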

2. %<CURRENT_WORD%
Same as %CURRENT_WORD%, but prepends a < if %CURRENT_WORD% doesn't start with < or ".
Can be helpful in combination with HAVING and KBs such as FreebaseEasy where you don't want to always type the < in order for autocompletion to work.
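
The rule for %<CURRENT_WORD% can be sketched in a few lines of Python. The helper name with_leading_angle_bracket is made up for illustration and is not taken from the QLever UI code.

```python
def with_leading_angle_bracket(current_word):
    """Return the value of %<CURRENT_WORD% for a given %CURRENT_WORD%.

    A "<" is prepended unless the word already starts with "<" or with a
    double quote (the start of a literal).
    """
    if current_word.startswith(("<", '"')):
        return current_word
    return "<" + current_word
```

So typing Albert in a FreebaseEasy-style KB would search for the prefix <Albert, while <Albert and "Albert are passed through unchanged.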

3. # IF #, # ELSE # and # ENDIF #
Can be used to alter the completion query depending on the user's current input.
Text inside an # IF # or # ELSE # block will be ignored if the given condition is not satisfied.
Defining an # ELSE # block is optional.
IF / ELSE / ENDIF statements can be nested.
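
To illustrate how nested # IF # / # ELSE # / # ENDIF # blocks can be resolved, here is a rough Python sketch that uses a stack with one entry per open # IF #. It is not the actual implementation: the function name apply_conditionals and the condition_is_true callback are assumptions, and the sketch expects every directive to be on a line of its own.

```python
import re

# Matches "# IF <condition> #", "# ELSE #" and "# ENDIF #" on their own line.
# Ordinary SPARQL comments do not end with "#" and are therefore left alone.
DIRECTIVE = re.compile(r"^\s*#\s*(?:IF\s+(?P<cond>.+?)|(?P<kw>ELSE|ENDIF))\s*#\s*$")


def apply_conditionals(template, condition_is_true):
    """Keep only the template lines whose enclosing blocks are all active.

    condition_is_true is a callback (hypothetical) that maps the text of a
    condition, for example "!CURRENT_WORD_EMPTY", to True or False.
    """
    kept = []
    stack = []  # one (condition_value, in_else_branch) pair per open IF

    def active():
        # An IF branch is active if its condition holds, an ELSE branch if not.
        return all(value != in_else for value, in_else in stack)

    for line in template.splitlines():
        match = DIRECTIVE.match(line)
        if match is None:
            if active():
                kept.append(line)
        elif match.group("cond") is not None:
            stack.append((bool(condition_is_true(match.group("cond"))), False))
        elif not stack:
            raise ValueError("# " + match.group("kw") + " # without # IF #")
        elif match.group("kw") == "ELSE":
            value, _ = stack.pop()
            stack.append((value, True))
        else:  # ENDIF closes the innermost open block
            stack.pop()
    if stack:
        raise ValueError("missing # ENDIF #")
    return "\n".join(kept)
```

A line is kept only if every enclosing block is active, which is what makes nesting work without any special casing.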

4. Conditions
Available conditions for # IF # statements are as follows:

CURRENT_WORD_EMPTY : true if the user hasn't started typing a new word
CURRENT_SUBJECT_VARIABLE : true if %CURRENT_SUBJECT% is a variable
CURRENT_PREDICATE_VARIABLE : true if %CURRENT_PREDICATE% is a variable
CONNECTED_TRIPLES_EMPTY : true if %CONNECTED_TRIPLES% is empty

These conditions can be combined into logical expressions of arbitrary length using

OR - logical or (binds weakest)
AND - logical and (binds stronger than OR)
! - negation (binds stronger than AND)

Example:

# IF !CURRENT_WORD_EMPTY OR CURRENT_SUBJECT_VARIABLE AND CURRENT_PREDICATE_VARIABLE #
    # Text inside this block will be used for the query if the condition above evaluates to true
    [...]
# ELSE #
    # Text inside this block will be used otherwise
    [...]
# ENDIF #
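
The precedence rules (! before AND before OR) can be sketched as a small recursive-descent evaluator in Python. This only illustrates the semantics described above, not the code used in QLever UI; the function name evaluate_condition and the flags dictionary are assumptions.

```python
def evaluate_condition(expression, flags):
    """Evaluate an expression such as "!A OR B AND C".

    flags maps condition names (for example "CURRENT_WORD_EMPTY") to booleans.
    Precedence, from weakest to strongest: OR, AND, !.
    """
    tokens = expression.replace("!", " ! ").split()
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        nonlocal pos
        value = parse_not()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "!":
            return not parse_not()
        if token not in flags:
            raise ValueError("unknown condition: " + token)
        return bool(flags[token])

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result
```

Under these rules, the example condition above is read as (!CURRENT_WORD_EMPTY) OR (CURRENT_SUBJECT_VARIABLE AND CURRENT_PREDICATE_VARIABLE).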

5. %PREFIXES%
Inserts the prefix declarations the user has made.

6. %CONNECTED_TRIPLES%
Inserts the lines of the user's query that are connected to %CURRENT_WORD%.
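
One way to compute "connected" is the definition given in the review further down this thread: treat each triple as a node, link two triples if they share a variable, and take the connected component of the current triple. The Python sketch below follows that definition. The function name connected_triples is hypothetical, triples are assumed to be given as one string per line, and the current triple itself is assumed not to be part of the result.

```python
import re

VARIABLE = re.compile(r"[?$]\w+")  # SPARQL variables such as ?c or $c


def connected_triples(other_triples, current_triple):
    """Return the triples that share variables, directly or via a chain of
    other triples, with the triple the user is currently typing."""
    reached = set(VARIABLE.findall(current_triple))
    component = set()
    changed = True
    while changed:  # repeat until no further triple can be reached
        changed = False
        for index, triple in enumerate(other_triples):
            variables = set(VARIABLE.findall(triple))
            if index not in component and variables & reached:
                component.add(index)
                reached |= variables
                changed = True
    # Preserve the order in which the user wrote the triples.
    return [t for i, t in enumerate(other_triples) if i in component]
```

Under these assumptions, CONNECTED_TRIPLES_EMPTY corresponds to this function returning an empty list.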

Further hints

Do not use LIMIT and OFFSET to limit the result. QLeverUI will do this by itself.
Currently, the UI expects the variable containing the suggestions to always be first in the SELECT clause (like above: SELECT ?qleverui_entity [...]). It does not need to be named ?qleverui_entity though.
The variables for canonical and alternative names for suggestions, however, must be called ?qleverui_name and ?qleverui_altname. Their position in the SELECT clause does not matter.

These last two restrictions will be changed in the future.

What has not changed
None of the settings in the Showing names category have been changed. These are now only needed for the tooltips when hovering the mouse over an entity.
It remains an open question whether they can stay the way they are or need to be changed to be more customizable, too.
The Alternative [...] name clause settings are not needed anymore and will be removed later.

hannahbast commented 4 years ago

Thank you very much for this pull request. I have played around with it quite a bit now. Overall, it works! I have found a few glitches, which were already in the previous code. See the respective (minor) issues.

Here are a few more requests, which are important for using this in practice. They should be relatively simple to implement:

Please fix the %CONNECTED_LINES% and rename it to %CONNECTED_TRIPLES%. When you consider the query body as an undirected graph (where each triple is a node and two triples are connected if they share a variable), then %CONNECTED_TRIPLES% is the connected component containing the current triple. This is not a feature request but a bug fix, see also #16
You currently support only one # IF # condition, namely "CURRENT_WORD". First, please invert its meaning and rename it to CURRENT_WORD_EMPTY. Second, please add the following conditions. They are necessary to capture the complexity of our latest templates for Wikidata:

1.1 CURRENT_SUBJECT_VARIABLE ... true if CURRENT_SUBJECT is a variable
1.2 CURRENT_PREDICATE_VARIABLE ... true if CURRENT_PREDICATE is a variable
1.3 CONNECTED_LINES_EMPTY ... true if %CONNECTED_LINES% is empty, that is, if the connected component of the query graph containing the current triple consists only of that triple
1.4 CONNECTED_LINES_EMPTY_AND_CURRENT_SUBJECT_VARIABLE ... just the logical and of the two respective conditions; of course this would not be needed if arbitrary logical expressions with these conditions were possible, but I think that would be overkill for now. But if you feel that it's relatively easy, go ahead.

Third, for all of these, please also support the negated version. As a syntax, I would suggest !CURRENT_WORD_EMPTY, !CURRENT_SUBJECT_VARIABLE, !CURRENT_PREDICATE_VARIABLE, !CONNECTED_LINES_EMPTY . Given the next item, this is not strictly necessary, but would be very convenient. Again, this would not be needed if arbitrary logical expressions were possible, see the comment in the previous paragraph.
Please also add an # ELSE # construct, so that one can write something like

# IF CURRENT_WORD_EMPTY # ... # ELSE # ... # ENDIF #
Does nesting of these work?

hannahbast commented 4 years ago

One more small request:

The set of suggestions that is always shown (currently ql:contains-entity and ql:contains-word) should also be configurable. For the current index, one would prefer to have none at all, because there is no text part.

hannahbast commented 4 years ago

@jbuerklin Not sure whether you got these comments, so trying again with an @ tag

jbuerklin commented 4 years ago

I got all of them, thank you. I'm a bit busy this week but I will get to it tomorrow.

jbuerklin commented 4 years ago

Done
Logical expressions might not be too hard. I'll look into it.
Yes, nesting does work.
Done. Needs a python manage.py migrate

jbuerklin commented 4 years ago

Renamed CURRENT_WORD to CURRENT_WORD_EMPTY and inverted its meaning. Implemented CURRENT_SUBJECT_VARIABLE, CURRENT_PREDICATE_VARIABLE and CONNECTED_LINES_EMPTY. Negation via ! works for all of them. Adding support for AND and OR should be relatively simple now.
Refactored the code such that adding support for # ELSE # shouldn't be too hard either.

hannahbast commented 4 years ago

I should have written #18 here instead of opening a separate issue. It's in the same spirit as the new field for suggestions that are always shown, and the checkbox should probably be above that field.

jbuerklin commented 4 years ago

Fixed #18 both in master and in this branch. Needs python manage.py migrate

jbuerklin commented 4 years ago

Implemented AND and OR, where AND binds stronger than OR

jbuerklin commented 4 years ago

Added # ELSE # construct and renamed CONNECTED_LINES_EMPTY to CONNECTED_TRIPLES_EMPTY
Also updated my first post with information about IF / ELSE statements, conditions and logical expressions.

hannahbast commented 4 years ago

Thanks a lot, it's working great so far! Here is another minor bug:

In line 349 of backend/static/js/codemirror/modes/sparql/sparql-hint.js, prefixes in word (which is the %CURRENT_WORD% for the templates) are expanded to URL prefixes. However, these URL prefixes typically contain dots. The typical use of %CURRENT_WORD% is in a prefix regex filter such as FILTER regex(?variable, "^%CURRENT_WORD%"). When %CURRENT_WORD% contains dots, these dots have a special meaning in the regex (they match any character, not only a dot). The matches are usually the same, but it prevents the usage of binary search for prefix search in QLever.

Long story short: the . should be escaped. One simple fix that worked for me was to replace line 349 in backend/static/js/codemirror/modes/sparql/sparql-hint.js as follows. But maybe you have a more principled fix. Note the double escaping of the \

// before
word = '<' + word.replace(key + ':', value);
// after: every literal dot in the expanded prefix is escaped
word = '<' + word.replace(key + ':', value.replace(/\./g, "\\."));
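
The effect of the unescaped dots can be reproduced in a few lines of Python. This is only an illustration with Python's re module, not QLever's regex engine; escape_dots is a hypothetical helper and the look-alike IRI is made up.

```python
import re


def escape_dots(expanded_prefix):
    """Escape every literal dot so that it only matches a dot in a regex."""
    return expanded_prefix.replace(".", r"\.")


iri_prefix = "<http://www.wikidata.org/prop/direct/"
unescaped = re.compile("^" + iri_prefix)
escaped = re.compile("^" + escape_dots(iri_prefix))

real_iri = "<http://www.wikidata.org/prop/direct/P31>"
lookalike = "<http://wwwXwikidataYorg/prop/direct/P31>"  # made-up string

# Unescaped, each "." matches any character, so the look-alike matches too.
matches_lookalike_unescaped = unescaped.match(lookalike) is not None
# Escaped, the pattern is a plain string prefix and rejects the look-alike.
matches_lookalike_escaped = escaped.match(lookalike) is not None
matches_real_escaped = escaped.match(real_iri) is not None
```

Only the escaped pattern is a pure string prefix, which is the property that the binary search for prefix search relies on.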

hannahbast commented 4 years ago

Here is another minor issue:

The previous version of the UI did not make suggestions when there were no connected triples and no character has been typed yet. In particular, this is the situation at the very beginning of every SPARQL query body.

The current version shows suggestions in this situation, but they don't make too much sense.

I would suggest to add a checkbox to configure whether one wants suggestions in this situation (if yes, one can control which ones via the # IF ... # directives and with the new variables) or not.

jbuerklin commented 4 years ago

Added escaping for %CURRENT_WORD%

> The previous version of the UI did not make suggestions when there were no connected triples and no character has been typed yet. In particular, this is the situation at the very beginning of every SPARQL query body.

I'll look into that later

jbuerklin commented 4 years ago

> I would suggest to add a checkbox to configure whether one wants suggestions in this situation (if yes, one can control which ones via the # IF ... # directives and with the new variables) or not.

I added that in cc80048. Needs a python manage.py migrate.
The new option is called Suggest subjects in empty lines.

jbuerklin commented 4 years ago

Should we merge this branch into master now, or are there any major concerns left?

hannahbast commented 3 years ago

We have played around with the templates a lot in the last months. The templates and the template substitution work very well, thank you! Before merging this into the master I have a question about the other fields in the backend configuration and the other modes:

When I select "2. Context-Insensitive suggestions", where do the suggestions come from? In our current configuration, the SPARQL query for context-insensitive predicate or object suggestions is simply empty (or, rather, incomplete because there is only the LIMIT 40 OFFSET 0 that is always appended to the end). Maybe this is related to the answers to the next points.
What is the purpose of all the fields in the section "Showing names"? Are these still used for something? If no, the fields should be removed. If yes, there should not be just snippets of SPARQL queries here, but whole SPARQL queries, like in the section "Backend suggestions". And the "Need help?" text should be improved because I currently do not understand it.
The section "Preprocessing" with only field "Source Path" is no longer needed for anything, right? If that is the case, the section should be removed.
What are the entries in the field "Replace predicates in autocompletion context" supposed to do? I tried the replacements suggested under "Need help?" but that didn't seem to have any effect when I used one of those predicates (e.g. rdfs:label) in the query.

hannahbast commented 3 years ago

Another issue that is maybe related to this PR and maybe deserves a separate change is the following:

We realized over the past months that we simply cannot make all context-sensitive suggestions fast enough. Most of them are very fast, but every once in a while one has to wait many seconds and sometimes even a minute or two for the suggestions to come. Sometimes, a suggestion query also fails altogether because it is simply too hard or requires too much memory. Any of these happens frequently enough (at least for Wikidata) that it is quite annoying when using the autocompletion for writing queries. It also gives the misleading feeling that the autocompletion concept is broken.
Since we have both context-sensitive and context-insensitive suggestions, this is actually relatively easy to fix. Namely, there should be a fourth mode, which launches both a context-sensitive and a context-insensitive query. When a certain amount of time has passed (this threshold should be configurable, a good default value is maybe 2 seconds) and the context-sensitive query hasn't finished, the results from the context-insensitive query should be taken instead (as soon as they are there, but these queries are usually very fast).
1. We are actually about to extend the QLever API by a timeout argument. However, that is not so easy, and it might take much longer than the specified timeout for a query to actually finish and report that it took longer than the threshold. So the timing should actually be done in the UI. If the query takes longer, its output should simply be ignored, whenever it returns.
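
The proposed fourth mode could look roughly like the following Python sketch, with the timing done on the caller's side as suggested. Everything here is an assumption for illustration: the function names are made up, the two stand-in query functions only simulate a slow and a fast backend call, and the real UI would implement this in its JavaScript frontend.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def suggestions_with_fallback(run_sensitive, run_insensitive, timeout_seconds=2.0):
    """Launch both queries at once and prefer the context-sensitive result.

    If the context-sensitive query fails or has not finished after
    timeout_seconds, the context-insensitive result is returned instead and
    the late result is simply ignored whenever it arrives.
    """
    pool = ThreadPoolExecutor(max_workers=2)
    sensitive = pool.submit(run_sensitive)
    insensitive = pool.submit(run_insensitive)
    try:
        return sensitive.result(timeout=timeout_seconds)
    except Exception:  # timeout, or the query failed altogether
        return insensitive.result()
    finally:
        pool.shutdown(wait=False)  # do not block on the slow query


# Stand-ins for real backend calls (hypothetical).
def slow_context_sensitive():
    time.sleep(0.3)
    return ["context-sensitive suggestion"]


def fast_context_insensitive():
    return ["context-insensitive suggestion"]
```

Calling suggestions_with_fallback(slow_context_sensitive, fast_context_insensitive, timeout_seconds=0.05) returns the context-insensitive list, because the slow query misses the threshold.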

ad-freiburg / qlever-ui

Custom completion queries #10