tuukka opened 11 months ago
One advantage of the one-query-per-word approach is that it is easier on the query service, because each query can be cached.
That probably does not outweigh the disadvantage of slowness for most users.
One possibility is to have a daily updated local database of all forms for LexSrt.
Then we could create another endpoint to force cache invalidation if needed.
That would solve the speed issue and mean hitting WDQS only once per day in total, which is much better and cheaper from a WMF perspective.
A local database could be made using PostgreSQL or SQLite. It would have a `form` table with the following columns:

- WMF/spaCy lang code
- representation
- category QID
- form id

with an index on (lang, representation, category). LexSrt would then use this table for all lookups, which should be blazingly fast.
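A minimal sketch of such a table in SQLite (the table name, column names, and sample data are my own guesses for illustration, not an existing LexSrt schema):

```python
import sqlite3

# In-memory database for the sketch; a real deployment would use a file
# refreshed daily from the downloaded data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE form (
        lang TEXT NOT NULL,            -- WMF/spaCy language code, e.g. 'en'
        representation TEXT NOT NULL,  -- the form as written
        category TEXT NOT NULL,        -- lexical category QID, e.g. 'Q1084' (noun)
        form_id TEXT NOT NULL          -- e.g. 'L7-F1'
    )
""")
conn.execute("CREATE INDEX idx_lookup ON form (lang, representation, category)")

# Illustrative row and lookup (not real Wikidata content):
conn.execute("INSERT INTO form VALUES ('en', 'cat', 'Q1084', 'L7-F1')")
rows = conn.execute(
    "SELECT form_id FROM form WHERE lang=? AND representation=? AND category=?",
    ("en", "cat", "Q1084"),
).fetchall()
print(rows)  # [('L7-F1',)]
```

With the composite index, each token lookup is a single indexed point query, so even long inputs stay fast.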
I like the database idea - you may want to rename the issue "Find all lexemes ~with one Sparql query~ with zero Sparql queries" :wink:
I think using an SQLite library would be simpler and faster than a PostgreSQL service. I downloaded all the data in CSV format and it's 31 megabytes so it will fit in memory. Link to the download query: https://qlever.cs.uni-freiburg.de/wikidata/VjHnlM
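Loading that CSV into an in-memory SQLite database is only a few lines. A sketch (the column layout of the downloaded file is an assumption here, stood in by an inline string):

```python
import csv
import io
import sqlite3

# Stand-in for the downloaded CSV; the real file from the QLever query
# would have one row per form. Column order is assumed for illustration.
csv_data = io.StringIO(
    "lang,representation,category,form_id\n"
    "en,cat,Q1084,L7-F1\n"
    "en,run,Q24905,L8-F2\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE form (lang, representation, category, form_id)")
reader = csv.reader(csv_data)
next(reader)  # skip the header row
conn.executemany("INSERT INTO form VALUES (?, ?, ?, ?)", reader)
conn.execute("CREATE INDEX idx_lookup ON form (lang, representation, category)")

count = conn.execute("SELECT COUNT(*) FROM form").fetchone()[0]
print(count)  # 2
```

At 31 MB the whole table fits comfortably in memory, so the daily refresh can simply rebuild the database from scratch.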
Federating with `SERVICE wikibase:mwapi` may also be an option.
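For reference, a query using that service might look like the following. This is only a sketch: the exact service parameters would need checking against the WDQS MWAPI documentation, and such a query still counts against WDQS:

```python
# Sketch of federating a search through the MWAPI service on WDQS.
# Parameters here follow the commonly documented EntitySearch pattern;
# treat them as an assumption to verify, not tested LexSrt code.
query = """
SELECT ?lexeme WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                    wikibase:api "EntitySearch" ;
                    mwapi:search "cat" ;
                    mwapi:language "en" .
    ?lexeme wikibase:apiOutputItem mwapi:item .
  }
}
"""
print("wikibase:mwapi" in query)  # True
```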
Currently, when the input is long, LexSrt can be slow as it makes as many Sparql queries as there are tokens in the input.
As an example, Ordia combines all the input into one (long) Sparql query using the following syntax for the input:
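The general shape is a single `VALUES` block listing every token (a sketch of the idea; Ordia's actual syntax may differ):

```python
# Sketch: build one query for all tokens with a VALUES block, instead of
# one query per token. Tokens and language tag are illustrative.
tokens = ["the", "cat", "sat"]
values = " ".join(f'"{t}"@en' for t in tokens)
query = f"""
SELECT ?form ?word WHERE {{
  VALUES ?word {{ {values} }}
  ?form ontolex:representation ?word .
}}
"""
print('"cat"@en' in query)  # True
```

One round trip to the query service then covers the whole input.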
Also, `spacy_token_to_forms` would become `spacy_tokens_to_forms` to accept all the tokens in one call. The tokens won't have a single lexical category, so that part of the Sparql query would need to change too.
Option 1: It's possible to write the input for both variables like this:
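For example, (representation, category) pairs can go in one `VALUES` block (a sketch; the tokens and QIDs are illustrative):

```python
# Sketch: pair each token with its spaCy-guessed lexical category QID
# so both variables are bound together in a single VALUES block.
pairs = [("cat", "wd:Q1084"), ("sat", "wd:Q24905")]  # noun, verb (illustrative)
rows = " ".join(f'("{word}"@en {qid})' for word, qid in pairs)
query = f"""
SELECT ?form WHERE {{
  VALUES (?word ?category) {{ {rows} }}
  ?form ontolex:representation ?word .
  ?lexeme ontolex:lexicalForm ?form ;
          wikibase:lexicalCategory ?category .
}}
"""
print('("cat"@en wd:Q1084)' in query)  # True
```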
Option 2: I think it might be better to move the filtering by lexical category to be outside of the Sparql query (but this can be a follow-up issue). This way, you can find results even when spaCy has guessed the lexical category wrong. (After the Sparql query returns results, you can filter out duplicate lexemes where the lexical category doesn't match.)
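A sketch of that post-filtering step (the row shape and function name are hypothetical, not actual LexSrt code):

```python
# Hypothetical result rows: (token, guessed_category, lexeme_category, lexeme_id).
# Prefer rows where the lexeme's category matches the spaCy guess; if none
# match, fall back to all rows so a wrong guess still yields results.
results = [
    ("cat", "Q1084", "Q1084", "L7"),    # noun reading, matches the noun guess
    ("cat", "Q1084", "Q24905", "L99"),  # verb reading of the same token
]

def filter_by_category(rows):
    matching = [r for r in rows if r[1] == r[2]]
    return matching if matching else rows

print(filter_by_category(results))  # [('cat', 'Q1084', 'Q1084', 'L7')]
```

When the guess is right, the mismatched duplicates are dropped; when it is wrong, the unfiltered results are still usable.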