dpriskorn / LexSrt

The purpose of this script is to get all the senses for all the words in an SRT file from Wikidata
GNU General Public License v3.0

Improvement: Find all forms and cache in a database to increase lookup speed #17

Open · tuukka opened this issue 11 months ago

tuukka commented 11 months ago

Currently, when the input is long, LexSrt can be slow because it makes as many SPARQL queries as there are tokens in the input.

As an example, Ordia combines all the input into one (long) SPARQL query, using the following syntax for the input:

VALUES ?word { "fruit"@en "flies"@en "like"@en "bananas"@en }

Also, spacy_token_to_forms would become spacy_tokens_to_forms to accept all the tokens in one call.
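
A rough sketch of how the batched clause could be built (the helper name and the escaping are only illustrative, not existing LexSrt code):

    def spacy_tokens_to_values(tokens: list[str], lang: str = "en") -> str:
        """Build one VALUES clause covering all tokens, so a single query replaces one per token."""
        literals = []
        for token in tokens:
            escaped = token.replace("\\", "\\\\").replace('"', '\\"')  # keep the SPARQL literal valid
            literals.append(f'"{escaped}"@{lang}')
        return "VALUES ?word { " + " ".join(literals) + " }"


    # spacy_tokens_to_values(["fruit", "flies", "like", "bananas"])
    # -> 'VALUES ?word { "fruit"@en "flies"@en "like"@en "bananas"@en }'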

The tokens won't all share a single lexical category, so that part of the SPARQL query would need to change too.

Option 1: It's possible to write the input for both variables like this:

SELECT DISTINCT ?representation ?lexical_category ?form {
    VALUES ( ?representation ?lexical_category ) {
        ( "flies"@en wd:Q1084 )
        ( "like"@en wd:Q24905 )
    }
    ?lexeme dct:language wd:Q1860 ;
            wikibase:lexicalCategory / wdt:P279* ?lexical_category ;
            ontolex:lexicalForm ?form .
    ?form ontolex:representation ?representation .
}
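
For illustration, the (representation, lexical category) pairs could be generated from the spaCy tokens roughly like this; the POS-to-QID mapping is only a small assumed subset:

    # Assumed partial mapping from spaCy coarse POS tags to Wikidata lexical-category QIDs.
    POS_TO_QID = {
        "NOUN": "Q1084",   # noun
        "VERB": "Q24905",  # verb
        "ADJ": "Q34698",   # adjective
        "ADV": "Q380057",  # adverb
    }


    def tokens_to_values_pairs(tokens, lang: str = "en") -> str:
        """Build the VALUES ( ?representation ?lexical_category ) block for Option 1."""
        rows = []
        for token in tokens:  # spaCy Token objects
            qid = POS_TO_QID.get(token.pos_)
            if qid is None:
                continue  # skip tokens whose category we cannot map
            rows.append(f'( "{token.text}"@{lang} wd:{qid} )')
        return "VALUES ( ?representation ?lexical_category ) { " + " ".join(rows) + " }"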

Option 2: I think it might be better to move the filtering by lexical category outside of the SPARQL query (but this can be a follow-up issue). This way, you can find results even when spaCy has guessed the lexical category wrong. (After the SPARQL query returns results, you can filter out duplicate lexemes where the lexical category doesn't match.)

SELECT DISTINCT ?representation ?lexical_category ?form {
    VALUES ?representation { "flies"@en "like"@en }
    ?lexeme dct:language wd:Q1860 ;
            wikibase:lexicalCategory ?lexical_category ;
            ontolex:lexicalForm ?form .
    ?form ontolex:representation ?representation .
}
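
A rough sketch of that post-filtering, assuming the bindings have been parsed into dicts with representation, lexical_category and form keys (the exact result shape is an assumption):

    from collections import defaultdict


    def filter_by_category(rows, guessed):
        """Option 2 post-filtering: prefer lexemes whose category matches spaCy's guess,
        but fall back to all candidates when nothing matches (i.e. the guess was wrong).

        rows: parsed SPARQL bindings with "representation", "lexical_category" and "form" keys
        guessed: representation -> lexical-category QID guessed by spaCy
        """
        by_word = defaultdict(list)
        for row in rows:
            by_word[row["representation"]].append(row)

        kept = []
        for word, candidates in by_word.items():
            matching = [c for c in candidates
                        if c["lexical_category"].rsplit("/", 1)[-1] == guessed.get(word)]
            kept.extend(matching or candidates)  # keep everything when the guess matched nothing
        return kept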
dpriskorn commented 11 months ago

One advantage of the one-query-per-word approach is that it is easier on the query service, because the queries can be cached.

That probably does not outweigh the disadvantage of slowness for most users.

One possibility is to have a daily updated local database of all forms for LexSrt.

Then we could create another endpoint to force cache invalidation if needed.

That would solve the speed issue and mean hitting WDQS only once per day in total, which is way better and cheaper from a WMF perspective.
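
Rough sketch of the daily-refresh idea, assuming a local SQLite file and a hypothetical rebuild_from_wdqs() helper that does the single daily bulk fetch:

    import os
    import time

    CACHE_DB = "lexsrt_forms.sqlite"   # assumed filename
    MAX_AGE_SECONDS = 24 * 60 * 60     # refresh at most once per day


    def ensure_fresh_cache(force: bool = False) -> None:
        """Rebuild the local form cache when it is missing, stale, or explicitly invalidated."""
        stale = (
            force
            or not os.path.exists(CACHE_DB)
            or time.time() - os.path.getmtime(CACHE_DB) > MAX_AGE_SECONDS
        )
        if stale:
            rebuild_from_wdqs(CACHE_DB)  # hypothetical helper doing the daily bulk fetch


    # A cache-invalidation endpoint could simply call ensure_fresh_cache(force=True).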

dpriskorn commented 11 months ago

A local database could be made using PostgreSQL or SQLite. It would have a form table with the following columns: WMF/spaCy language code, representation, category QID, form ID.

Index on (lang, representation, category).

LexSrt would then use this table for all lookups, which should be blazingly fast.
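
For example, in SQLite it could look roughly like this (the column names are assumptions based on the list above):

    import sqlite3

    conn = sqlite3.connect("lexsrt_forms.sqlite")
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS form (
            lang            TEXT NOT NULL,  -- WMF/spaCy language code, e.g. "en"
            representation  TEXT NOT NULL,  -- the form's spelling
            category_qid    TEXT NOT NULL,  -- lexical category, e.g. "Q1084"
            form_id         TEXT NOT NULL   -- form ID such as "L99-F1"
        );
        CREATE INDEX IF NOT EXISTS idx_form_lookup
            ON form (lang, representation, category_qid);
        """
    )
    conn.commit()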

tuukka commented 11 months ago

I like the database idea - you may want to rename the issue "Find all lexemes ~with one SPARQL query~ with zero SPARQL queries" :wink:

I think using an SQLite library would be simpler and faster than a PostgreSQL service. I downloaded all the data in CSV format and it's 31 megabytes, so it will fit in memory. Link to the download query: https://qlever.cs.uni-freiburg.de/wikidata/VjHnlM
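
Loading that CSV into the table sketched above could look roughly like this; the CSV column names are assumptions since they depend on the export:

    import csv
    import sqlite3


    def load_csv(conn: sqlite3.Connection, path: str) -> None:
        """Bulk-load the CSV dump into the form table (assumed column names)."""
        with open(path, newline="", encoding="utf-8") as f:
            rows = (
                (r["lang"], r["representation"], r["category_qid"], r["form_id"])
                for r in csv.DictReader(f)
            )
            conn.executemany("INSERT INTO form VALUES (?, ?, ?, ?)", rows)
        conn.commit()


    def lookup_forms(conn, lang: str, representation: str, category_qid: str):
        """Indexed lookup that replaces a SPARQL query per token."""
        return conn.execute(
            "SELECT form_id FROM form WHERE lang = ? AND representation = ? AND category_qid = ?",
            (lang, representation, category_qid),
        ).fetchall()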

fnielsen commented 10 months ago

Federating with SERVICE wikibase:mwapi may also be an option.