fnielsen closed this issue 3 years ago
Perhaps a SPARQL-based approach like the one used in Ordia would be better than the current shaky Python-based approach. The Ordia approach is the one behind text-to-lexemes, https://ordia.toolforge.org/text-to-lexemes, but it cannot handle n-grams at the moment, so multiword phrases are an issue. This could be changed with another tokenizer, but what about long phrases such as "functional magnetic resonance imaging"? Hmmm...
fMRI currently works: https://scholia.toolforge.org/text-to-topics?text=functional magnetic resonance imaging
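To make the n-gram concern concrete, here is a minimal sketch of a greedy longest-match lookup over word n-grams. It is not the Scholia or Ordia implementation; `find_topics`, `max_ngram`, the label dictionary and the `Q_...` identifiers are all hypothetical names made up for illustration.

```python
import re


def find_topics(text, label_to_qid, max_ngram=5):
    """Greedy longest-match lookup of word n-grams in a label dictionary."""
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in label_to_qid:
                matches.append((phrase, label_to_qid[phrase]))
                i += n  # Skip past the matched phrase.
                break
        else:
            # No known phrase starts at this token; move on.
            i += 1
    return matches


# Hypothetical labels with placeholder identifiers (not real Wikidata IDs).
labels = {
    "functional magnetic resonance imaging": "Q_FMRI",
    "magnetic resonance imaging": "Q_MRI",
    "memory": "Q_MEMORY",
}
print(find_topics("Functional magnetic resonance imaging of memory", labels))
```

With this strategy the four-word phrase wins over the shorter "magnetic resonance imaging", at the cost of testing up to `max_ngram` candidates per token.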
The simple query takes 6 or 18 seconds:
SELECT DISTINCT ?topic WHERE {
[]
# Disabled because of performance
# wdt:P31 wd:Q13442814 ;
wdt:P921 ?topic .
}
It returns 834,371 results.
A GROUP BY query unfortunately times out:
SELECT (COUNT(*) AS ?count) ?topic WHERE {
[]
# Disabled because of performance
# wdt:P31 wd:Q13442814 ;
wdt:P921 ?topic .
}
GROUP BY ?topic
HAVING(?count > 100)
A GROUP BY without HAVING also times out:
SELECT ?topic WHERE {
[]
# Disabled because of performance
# wdt:P31 wd:Q13442814 ;
wdt:P921 ?topic .
}
GROUP BY ?topic
This version, which only uses a sample of the works, does run:
SELECT ?topic ?topic_label
WITH {
# Find works with a topic
SELECT ?work {
?work wdt:P31 wd:Q13442814 ;
wdt:P921 [] .
}
# The arbitrary limit here is to avoid timeout
LIMIT 200000
} AS %works
WITH {
SELECT (COUNT(?work) AS ?count) ?topic WHERE {
INCLUDE %works
?work wdt:P921 ?topic .
}
GROUP BY ?topic
HAVING(?count > 1)
} AS %topics
WHERE {
INCLUDE %topics
?topic rdfs:label ?topic_label_ . # | skos:altLabel
FILTER(LANG(?topic_label_) = 'en')
BIND(LCASE(?topic_label_) AS ?topic_label)
}
This returns only 4,164 topics and took 24 seconds.
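As a rough sketch of how such a result could feed a text matcher, the function below turns a response in the standard SPARQL JSON results layout into a label-to-identifier dictionary. The variable names `topic` and `topic_label` come from the query above; the function name `bindings_to_label_map` and the sample data, including the Q numbers, are assumptions made up for illustration.

```python
def bindings_to_label_map(sparql_json):
    """Map topic labels to Q identifiers from SPARQL JSON results.

    The labels are assumed to be lower-cased already by LCASE in the query.
    """
    label_to_qid = {}
    for row in sparql_json["results"]["bindings"]:
        # Entity URIs end in the Q identifier, e.g. .../entity/Q100
        qid = row["topic"]["value"].rsplit("/", 1)[-1]
        label = row["topic_label"]["value"]
        # Keep the first identifier seen if two topics share a label.
        label_to_qid.setdefault(label, qid)
    return label_to_qid


# Hand-written sample; the Q numbers are placeholders, not checked
# against Wikidata.
sample = {
    "head": {"vars": ["topic", "topic_label"]},
    "results": {
        "bindings": [
            {
                "topic": {"type": "uri",
                          "value": "http://www.wikidata.org/entity/Q100"},
                "topic_label": {"type": "literal", "value": "neuroimaging"},
            },
            {
                "topic": {"type": "uri",
                          "value": "http://www.wikidata.org/entity/Q200"},
                "topic_label": {"type": "literal", "value": "memory"},
            },
            {
                "topic": {"type": "uri",
                          "value": "http://www.wikidata.org/entity/Q300"},
                "topic_label": {"type": "literal", "value": "memory"},
            },
        ]
    },
}
print(bindings_to_label_map(sample))
```

Keeping only the first identifier per label is a simplification; ambiguous labels would need a real disambiguation step.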
Using DISTINCT instead of GROUP BY is apparently faster: 19 seconds
SELECT ?topic ?topic_label
WITH {
# Find works with a topic
SELECT ?work {
?work wdt:P31 wd:Q13442814 ;
wdt:P921 [] .
}
# The arbitrary limit here is to avoid timeout
LIMIT 200000
} AS %works
WITH {
SELECT DISTINCT ?topic WHERE {
INCLUDE %works
?work wdt:P921 ?topic .
}
} AS %topics
WHERE {
INCLUDE %topics
?topic rdfs:label ?topic_label_ . # | skos:altLabel
FILTER(LANG(?topic_label_) = 'en')
BIND(LCASE(?topic_label_) AS ?topic_label)
}
This returns 9,771 results.
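The two sampled queries are not equivalent, which probably accounts for much of the gap between 4,164 and 9,771: the GROUP BY version keeps only topics with a count above 1, while DISTINCT keeps every topic that occurs at all. The LIMIT without ORDER BY may also pick different works on each run. A small sketch with made-up (work, topic) pairs shows the difference:

```python
from collections import Counter

# Hypothetical (work, topic) pairs standing in for wdt:P921 statements.
pairs = [
    ("W1", "A"),
    ("W2", "A"),
    ("W3", "B"),
    ("W3", "C"),
    ("W4", "C"),
    ("W5", "D"),
]

topic_counts = Counter(topic for _, topic in pairs)

# GROUP BY ?topic HAVING(?count > 1): topics that occur more than once.
having_topics = {t for t, c in topic_counts.items() if c > 1}

# SELECT DISTINCT ?topic: every topic that occurs at all.
distinct_topics = set(topic_counts)

print(sorted(having_topics))    # ['A', 'C']
print(sorted(distinct_topics))  # ['A', 'B', 'C', 'D']
```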
text-to-topics does not test well: a tox run fails with
scholia/text.py:157: KeyError
which hinders local testing of Scholia and PR #1531