Wikidata-based scholarly profiles
text-to-topics does not test well #1645

fnielsen commented 3 years ago

text-to-topics does not test well: running tox fails with a KeyError at scholia/text.py:157, which hinders local testing of Scholia and of PR #1531.

fnielsen commented 3 years ago

Perhaps a SPARQL-based approach like the one used in Ordia would be better than the current shaky Python-based approach. The Ordia approach is used for text-to-lexemes (https://ordia.toolforge.org/text-to-lexemes) but cannot handle n-grams at the moment, so multiword phrases are an issue. This could be changed with another tokenizer, but what about long phrases such as "functional magnetic resonance imaging"? Hmmm...
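To make the n-gram concern concrete, here is a minimal sketch of greedy longest-match phrase lookup in Python. It is not Scholia's or Ordia's actual tokenizer: the function name `find_phrases`, the `max_ngram` parameter, the whitespace tokenization, and the placeholder identifiers in `labels` are all assumptions made for illustration.

```python
def find_phrases(text, label_to_topic, max_ngram=5):
    """Greedy longest-match of word n-grams against a label lookup.

    `label_to_topic` is assumed to map lowercased labels to topic
    identifiers. Tokenization is a plain whitespace split, so
    punctuation attached to words is not handled here.
    """
    tokens = text.lower().split()
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that multiword labels win
        # over their shorter sub-phrases.
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in label_to_topic:
                matches.append((phrase, label_to_topic[phrase]))
                i += n
                break
        else:
            # No label starts at this token; move on to the next one.
            i += 1
    return matches


# Hypothetical lookup: the identifiers are placeholders, not real items.
labels = {
    "functional magnetic resonance imaging": "Q_FMRI",
    "magnetic resonance imaging": "Q_MRI",
    "brain": "Q_BRAIN",
}
print(find_phrases("Functional magnetic resonance imaging of the brain", labels))
```

With this kind of matching the four-word phrase is found as one unit instead of being split into "magnetic resonance imaging" plus a stray "functional", at the cost of up to `max_ngram` lookups per token.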

fnielsen commented 3 years ago

Ordia's tokenizer

fnielsen commented 3 years ago

fMRI currently works: https://scholia.toolforge.org/text-to-topics?text=functional magnetic resonance imaging

fnielsen commented 3 years ago

The simple query takes 6 or 18 seconds:


    # Disabled because of performance
    # wdt:P31 wd:Q13442814 ;

    wdt:P921 ?topic .

It returns 834,371 results.

Unfortunately, a GROUP BY query times out:

SELECT (COUNT(*) AS ?count) ?topic WHERE {
  ?work
    # Disabled because of performance
    # wdt:P31 wd:Q13442814 ;
    wdt:P921 ?topic .
}
GROUP BY ?topic
HAVING(?count > 100)

A GROUP BY without HAVING also times out:


SELECT (COUNT(*) AS ?count) ?topic WHERE {
  ?work
    # Disabled because of performance
    # wdt:P31 wd:Q13442814 ;
    wdt:P921 ?topic .
}
GROUP BY ?topic

fnielsen commented 3 years ago

This version, where the works are only a sample of all works, does not time out:

SELECT ?topic ?topic_label
WITH {
  # Find works with a topic
  SELECT ?work {
    ?work wdt:P31 wd:Q13442814 ;
          wdt:P921 [] .
  }
  # The arbitrary limit here is to avoid timeout
  LIMIT 200000
} AS %works
WITH {
  SELECT (COUNT(?work) AS ?count) ?topic WHERE {
    INCLUDE %works
    ?work wdt:P921 ?topic .
  }
  GROUP BY ?topic
  HAVING(?count > 1)
} AS %topics
WHERE {
  INCLUDE %topics
  ?topic rdfs:label ?topic_label_ . # | skos:altLabel
  FILTER(LANG(?topic_label_) = 'en')
  BIND(LCASE(?topic_label_) AS ?topic_label)
}

This gives only 4,164 topics and took 24 seconds.

fnielsen commented 3 years ago

Using DISTINCT instead of GROUP BY is apparently faster: 19 seconds.

SELECT ?topic ?topic_label
WITH {
  # Find works with a topic
  SELECT ?work {
    ?work wdt:P31 wd:Q13442814 ;
          wdt:P921 [] .
  }
  # The arbitrary limit here is to avoid timeout
  LIMIT 200000
} AS %works
WITH {
  SELECT DISTINCT ?topic WHERE {
    INCLUDE %works
    ?work wdt:P921 ?topic .
  }
} AS %topics
WHERE {
  INCLUDE %topics
  ?topic rdfs:label ?topic_label_ . # | skos:altLabel
  FILTER(LANG(?topic_label_) = 'en')
  BIND(LCASE(?topic_label_) AS ?topic_label)
}

This gives 9,771 results.
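As a rough sketch of how such a result could feed a text matcher, the Python below turns a response in the standard SPARQL JSON results format into a label-to-identifier lookup. The function name `bindings_to_lookup` and the first-seen-wins handling of duplicate labels are assumptions, not something Scholia currently does, and the `example` response is a hand-written fragment rather than real query output.

```python
def bindings_to_lookup(sparql_json):
    """Map lowercased topic labels to Wikidata Q identifiers.

    Expects the SPARQL JSON results format with the variables
    `topic` (an entity URI) and `topic_label` (a literal).
    """
    lookup = {}
    for binding in sparql_json["results"]["bindings"]:
        uri = binding["topic"]["value"]
        label = binding["topic_label"]["value"]
        # The Q identifier is the last path segment of the entity URI.
        qid = uri.rsplit("/", 1)[-1]
        # Several topics can share a label; keep the first one seen.
        lookup.setdefault(label, qid)
    return lookup


# Hand-written response fragment, for illustration only.
example = {
    "head": {"vars": ["topic", "topic_label"]},
    "results": {"bindings": [
        {"topic": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"},
         "topic_label": {"type": "literal", "xml:lang": "en", "value": "universe"}},
        {"topic": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"},
         "topic_label": {"type": "literal", "xml:lang": "en", "value": "earth"}},
    ]},
}
print(bindings_to_lookup(example))
```

Because the query already lowercases the labels with LCASE, the lookup only needs the input text to be lowercased before matching.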