WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

text-to-topics does not test well #1645

Closed fnielsen closed 3 years ago

fnielsen commented 3 years ago

text-to-topics does not test well: running tox fails with a KeyError at scholia/text.py:157, which hinders local testing of Scholia and of PR #1531.

fnielsen commented 3 years ago

Perhaps a SPARQL-based approach like the one used in Ordia would be better than the current shaky Python-based approach. The Ordia approach is for text-to-lexemes (https://ordia.toolforge.org/text-to-lexemes) but cannot handle n-grams at the moment, so multiword phrases are an issue. This could be changed with another tokenizer, but what about long phrases such as "functional magnetic resonance imaging"? Hmmm...
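To make the n-gram concern concrete, here is a minimal Python sketch of greedy longest-match lookup over word n-grams. It is not Ordia's or Scholia's actual tokenizer: the function names `tokenize` and `match_phrases`, the `max_ngram` parameter, and the small `labels` set are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match_phrases(text, labels, max_ngram=4):
    """Greedily match the longest known n-gram at each token position.

    `labels` is a set of lowercased topic labels (a hypothetical input here).
    """
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so multiword phrases win over their parts
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in labels:
                matches.append(candidate)
                i += n
                break
        else:
            # No known label starts at this token, so move on
            i += 1
    return matches


# Illustrative label set, not data pulled from Wikidata
labels = {
    "functional magnetic resonance imaging",
    "magnetic resonance imaging",
    "imaging",
}
print(match_phrases("We used functional magnetic resonance imaging.", labels))
```

With `max_ngram=4` the four-word phrase is matched as a whole; with a smaller window only the shorter "magnetic resonance imaging" would be found, which is the limitation a longer phrase would hit.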

fnielsen commented 3 years ago

Ordia's tokenizer

fnielsen commented 3 years ago

fMRI currently works: https://scholia.toolforge.org/text-to-topics?text=functional magnetic resonance imaging
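The link above contains literal spaces in the query string. As a small sketch, the URL could be built with the text percent-encoded using Python's standard library; the helper name `text_to_topics_url` is hypothetical, and only the `text` parameter seen in the link is assumed.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholia.toolforge.org/text-to-topics"


def text_to_topics_url(text):
    """Return a text-to-topics URL with the text form-encoded."""
    # urlencode encodes spaces as "+", the usual query-string form encoding
    return BASE_URL + "?" + urlencode({"text": text})


url = text_to_topics_url("functional magnetic resonance imaging")
print(url)
```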

fnielsen commented 3 years ago

The simple query takes 6 or 18 seconds:

SELECT DISTINCT ?topic WHERE {
    []

    # Disabled because of performance
    # wdt:P31 wd:Q13442814 ;

    wdt:P921 ?topic .
}

It returns 834,371 results.

Unfortunately, a GROUP BY query times out:

SELECT (COUNT(*) AS ?count) ?topic WHERE {
   []

    # Disabled because of performance
    # wdt:P31 wd:Q13442814 ;

    wdt:P921 ?topic .
}
GROUP BY ?topic
HAVING(?count > 100)

A GROUP BY without HAVING also times out:

SELECT ?topic WHERE {
    []

    # Disabled because of performance
    # wdt:P31 wd:Q13442814 ;

    wdt:P921 ?topic .
}
GROUP BY ?topic

fnielsen commented 3 years ago

This version, which uses only a sample of the works, does work:

SELECT ?topic ?topic_label
WITH {
  # Find works with a topic
  SELECT ?work {
    ?work wdt:P31 wd:Q13442814 ;
          wdt:P921 [] .
  }
  # The arbitrary limit here is to avoid timeout
  LIMIT 200000
} AS %works
WITH {
  SELECT (COUNT(?work) AS ?count) ?topic WHERE {
    INCLUDE %works
    ?work wdt:P921 ?topic .
  }
  GROUP BY ?topic
  HAVING(?count > 1)
} AS %topics
WHERE {
  INCLUDE %topics
  ?topic rdfs:label ?topic_label_ . # | skos:altLabel 
  FILTER(LANG(?topic_label_) = 'en')
  BIND(LCASE(?topic_label_) AS ?topic_label)
}

This returns only 4,164 topics and took 24 seconds.

fnielsen commented 3 years ago

Using DISTINCT instead of GROUP BY is apparently faster: 19 seconds.

SELECT ?topic ?topic_label
WITH {
  # Find works with a topic
  SELECT ?work {
    ?work wdt:P31 wd:Q13442814 ;
          wdt:P921 [] .
  }
  # The arbitrary limit here is to avoid timeout
  LIMIT 200000
} AS %works
WITH {
  SELECT DISTINCT ?topic WHERE {
    INCLUDE %works
    ?work wdt:P921 ?topic .
  }
} AS %topics
WHERE {
  INCLUDE %topics
  ?topic rdfs:label ?topic_label_ . # | skos:altLabel 
  FILTER(LANG(?topic_label_) = 'en')
  BIND(LCASE(?topic_label_) AS ?topic_label)
}

This returns 9,771 results.
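As a rough sketch of how the ?topic and ?topic_label rows from such a query could be turned into a lookup table on the Python side, the following assumes the standard SPARQL JSON results layout. The function name `build_label_index` is hypothetical, and the sample rows and QIDs are made-up placeholders, not actual query output.

```python
def build_label_index(sparql_json):
    """Map each lowercased label to the set of topic QIDs that carry it."""
    index = {}
    for row in sparql_json["results"]["bindings"]:
        # The entity URI ends in the QID, e.g. .../entity/Q123
        qid = row["topic"]["value"].rsplit("/", 1)[-1]
        label = row["topic_label"]["value"]
        index.setdefault(label, set()).add(qid)
    return index


# Made-up sample in the SPARQL JSON results layout; the QIDs are placeholders
sample = {
    "head": {"vars": ["topic", "topic_label"]},
    "results": {
        "bindings": [
            {
                "topic": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_PLACEHOLDER_1"},
                "topic_label": {"type": "literal", "xml:lang": "en", "value": "functional magnetic resonance imaging"},
            },
            {
                "topic": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_PLACEHOLDER_2"},
                "topic_label": {"type": "literal", "xml:lang": "en", "value": "imaging"},
            },
        ]
    },
}

index = build_label_index(sample)
# dict.get returns None for an unknown label instead of raising KeyError
print(index.get("functional magnetic resonance imaging"))
print(index.get("unknown phrase"))
```

Storing a set per label keeps the index usable if two topics happen to share the same lowercased label.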