Integrate with Ordia's text-to-lexemes

Daniel-Mietchen commented 4 years ago

What I have in mind here is things like checking whether all the lexemes in X are in Wikidata, where X might be a list of:

- publications (by someone, in a given venue, on a given topic etc.)
- publication venues
- events
- event venues
- awards etc.

Daniel-Mietchen commented 4 years ago

X could also be something other than the lists suggested above, perhaps with some pre-processing to improve performance and avoid copyright issues, e.g.:

- a Wikipedia article on the topic
- a full text, e.g. from PMC
- an abstract, e.g. from PubMed

Daniel-Mietchen commented 4 years ago

I took the "Extraction" query from https://tools.wmflabs.org/ordia/text-to-lexemes and combined it with the title word extraction query from https://www.wikidata.org/wiki/Wikidata:University_of_Virginia/Listeria/UVa_people/Common_words_in_titles_of_UVA-coauthored_publications_without_P921_(main_subject)_statement .

I could not get the lexeme part to work yet, but the basic parts of the pipeline work with this query:


SELECT DISTINCT
  ?word ?wordUrl
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 100
  }
  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?w2 and ?w3 may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

  }
  BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)

}
GROUP BY
  ?word ?wordUrl ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word
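
The title word extraction in the query above can be hard to follow, so here is a minimal Python sketch of what the three nested REPLACE/STRAFTER bindings appear to do. It is an illustration under assumptions, not part of the pipeline: the function names (`first_long_word`, `str_after`, `extract_words`) are hypothetical, and Python's `re` module stands in for the regex engine of the query service, which may differ in details.

```python
import re

# Same pattern as in the query: the first run of six or more word characters.
LONG_WORD = re.compile(r"^.*?(\b\w{6,}\b).*$")


def first_long_word(text):
    """Mimic REPLACE(text, pattern, "$1"): the text is returned unchanged if nothing matches."""
    return LONG_WORD.sub(r"\1", text, count=1)


def str_after(text, needle):
    """Mimic SPARQL STRAFTER: the part after the first occurrence of needle, else ""."""
    if needle == "":
        return text  # STRAFTER with an empty second argument returns the first argument
    _head, sep, tail = text.partition(needle)
    return tail if sep else ""


def extract_words(title):
    """Return up to three distinct long words from a title, like ?w1, ?w2 and ?w3."""
    ltitle = title.lower()
    w1 = first_long_word(ltitle)
    w2 = first_long_word(str_after(ltitle, w1))
    w3 = first_long_word(str_after(ltitle, w2))
    # Equivalent of FILTER(REGEX(?word, "^\\w+$")) combined with DISTINCT:
    # empty strings and unmatched remainders containing spaces are dropped.
    return sorted({w for w in (w1, w2, w3) if re.fullmatch(r"\w+", w)})


print(extract_words("Zika virus infection in pregnant women"))
```

Note that this approach yields at most three words per title, and that a one-word title produces the same word for `w1` and `w3`, which is why the de-duplication matters.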

Daniel-Mietchen commented 4 years ago

The full query (i.e. with the lexeme part) that I tried and could not get to work is here:


SELECT DISTINCT
  ?word ?wordUrl
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 100
  }
  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?w2 and ?w3 may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

    ?lexeme wikibase:lexicalCategory ?lexical_category .
    BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
    OPTIONAL {
      ?lexical_category rdfs:label ?lexical_categoryLabel_ .
      FILTER (LANG(?lexical_categoryLabel_) = 'en')
    }
    BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS
     ?lexical_categoryLabel)

    ?lexeme ontolex:lexicalForm ?form .

    ?lexeme wikibase:lemma ?lexemeLabel .

    OPTIONAL {
      ?lexeme ontolex:sense ?sense .
      BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
      OPTIONAL {
        ?sense wdt:P18 ?images .
      }
    }

  }
  BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)

}
GROUP BY
  ?word ?wordUrl ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word

Daniel-Mietchen commented 4 years ago

I left a note at https://www.wikidata.org/w/index.php?title=Wikidata:Request_a_query&oldid=1130905588#Getting_the_lexeme_for_a_given_form .

Daniel-Mietchen commented 4 years ago

Got some great responses there. While not precisely what I was after, I think they can help in other ways.

Daniel-Mietchen commented 4 years ago

Finally got the problematic query to work by making some more clauses OPTIONAL:

################
# Checking whether strings from the titles of publications already exist as lexemes
# The query has three parts:
#   I - get a list of publications on a given topic
#  II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################

SELECT DISTINCT
  ?word ?wordUrl
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {

#   I - get a list of publications on a given topic

  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;  # Zika virus
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 200
  }

#  II - extract strings from the titles

  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?w2 and ?w3 may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

    OPTIONAL { ?lexeme ontolex:lexicalForm ?form . }

    OPTIONAL {
      ?lexeme wikibase:lexicalCategory ?lexical_category .
      BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
      OPTIONAL {
        ?lexical_category rdfs:label ?lexical_categoryLabel_ .
        FILTER (LANG(?lexical_categoryLabel_) = 'en')
      }
      BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS ?lexical_categoryLabel)
      ?lexeme wikibase:lemma ?lexemeLabel .
    }

    OPTIONAL {
      ?lexeme ontolex:sense ?sense .
      BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
      OPTIONAL {
        ?sense wdt:P18 ?images .
      }
    }

  }
  BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)
}
GROUP BY
  ?word ?wordUrl ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word

For our use cases, it might actually make sense to split the functionality into two queries, one for strings that match existing lexemes and one for strings that do not. Both are given below.

Query for existing lexemes:

################
# Checking whether strings from the titles of publications already exist as lexemes
# The query has three parts:
#   I - get a list of publications on a given topic
#  II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################

SELECT DISTINCT
  ?word 
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {

#   I - get a list of publications on a given topic

  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;  # Zika virus
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 200
  }

#  II - extract strings from the titles

  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?w2 and ?w3 may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

    OPTIONAL { ?lexeme ontolex:lexicalForm ?form . }

    OPTIONAL {
      ?lexeme wikibase:lexicalCategory ?lexical_category .
      BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
      OPTIONAL {
        ?lexical_category rdfs:label ?lexical_categoryLabel_ .
        FILTER (LANG(?lexical_categoryLabel_) = 'en')
      }
      BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS ?lexical_categoryLabel)
      ?lexeme wikibase:lemma ?lexemeLabel .
    }

    OPTIONAL {
      ?lexeme ontolex:sense ?sense .
      BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
      OPTIONAL {
        ?sense wdt:P18 ?images .
      }
    }

  }
  FILTER(BOUND(?form))
}
GROUP BY
  ?word ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word

Query for strings that do not correspond to an existing Wikidata lexeme:


################
# Checking whether strings from the titles of publications already exist as lexemes
# The query has three parts:
#   I - get a list of publications on a given topic
#  II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################

SELECT DISTINCT
  ?word ?wordUrl
WHERE {

#   I - get a list of publications on a given topic

  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;  # Zika virus
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 200
  }

#  II - extract strings from the titles

  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?w2 and ?w3 may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes

  OPTIONAL {
    ?form ontolex:representation ?word . 
  }
  FILTER(!BOUND(?form))
  BIND(CONCAT("search?language=en&q=", ?word) AS ?wordUrl)
}
GROUP BY
  ?word ?wordUrl 
ORDER BY ?word
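
To make the split between the two queries above concrete, here is a small Python sketch of the same idea on in-memory data. It is only an illustration: `split_by_lexeme_coverage` and `known_forms` are hypothetical names, and the set of known forms stands in for the `?form ontolex:representation ?word` lookup that the queries perform against Wikidata. The relative search URL mirrors the `?wordUrl` binding from the query.

```python
def split_by_lexeme_coverage(words, known_forms):
    """Split words into those with a matching form and those without.

    known_forms is an assumed, pre-fetched set of form representations;
    the real queries look these up in Wikidata instead.
    """
    existing = []
    missing = {}
    for word in sorted(set(words)):
        if word in known_forms:
            existing.append(word)
        else:
            # Same relative search link as ?wordUrl in the second query.
            missing[word] = "search?language=en&q=" + word
    return existing, missing


existing, missing = split_by_lexeme_coverage(
    ["infection", "zikv", "infection"],
    known_forms={"infection"},
)
print(existing)
print(missing)
```

Keeping the two result sets apart means the list of missing strings can be reviewed on its own, for instance when deciding which lexemes to create.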

Daniel-Mietchen commented 4 years ago

Played around some more with the queries I got from Request a Query.

The query that counts occurrences of certain elements in a string is rather quick and can be used to get things like:

- distribution of number of words in titles
  - of publications
    - on a given topic
    - by people affiliated with a particular institution
    - etc.
  - of all clinical trials with an NCT ID
  - etc.
- distribution of number of branches in chemical compounds, as derived from their canonical SMILES
- etc.
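
As a rough illustration of the kind of distribution meant here, the following Python sketch counts words per title and tallies the results. The counting query itself is not reproduced in this thread, so this is not a translation of it; the function names are hypothetical, and treating each "(" in a SMILES string as one branch is a simplifying assumption.

```python
from collections import Counter


def word_count_distribution(titles):
    """Map number of whitespace-separated words to number of titles."""
    return Counter(len(title.split()) for title in titles)


def count_occurrences(text, element):
    """Count non-overlapping occurrences of element in text."""
    return text.count(element)


titles = ["Zika virus", "Zika virus infection", "Dengue fever"]
print(word_count_distribution(titles))

# Assumption: one opening parenthesis per branch in a canonical SMILES string.
print(count_occurrences("CC(C)C", "("))
```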

The query that checks words in publication titles as to whether they exist as Wikidata forms and lexemes has lots of possibilities too, e.g.:

- Popular strings in titles of works with a given author name string
- Word cloud of lexemes of journals by a publisher that do not have main subject statements
  - such word clouds could be handy on a number of Scholia profiles or their missing pages, e.g.
    - for institutions
    - for topics
    - for clinical trials
      - here, I was experimenting further with the filtering and ranking, so as to mimic things like TF/IDF to tease out signal from rarer stuff
  - need to think of some structured way to handle stop words
- Ordia text-to-lexemes call for lexemes extracted from titles of publications by DTU researchers
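
For readers unfamiliar with TF/IDF, here is a minimal Python sketch of the ranking idea mentioned for the clinical trials word cloud. It is not the filtering used in the linked queries, which are not shown in this thread; the function name, the smoothed IDF formula and the optional `stop_words` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_scores(topic_titles, background_titles, stop_words=frozenset()):
    """Score words in topic_titles by frequency, discounted by how many
    background titles contain them (a smoothed TF-IDF variant)."""
    tf = Counter(
        word
        for title in topic_titles
        for word in title.lower().split()
        if word not in stop_words
    )
    df = Counter()
    for title in background_titles:
        df.update(set(title.lower().split()))
    n_docs = len(background_titles)
    return {
        word: count * math.log((1 + n_docs) / (1 + df[word]))
        for word, count in tf.items()
    }


scores = tfidf_scores(
    ["zika virus outbreak", "zika virus vaccine"],
    ["zika virus outbreak", "dengue virus", "influenza virus vaccine", "virus evolution"],
)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 3))
```

In this toy example, "virus" occurs in every background title and therefore scores zero, while the rarer "zika" ranks highest. An explicit stop word list would still be useful for words that are frequent but not universal.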

Daniel-Mietchen commented 4 years ago

The latter query has a part that relies on a long string made up of repetitions of " Z Z". I am thinking about replacing that string with a more compact representation, e.g. as a binary number, Base32 or similar.

Daniel-Mietchen commented 4 years ago

I set up Q87572880, which has a numerical value that can be represented as a sequence of 50 repetitions of "10".

Daniel-Mietchen commented 4 years ago

Before setting up Q87572880, I tried to use Q87513951, whose binary representation has the same string as the decimal one of Q87572880.
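
To spell out the relationship between the two numbers, here is a short Python sketch. It only restates the arithmetic described above (the same 100-character string of "10" repeated 50 times, read once as decimal digits and once as binary digits); the variable names are hypothetical, and how the query turns the numeric value back into a string is not shown in this thread.

```python
pattern = "10" * 50  # 50 repetitions of "10", i.e. 100 characters

# Reading the digits in base 10 gives a value whose decimal
# representation is the pattern (the case described for Q87572880).
decimal_reading = int(pattern)

# Reading the same digits in base 2 gives a much smaller value whose
# binary representation is the pattern (the case described for Q87513951).
binary_reading = int(pattern, 2)

print(len(pattern))
print(str(decimal_reading) == pattern)
print(bin(binary_reading)[2:] == pattern)
```

The binary reading has a closed form, 2 * (4**50 - 1) / 3, since the set bits sit at every odd position from 1 to 99.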

Daniel-Mietchen commented 4 years ago

Here is the "Word cloud for topic" query from above, using the numeric value of Q87572880.

Daniel-Mietchen commented 4 years ago

A variant of the query for COVID-19.

Daniel-Mietchen commented 2 years ago

I received an inquiry regarding the purpose of the item Q87572880, as per its talk page.

WDscholia / scholia

Integrate with Ordia's text-to-lexemes #955