WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

Integrate with Ordia's text-to-lexemes #955

Open Daniel-Mietchen opened 4 years ago

Daniel-Mietchen commented 4 years ago

What I have in mind here is things like checking whether all the lexemes in X are in Wikidata, where X might be a list of

Daniel-Mietchen commented 4 years ago

X might as well be - perhaps with some pre-processing to enhance performance and avoid copyright issues - something else than suggested above, e.g.

Daniel-Mietchen commented 4 years ago

I took the "Extraction" query from https://tools.wmflabs.org/ordia/text-to-lexemes and combined it with the title word extraction query from https://www.wikidata.org/wiki/Wikidata:University_of_Virginia/Listeria/UVa_people/Common_words_in_titles_of_UVA-coauthored_publications_without_P921_(main_subject)_statement .

I could not get the lexeme part to work yet, but the basic parts of the pipeline work with this query:


SELECT DISTINCT
  ?word ?wordUrl
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 100
  }
  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

  }
  BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)

}
GROUP BY
  ?word ?wordUrl ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word

Daniel-Mietchen commented 4 years ago

The full query (i.e. with the lexeme part) that I tried and could not get to work is here:


SELECT DISTINCT
  ?word ?wordUrl
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 100
  }
  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

    ?lexeme wikibase:lexicalCategory ?lexical_category .
    BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
    OPTIONAL {
      ?lexical_category rdfs:label ?lexical_categoryLabel_ .
      FILTER (LANG(?lexical_categoryLabel_) = 'en')
    }
    BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS
     ?lexical_categoryLabel)

    ?lexeme ontolex:lexicalForm ?form .

    ?lexeme wikibase:lemma ?lexemeLabel .

    OPTIONAL {
      ?lexeme ontolex:sense ?sense .
      BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
      OPTIONAL {
        ?sense wdt:P18 ?images .
      }
    }

  }
  BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)

}
GROUP BY
  ?word ?wordUrl ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word

Daniel-Mietchen commented 4 years ago

I left a note at https://www.wikidata.org/w/index.php?title=Wikidata:Request_a_query&oldid=1130905588#Getting_the_lexeme_for_a_given_form .

Daniel-Mietchen commented 4 years ago

Got some great responses there. While not precisely what I was after, I think they can help in other ways.

Daniel-Mietchen commented 4 years ago

Finally got the problematic query to work by making some more clauses OPTIONAL:

################
# Checking whether strings from the titles of publications already exist as lexemes
# The query has three parts:
#   I - get a list of publications on a given topic
#  II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################

SELECT DISTINCT
  ?word ?wordUrl
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {

#   I - get a list of publications on a given topic

  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;  # Zika virus
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 200
  }

#  II - extract strings from the titles

  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

    OPTIONAL { ?lexeme ontolex:lexicalForm ?form . }

    OPTIONAL {
      ?lexeme wikibase:lexicalCategory ?lexical_category .
      BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
      OPTIONAL {
        ?lexical_category rdfs:label ?lexical_categoryLabel_ .
        FILTER (LANG(?lexical_categoryLabel_) = 'en')
      }
      BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS
           ?lexical_categoryLabel)
      ?lexeme wikibase:lemma ?lexemeLabel .
    }

    OPTIONAL {
      ?lexeme ontolex:sense ?sense .
      BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
      OPTIONAL {
        ?sense wdt:P18 ?images .
      }
    }

  }
  BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)
}
GROUP BY
  ?word ?wordUrl ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word
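To make part II easier to reason about outside the query service, here is a minimal Python sketch of the same extraction chain. It is an approximation, not the query itself: the helper names `first_long_word`, `str_after` and `extract_words` are made up for illustration, and Python's `re` module may differ in details from the regex engine behind the Wikidata Query Service.

```python
import re

# Same pattern as in the BIND(REPLACE(...)) lines: capture the first word of 6+ characters.
PATTERN = re.compile(r"^.*?(\b\w{6,}\b).*$")


def first_long_word(text):
    """Mimic REPLACE(text, pattern, "$1"): the first 6+ character word,
    or the text unchanged when nothing matches."""
    return PATTERN.sub(r"\1", text, count=1)


def str_after(text, sep):
    """Mimic SPARQL STRAFTER: the part after the first occurrence of sep,
    the whole text for an empty sep, and "" when sep does not occur."""
    if sep == "":
        return text
    _, found, tail = text.partition(sep)
    return tail if found else ""


def extract_words(title):
    """Return up to three distinct candidate words from a title, sorted."""
    ltitle = title.lower()
    w1 = first_long_word(ltitle)
    w2 = first_long_word(str_after(ltitle, w1))
    w3 = first_long_word(str_after(ltitle, w2))
    # Like FILTER(REGEX(?word, "^\\w+$")): drop empty strings and leftovers with spaces.
    return sorted({w for w in (w1, w2, w3) if re.fullmatch(r"\w+", w)})


print(extract_words("Zika virus infection in pregnant women"))
# ['infection', 'pregnant']
```

Note that `?w3` is computed from the text after `?w2` within the full title, so a one-word title yields the same word for `?w1` and `?w3`; the set in the sketch plays the role of `SELECT DISTINCT`.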

For our use cases, it might actually make sense to split the functionality into two queries, one for strings that already exist as lexeme forms and one for strings that do not. Both are given below:

################
# Checking whether strings from the titles of publications already exist as lexemes
# The query has three parts:
#   I - get a list of publications on a given topic
#  II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################

SELECT DISTINCT
  ?word 
  ?form ?formLabel
  ?lexeme ?lexemeLabel
  ?lexical_category ?lexical_categoryLabel
  (GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
  ?sense ?senseLabel
  (IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=", 
          SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {

#   I - get a list of publications on a given topic

  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;  # Zika virus
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 200
  }

#  II - extract strings from the titles

  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes

  OPTIONAL {
    ?form ontolex:representation ?word . 
    OPTIONAL {
      ?form wikibase:grammaticalFeature ?feature .
      BIND(STR(?feature) AS ?default_featureLabel)
      OPTIONAL {
        ?feature rdfs:label ?featureLabel_ .
        FILTER (LANG(?featureLabel_) = "en")
      }
      BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
    }
    ?form ontolex:representation ?formLabel .

    OPTIONAL { ?lexeme ontolex:lexicalForm ?form . }

    OPTIONAL {
      ?lexeme wikibase:lexicalCategory ?lexical_category .
      BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
      OPTIONAL {
        ?lexical_category rdfs:label ?lexical_categoryLabel_ .
        FILTER (LANG(?lexical_categoryLabel_) = 'en')
      }
      BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS
           ?lexical_categoryLabel)
      ?lexeme wikibase:lemma ?lexemeLabel .
    }

    OPTIONAL {
      ?lexeme ontolex:sense ?sense .
      BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
      OPTIONAL {
        ?sense wdt:P18 ?images .
      }
    }

  }
  FILTER(BOUND(?form))
}
GROUP BY
  ?word ?form ?formLabel
  ?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
  ?sense ?senseLabel
ORDER BY ?word

################
# Listing strings from the titles of publications that do not yet exist as lexeme forms
# The query has three parts:
#   I - get a list of publications on a given topic
#  II - extract strings from the titles
# III - keep only the strings that do not match any Wikidata lexeme form
################

SELECT DISTINCT
  ?word ?wordUrl
WHERE {

#   I - get a list of publications on a given topic

  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x wdt:P921 wd:Q202864 ;  # Zika virus
         wdt:P1476 ?title.
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 200
  }

#  II - extract strings from the titles

  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  VALUES ?w_ { 1 2 3 }
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
  FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles

  FILTER (LANG(?word) = "en")

# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes

  OPTIONAL {
    ?form ontolex:representation ?word . 
  }
  FILTER(!BOUND(?form))
  BIND(CONCAT("search?language=en&q=", ?word) AS ?wordUrl)
}
GROUP BY
  ?word ?wordUrl 
ORDER BY ?word
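
The split between the two queries can also be pictured offline. The following Python sketch is only an illustration under assumptions: `known_forms` is a hypothetical, hand-made set standing in for the form representations that the queries look up in Wikidata, and `split_by_lexeme_coverage` is a made-up helper name. The relative search URL is built the same way as `?wordUrl` in the query.

```python
def split_by_lexeme_coverage(words, known_forms):
    """Partition words into those matching a known form and those that do not.

    Returns (existing, missing), where existing is a sorted list and missing
    maps each uncovered word to the relative search URL used for ?wordUrl.
    """
    unique_words = set(words)
    # Counterpart of FILTER(BOUND(?form))
    existing = sorted(w for w in unique_words if w in known_forms)
    # Counterpart of FILTER(!BOUND(?form)) plus the ?wordUrl BIND
    missing = {
        w: "search?language=en&q=" + w
        for w in sorted(unique_words - set(known_forms))
    }
    return existing, missing


# Hypothetical sample data, not taken from Wikidata.
existing, missing = split_by_lexeme_coverage(
    ["infection", "microcephaly", "pregnant", "infection"],
    known_forms={"infection", "pregnant"},
)
print(existing)  # ['infection', 'pregnant']
print(missing)   # {'microcephaly': 'search?language=en&q=microcephaly'}
```
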
Daniel-Mietchen commented 4 years ago

Played around some more with the queries I got from Request a Query.

The query that counts occurrences of certain elements in a string is rather quick and can be used to get things like

The query that checks words in publication titles as to whether they exist as Wikidata forms and lexemes has lots of possibilities too, e.g.

Daniel-Mietchen commented 4 years ago

That latter query has a part involving a string made of many repetitions of " Z Z". I am thinking about replacing it with some representation as a binary number, Base32 or similar.

Daniel-Mietchen commented 4 years ago

I set up Q87572880, which has a numerical value that can be represented as a sequence of 50 repetitions of "10".

Daniel-Mietchen commented 4 years ago

Before setting up Q87572880, I tried to use Q87513951, whose binary representation has the same string as the decimal one of Q87572880.
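
For illustration, the relationship between the two numbers can be sketched in Python. This assumes the items hold exactly the values described in these comments (the decimal digits of Q87572880 and the binary digits of Q87513951 both being 50 repetitions of "10"); the values stored on the items themselves were not checked for this sketch.

```python
DIGITS = "10" * 50  # the 100-character string "1010...10"

# Value as described for Q87572880: the string read as a decimal number.
decimal_value = int(DIGITS)

# Value as described for Q87513951: the same string read as a binary number.
binary_value = int(DIGITS, 2)

# Both round-trip back to the same 100-character digit string.
assert str(decimal_value) == DIGITS
assert format(binary_value, "b") == DIGITS

# The binary reading has a closed form: sum of 2 * 4**k for k in 0..49.
assert binary_value == 2 * (4**50 - 1) // 3
```
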

Daniel-Mietchen commented 4 years ago

Here is the "Word cloud for topic" query from above, using the numeric value of Q87572880.

Daniel-Mietchen commented 4 years ago

A variant of the query for COVID-19.

Daniel-Mietchen commented 2 years ago

I received an inquiry regarding the purpose of the item Q87572880, as per its talk page.