Open Daniel-Mietchen opened 4 years ago
X could also be something other than what was suggested above, perhaps with some pre-processing to enhance performance and avoid copyright issues, e.g.
I took the "Extraction" query from https://tools.wmflabs.org/ordia/text-to-lexemes and combined it with the title word extraction query from https://www.wikidata.org/wiki/Wikidata:University_of_Virginia/Listeria/UVa_people/Common_words_in_titles_of_UVA-coauthored_publications_without_P921_(main_subject)_statement .
I could not get the lexeme part to work yet, but the basic parts of the pipeline work with this query:
SELECT DISTINCT
?word ?wordUrl
?form ?formLabel
?lexeme ?lexemeLabel
?lexical_category ?lexical_categoryLabel
(GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
?sense ?senseLabel
(IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=",
SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
{
SELECT DISTINCT ?x ?title WHERE {
?x wdt:P921 wd:Q202864 ;
wdt:P1476 ?title.
FILTER(STRLEN(?title) >= 6)
}
LIMIT 100
}
BIND(LCASE(?title) AS ?ltitle)
BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
VALUES ?w_ { 1 2 3 }
BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles
FILTER (LANG(?word) = "en")
OPTIONAL {
?form ontolex:representation ?word .
OPTIONAL {
?form wikibase:grammaticalFeature ?feature .
BIND(STR(?feature) AS ?default_featureLabel)
OPTIONAL {
?feature rdfs:label ?featureLabel_ .
FILTER (LANG(?featureLabel_) = "en")
}
BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
}
?form ontolex:representation ?formLabel .
}
BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)
}
GROUP BY
?word ?wordUrl ?form ?formLabel
?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
?sense ?senseLabel
ORDER BY ?word
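For readers less used to nested REPLACE/STRAFTER calls, here is a rough Python sketch of what the word-extraction step in the query above is aiming for: up to three lowercase words of six or more characters per title. The function name `extract_words` and the sample title are made up for illustration, and `re.findall` is a simplification rather than an exact translation of the SPARQL, so edge cases (e.g. a word repeated within one title) may come out differently.

```python
import re

# Same idea as \\b\\w{6,}\\b in the query: a whole word of 6+ word characters.
LONG_WORD = re.compile(r"\b\w{6,}\b")


def extract_words(title, max_words=3):
    """Return up to max_words long words from a lowercased title.

    Hypothetical helper; the query does this with ?w1, ?w2 and ?w3.
    """
    ltitle = title.lower()  # corresponds to BIND(LCASE(?title) AS ?ltitle)
    return LONG_WORD.findall(ltitle)[:max_words]


if __name__ == "__main__":
    # Made-up sample title, not taken from Wikidata.
    print(extract_words("Zika virus infection in pregnant women"))
```

Titles without any word of six or more characters simply give an empty list here, which plays the role of the `FILTER(REGEX(?word, "^\\w+$"))` line in the query.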
The full query (i.e. with the lexeme part) that I tried and could not get to work is here:
SELECT DISTINCT
?word ?wordUrl
?form ?formLabel
?lexeme ?lexemeLabel
?lexical_category ?lexical_categoryLabel
(GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
?sense ?senseLabel
(IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=",
SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
{
SELECT DISTINCT ?x ?title WHERE {
?x wdt:P921 wd:Q202864 ;
wdt:P1476 ?title.
FILTER(STRLEN(?title) >= 6)
}
LIMIT 100
}
BIND(LCASE(?title) AS ?ltitle)
BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
VALUES ?w_ { 1 2 3 }
BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles
FILTER (LANG(?word) = "en")
OPTIONAL {
?form ontolex:representation ?word .
OPTIONAL {
?form wikibase:grammaticalFeature ?feature .
BIND(STR(?feature) AS ?default_featureLabel)
OPTIONAL {
?feature rdfs:label ?featureLabel_ .
FILTER (LANG(?featureLabel_) = "en")
}
BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
}
?form ontolex:representation ?formLabel .
?lexeme wikibase:lexicalCategory ?lexical_category .
BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
OPTIONAL {
?lexical_category rdfs:label ?lexical_categoryLabel_ .
FILTER (LANG(?lexical_categoryLabel_) = 'en')
}
BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS
?lexical_categoryLabel)
?lexeme ontolex:lexicalForm ?form .
?lexeme wikibase:lemma ?lexemeLabel .
OPTIONAL {
?lexeme ontolex:sense ?sense .
BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
OPTIONAL {
?sense wdt:P18 ?images .
}
}
}
BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)
}
GROUP BY
?word ?wordUrl ?form ?formLabel
?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
?sense ?senseLabel
ORDER BY ?word
Got some great responses there. While not precisely what I was after, I think they can help in other ways.
Finally got the problematic query to work by making some more clauses OPTIONAL:
################
# Checking whether strings from the titles of publications already exist as lexemes
# The query has three parts:
# I - get a list of publications on a given topic
# II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################
SELECT DISTINCT
?word ?wordUrl
?form ?formLabel
?lexeme ?lexemeLabel
?lexical_category ?lexical_categoryLabel
(GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
?sense ?senseLabel
(IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=",
SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
# I - get a list of publications on a given topic
{
SELECT DISTINCT ?x ?title WHERE {
?x wdt:P921 wd:Q202864 ; # Zika virus
wdt:P1476 ?title.
FILTER(STRLEN(?title) >= 6)
}
LIMIT 200
}
# II - extract strings from the titles
BIND(LCASE(?title) AS ?ltitle)
BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
VALUES ?w_ { 1 2 3 }
BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles
FILTER (LANG(?word) = "en")
# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes
OPTIONAL {
?form ontolex:representation ?word .
OPTIONAL {
?form wikibase:grammaticalFeature ?feature .
BIND(STR(?feature) AS ?default_featureLabel)
OPTIONAL {
?feature rdfs:label ?featureLabel_ .
FILTER (LANG(?featureLabel_) = "en")
}
BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
}
?form ontolex:representation ?formLabel .
OPTIONAL { ?lexeme ontolex:lexicalForm ?form . }
OPTIONAL { ?lexeme wikibase:lexicalCategory ?lexical_category .
BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
OPTIONAL {
?lexical_category rdfs:label ?lexical_categoryLabel_ .
FILTER (LANG(?lexical_categoryLabel_) = 'en')
}
BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS
?lexical_categoryLabel)
?lexeme wikibase:lemma ?lexemeLabel .
}
OPTIONAL {
?lexeme ontolex:sense ?sense .
BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
OPTIONAL {
?sense wdt:P18 ?images .
}
}
}
BIND(IF(BOUND(?form), "", CONCAT("search?language=en&q=", ?word)) AS ?wordUrl)
}
GROUP BY
?word ?wordUrl ?form ?formLabel
?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
?sense ?senseLabel
ORDER BY ?word
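As a reminder of what wrapping a clause in OPTIONAL changes in general, here is a small Python sketch contrasting a required pattern (inner join) with an OPTIONAL one (left join). It does not diagnose why the earlier version of the query failed; the function names and the toy data (`F1`, `F2`, `L1`) are invented for illustration.

```python
def inner_join(rows, lookup, key, new_key):
    """Required pattern: rows without a match are dropped."""
    return [{**row, new_key: m} for row in rows for m in lookup.get(row[key], [])]


def left_join(rows, lookup, key, new_key):
    """OPTIONAL pattern: every row is kept; new_key is only set on a match."""
    out = []
    for row in rows:
        matches = lookup.get(row[key], [])
        if not matches:
            out.append(dict(row))  # keep the row, leave new_key unbound
        for m in matches:
            out.append({**row, new_key: m})
    return out


if __name__ == "__main__":
    forms = [{"form": "F1"}, {"form": "F2"}]  # toy stand-ins for ?form bindings
    lexeme_of = {"F1": ["L1"]}                # only F1 has a known lexeme
    print(inner_join(forms, lexeme_of, "form", "lexeme"))
    print(left_join(forms, lexeme_of, "form", "lexeme"))
```

With the required pattern the `F2` row disappears; with the OPTIONAL one it stays, just without a `lexeme` value.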
For our use cases, it might actually make sense to split the functionality into two queries, one for existing lexemes and one for non-existing ones. Both are included below:
################
# Checking whether strings from the titles of publications already exist as lexemes
# The query has three parts:
# I - get a list of publications on a given topic
# II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################
SELECT DISTINCT
?word
?form ?formLabel
?lexeme ?lexemeLabel
?lexical_category ?lexical_categoryLabel
(GROUP_CONCAT(DISTINCT ?featureLabel; separator=" // ") AS ?features)
?sense ?senseLabel
(IRI(CONCAT("https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file&width=100&wpvalue=",
SUBSTR(STR(SAMPLE(?images)), 52))) AS ?sense_image)
WHERE {
# I - get a list of publications on a given topic
{
SELECT DISTINCT ?x ?title WHERE {
?x wdt:P921 wd:Q202864 ; # Zika virus
wdt:P1476 ?title.
FILTER(STRLEN(?title) >= 6)
}
LIMIT 200
}
# II - extract strings from the titles
BIND(LCASE(?title) AS ?ltitle)
BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
VALUES ?w_ { 1 2 3 }
BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles
FILTER (LANG(?word) = "en")
# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes
OPTIONAL {
?form ontolex:representation ?word .
OPTIONAL {
?form wikibase:grammaticalFeature ?feature .
BIND(STR(?feature) AS ?default_featureLabel)
OPTIONAL {
?feature rdfs:label ?featureLabel_ .
FILTER (LANG(?featureLabel_) = "en")
}
BIND(COALESCE(?featureLabel_, ?default_featureLabel) AS ?featureLabel)
}
?form ontolex:representation ?formLabel .
OPTIONAL { ?lexeme ontolex:lexicalForm ?form . }
OPTIONAL { ?lexeme wikibase:lexicalCategory ?lexical_category .
BIND(STR(?lexical_category) AS ?default_lexical_categoryLabel)
OPTIONAL {
?lexical_category rdfs:label ?lexical_categoryLabel_ .
FILTER (LANG(?lexical_categoryLabel_) = 'en')
}
BIND(COALESCE(?lexical_categoryLabel_, ?default_lexical_categoryLabel) AS
?lexical_categoryLabel)
?lexeme wikibase:lemma ?lexemeLabel .
}
OPTIONAL {
?lexeme ontolex:sense ?sense .
BIND(SUBSTR(STR(?sense), 32) AS ?senseLabel)
OPTIONAL {
?sense wdt:P18 ?images .
}
}
}
FILTER(BOUND(?form))
}
GROUP BY
?word ?form ?formLabel
?lexeme ?lexemeLabel ?lexical_category ?lexical_categoryLabel
?sense ?senseLabel
ORDER BY ?word
################
# Listing strings from the titles of publications that do not yet match any Wikidata lexeme form
# The query has three parts:
# I - get a list of publications on a given topic
# II - extract strings from the titles
# III - check whether these strings exist as Wikidata lexemes
################
SELECT DISTINCT
?word ?wordUrl
WHERE {
# I - get a list of publications on a given topic
{
SELECT DISTINCT ?x ?title WHERE {
?x wdt:P921 wd:Q202864 ; # Zika virus
wdt:P1476 ?title.
FILTER(STRLEN(?title) >= 6)
}
LIMIT 200
}
# II - extract strings from the titles
BIND(LCASE(?title) AS ?ltitle)
BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
VALUES ?w_ { 1 2 3 }
BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, ?w3)) AS ?word)
FILTER(REGEX(?word, "^\\w+$")) # since ?word may evaluate to an empty string, e.g. for one-word titles
FILTER (LANG(?word) = "en")
# III - check whether these strings exist as Wikidata lexemes
# This part is taken from https://tools.wmflabs.org/ordia/text-to-lexemes
OPTIONAL {
?form ontolex:representation ?word .
}
FILTER(!BOUND(?form))
BIND(CONCAT("search?language=en&q=", ?word) AS ?wordUrl)
}
GROUP BY
?word ?wordUrl
ORDER BY ?word
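The split into "existing" and "non-existing" can be sketched outside SPARQL as well. The following Python snippet is only an illustration under assumptions: `known_forms` stands in for whatever `ontolex:representation` lookup returns, `split_words` is a hypothetical helper, and the URL-encoding via `quote` is an addition that the query itself does not perform (it uses a plain CONCAT).

```python
from urllib.parse import quote

SEARCH_PREFIX = "search?language=en&q="  # same relative URL prefix as in the query


def split_words(words, known_forms):
    """Split extracted words into those with a known form and those without.

    Returns (existing, missing), where missing maps each word to a search URL.
    """
    known = set(known_forms)
    unique = sorted(set(words))
    existing = [w for w in unique if w in known]
    missing = {w: SEARCH_PREFIX + quote(w) for w in unique if w not in known}
    return existing, missing


if __name__ == "__main__":
    # Toy input; real data would come from the two queries above.
    existing, missing = split_words(["infection", "zikv", "infection"], {"infection"})
    print(existing)
    print(missing)
```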
Played around some more with the queries I got from Request a Query.
The query that counts occurrences of certain elements in a string is rather quick and can be used to get things like
The query that checks words in publication titles as to whether they exist as Wikidata forms and lexemes has lots of possibilities too, e.g.
That latter query has a part involving a string made of repeated " Z Z" units. I am thinking about replacing it with a representation as a binary number, Base32 or similar.
I set up Q87572880 which has a numerical value that can be represented as a sequence of 50 repetitions of "10".
Before setting up Q87572880, I tried to use Q87513951, whose binary representation has the same string as the decimal one of Q87572880.
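To make the two readings of the "10" pattern concrete, here is a short Python sketch. It only computes the numbers from the description above (decimal reading for Q87572880, binary reading for Q87513951); nothing is fetched from Wikidata, and the variable names are illustrative.

```python
PATTERN = "10" * 50  # "1010...10", 100 characters

decimal_value = int(PATTERN)    # the digits read as a base-10 number
binary_value = int(PATTERN, 2)  # the same digits read as a base-2 number

# Both numbers print back to the same digit string in their respective base.
assert str(decimal_value) == PATTERN
assert format(binary_value, "b") == PATTERN

if __name__ == "__main__":
    print(len(PATTERN))
    print(binary_value)
```

The binary reading is a far smaller number than the decimal one, even though both share the same 100-character digit string.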
Here is the "Word cloud for topic" query from above, using the numeric value of Q87572880.
A variant of the query for COVID-19.
I received an inquiry regarding the purpose of the item Q87572880, as per its talk page.
What I have in mind here is things like checking whether all the lexemes in X are in Wikidata, where X might be a list of