Okay, this one is a generic query:
SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  {
    SELECT DISTINCT ?item WHERE {
      {
        ?item p:P1585 ?statement0.
        ?statement0 (ps:P1585) _:anyValueP1585.
        #FILTER(EXISTS { ?statement0 prov:wasDerivedFrom ?reference. })
      }
    }
  }
}
However, we already query the human languages (but we can work around it). Maybe this feature will be somewhat hardcoded, because implementing it at full potential would also mean reading the 1603_1_7 and "understanding" what each P means.
Great. We managed to use a pre-processor to create the queries (--lingua-divisioni=18 and --lingua-paginae=1 are the paginators). I think the Brazilian cities may be one of those codices with over 200 languages.
fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ printf "P1585\n" | ./999999999/0/1603_3_12.py --actionem-sparql --de=P --query --lingua-divisioni=18 --lingua-paginae=1
SELECT (STRAFTER(STR(?item), "entity/") AS ?item__conceptum__codicem) ?item__rem__i_ara__is_arab ?item__rem__i_hye__is_armn ?item__rem__i_ben__is_beng ?item__rem__i_rus__is_cyrl ?item__rem__i_hin__is_deva ?item__rem__i_amh__is_ethi ?item__rem__i_kat__is_geor ?item__rem__i_grc__is_grek ?item__rem__i_guj__is_gujr ?item__rem__i_pan__is_guru ?item__rem__i_kan__is_knda ?item__rem__i_kor__is_hang ?item__rem__i_lzh__is_hant ?item__rem__i_heb__is_hebr ?item__rem__i_khm__is_khmr WHERE {
{
SELECT DISTINCT ?item WHERE {
?item p:P1585 ?statement0.
?statement0 (ps:P1585 ) _:anyValueP1585 .
}
}
OPTIONAL { ?item rdfs:label ?item__rem__i_ara__is_arab filter (lang(?item__rem__i_ara__is_arab) = "ar"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_hye__is_armn filter (lang(?item__rem__i_hye__is_armn) = "hy"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_ben__is_beng filter (lang(?item__rem__i_ben__is_beng) = "bn"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_rus__is_cyrl filter (lang(?item__rem__i_rus__is_cyrl) = "ru"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_hin__is_deva filter (lang(?item__rem__i_hin__is_deva) = "hi"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_amh__is_ethi filter (lang(?item__rem__i_amh__is_ethi) = "am"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_kat__is_geor filter (lang(?item__rem__i_kat__is_geor) = "ka"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_grc__is_grek filter (lang(?item__rem__i_grc__is_grek) = "grc"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_guj__is_gujr filter (lang(?item__rem__i_guj__is_gujr) = "gu"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_pan__is_guru filter (lang(?item__rem__i_pan__is_guru) = "pa"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_kan__is_knda filter (lang(?item__rem__i_kan__is_knda) = "kn"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_kor__is_hang filter (lang(?item__rem__i_kor__is_hang) = "ko"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_lzh__is_hant filter (lang(?item__rem__i_lzh__is_hant) = "lzh"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_heb__is_hebr filter (lang(?item__rem__i_heb__is_hebr) = "he"). }
OPTIONAL { ?item rdfs:label ?item__rem__i_khm__is_khmr filter (lang(?item__rem__i_khm__is_khmr) = "km"). }
bind(xsd:integer(strafter(str(?item), 'Q')) as ?id_numeric) .
}
ORDER BY ASC (?id_numeric)
Queries with so many items vary a lot in runtime. Even with pagination, they sometimes go over 40 seconds (but around 5 seconds if cached). So I think we will definitely need some rudimentary way, in the bash functions, to check whether a query timed out and adjust the timing before trying again.
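A minimal sketch of what such a retry helper could look like, assuming the SPARQL is sent to the public WDQS endpoint with curl and that any HTTP error (including a server-side timeout) simply triggers a wait-and-retry; the function and variable names here are hypothetical, not the ones in the repository:

```bash
# Hypothetical helper; a sketch only, not the project's actual bash code.
WDQS_ENDPOINT="https://query.wikidata.org/sparql"

# wdqs_query_with_retry <sparql-file> <output-csv>
# Runs the query, and on failure waits progressively longer before retrying,
# so transient WDQS timeouts do not abort an entire paginated batch.
wdqs_query_with_retry() {
  local sparql_file="$1"
  local output_file="$2"
  local max_attempts=3
  local wait_seconds=10

  for attempt in $(seq 1 "$max_attempts"); do
    if curl --silent --fail --get "$WDQS_ENDPOINT" \
        --header 'Accept: text/csv' \
        --data-urlencode "query@${sparql_file}" \
        --output "$output_file"; then
      return 0
    fi
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${wait_seconds}s" >&2
    sleep "$wait_seconds"
    wait_seconds=$((wait_seconds * 2))  # simple exponential backoff
  done
  return 1
}
```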
However, we may need to create more than one query, because this strategy (merging with the datasets) would require already knowing upfront which Wikidata Q is linked to which IBGE code.
fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ printf "P1585\n" | ./999999999/0/1603_3_12.py --actionem-sparql --de=P --query --ex-interlinguis
SELECT (?wikidata_p_value AS ?item__conceptum__codicem) (STRAFTER(STR(?item), "entity/") AS ?item__rem__i_qcc__is_zxxx__ix_wikiq) WHERE {
{
SELECT DISTINCT ?item WHERE {
?item p:P1585 ?statement0.
?statement0 (ps:P1585 ) _:anyValueP1585 .
}
}
?item wdt:P1585 ?wikidata_p_value .
}
ORDER BY ASC (?wikidata_p_value)
We have not even merged the 17 of 20 language pages yet and the file size is already 1.6 MB. No idea how big this will be with all languages.
Hmm... now the issue is doing heavy optimization on the queries to mitigate the timeouts. The ones with over 5,000 concepts are the problem here, even when splitting the 1603_1_51 languages into 20 parts.
Maybe one strategy would be to allow removing the ORDER BY on the queries which deal with languages and do the sorting client-side (i.e. sort with bash / HXL CLI tools).
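As an illustration of the client-side sorting, assuming the paginated results are plain CSV with a single header row and the sort key in the first column (both assumptions; the HXL CLI tools could do the same job), a bash version could be:

```bash
# Sketch: sort a query result client-side instead of using ORDER BY on WDQS.
# Assumes plain CSV, one header row, and the sort key in the first column.
sort_csv_client_side() {
  local input_csv="$1"
  local output_csv="$2"
  head -n 1 "$input_csv" > "$output_csv"
  tail -n +2 "$input_csv" | sort --field-separator=',' --key=1,1 >> "$output_csv"
}
```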
The bash helper, while it still needs more testing, already somewhat deals with retrying. For something such as P1585 it now uses 1 + 20 queries.
However, later we obviously should get the data from primary sources (in the case of IBGE, I think https://servicodados.ibge.gov.br/api/docs/localidades does it) and use that as the primary reference, potentially validating the information from Wikidata.
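As a rough illustration of that validation step, assuming the v1 endpoint https://servicodados.ibge.gov.br/api/v1/localidades/municipios (part of the API documented at the link above) and that jq is available, the IBGE codes could be diffed against the first column of the --ex-interlinguis output:

```bash
# Sketch: cross-check municipality codes from the IBGE API (primary source)
# against the codes extracted from Wikidata via P1585.
# The endpoint path and the CSV column position are assumptions.
ibge_vs_wikidata_codes() {
  local wikidata_csv="$1"  # e.g. the result of the --ex-interlinguis query
  curl --silent 'https://servicodados.ibge.gov.br/api/v1/localidades/municipios' \
    | jq --raw-output '.[].id' | sort > /tmp/ibge_codes.txt
  tail -n +2 "$wikidata_csv" | cut -d',' -f1 | sort > /tmp/wikidata_codes.txt
  # Codes present in only one of the two sources deserve manual review.
  comm -3 /tmp/ibge_codes.txt /tmp/wikidata_codes.txt
}
```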
It turns out that we can start bootstrapping other tables (the ones already perfect on Wikidata) the same way as was done with the IBGE municipalities, including translations into several languages!
However, the same ideal approaches (such as relying on primary sources, then incrementing with Wikidata) would somewhat apply here too. Sometimes this may not be really relevant: for example, for something not strictly a place (like P6555, //identificador de Unidade Eleitoral brasileira//@por-Latn, i.e. the Brazilian electoral unit identifier), it would not really make sense to just print the municipalities for the end user.
Also, eventually we will need to think of this somewhat as an ontology, otherwise #41 would not be as efficient for general users.
Already implemented and used in practice. Closing for now.
One item from https://github.com/EticaAI/lexicographi-sine-finibus/issues/39, the
P1585 https://www.wikidata.org/wiki/Property:P1585 //Dicionários de bases de dados espaciais do Brasil//@por-Latn (dictionaries of spatial databases of Brazil),
is actually very well documented on Wikidata, so we would not need to fetch the Wikidata Qs one by one. It is a rare case of something so perfect, but the idea here would be to create an additional option on ./999999999/0/1603_3_12.py to create the SPARQL query for us.
This obviously will need pagination. If with ~300 Wikidata Qs we already time out with over 250 languages on 1603_1_51 (for now using 5 batches), then with something with 5,700 items, well, this will be fun.
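For reference, one hypothetical way such an option could paginate over the items themselves (the existing --lingua-divisioni / --lingua-paginae options paginate over the language columns) is a LIMIT/OFFSET on the item-listing subquery. A bash sketch of the kind of query it could emit; the real option would live in ./999999999/0/1603_3_12.py and may paginate differently:

```bash
# Sketch only: emits a paginated "list the items" query for a given property.
generate_item_list_sparql() {
  local property="$1"   # e.g. P1585
  local page_size="$2"  # e.g. 1000
  local page="$3"       # 0-based page number
  cat <<SPARQL
SELECT DISTINCT ?item WHERE {
  ?item p:${property} ?statement0.
  ?statement0 ps:${property} _:anyValue.
}
ORDER BY ?item
LIMIT ${page_size}
OFFSET $(( page * page_size ))
SPARQL
}

# Usage: generate_item_list_sparql P1585 1000 0 > /tmp/p1585_page0.rq
```

The ORDER BY on the inner item list is what keeps LIMIT/OFFSET pages stable between requests; the earlier point about dropping ORDER BY applies to the heavy outer queries with the per-language labels, not to this small subquery.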