fnielsen / ordia

Wikidata lexemes presentations
https://ordia.toolforge.org
Apache License 2.0

Language detection #109

Closed fnielsen closed 3 years ago

fnielsen commented 3 years ago

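Text-to-language language detection. A first query builds language-tagged literals for each input word, one per language code that has more than 100 lexemes: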

SELECT (GROUP_CONCAT(?mword; separator=" ") AS ?mwords) {
  BIND(1 AS ?dummy)
  VALUES ?word { "mod" "af" "individer" }
  {
    SELECT (COUNT(?lexeme) AS ?count) ?language_code {
      ?lexeme dct:language / wdt:P424 ?language_code .
    }
    GROUP BY ?language_code
    HAVING (?count > 100)
    ORDER BY DESC(?count)
  }
  BIND(CONCAT('"', ?word, '"@', ?language_code) AS ?mword)
}
GROUP BY ?dummy
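
For readers who would rather build the tagged literals client-side, the BIND(CONCAT(...)) step above can be approximated outside SPARQL. This is only a sketch: the function name `tagged_literals` is hypothetical, the language codes are assumed to be already known (for example from the inner subquery), and the words are assumed to contain no double quotes or backslashes.

```python
def tagged_literals(words, language_codes):
    """Pair every word with every language code as a SPARQL literal.

    Mirrors CONCAT('"', ?word, '"@', ?language_code) from the query above.
    Assumption: words contain no characters that need escaping.
    """
    return " ".join(
        '"' + word + '"@' + code
        for code in language_codes
        for word in words
    )


if __name__ == "__main__":
    # Prints one literal per (language code, word) pair, space-separated.
    print(tagged_literals(["mod", "af", "individer"], ["da", "sv"]))
```

The result has the same shape as the ?mwords string and can be pasted into a VALUES clause.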

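Followed by a second query that takes the generated literals as VALUES and counts the matching lexeme forms per language: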

SELECT (COUNT(?lexeme) AS ?count) ?language (GROUP_CONCAT(?word; separator=" ") AS ?words) {
  VALUES ?word { "af"@fr "mod"@fr "individer"@sv "af"@sv "mod"@sv "individer"@eu "af"@eu "mod"@eu "individer"@he "af"@he "mod"@he "individer"@la "af"@la "mod"@la "individer"@en "af"@en "mod"@en "individer"@ru "af"@ru "mod"@ru "individer"@eo "af"@eo "mod"@eo "individer"@ko "af"@ko "mod"@ko "individer"@bfi "af"@bfi "mod"@bfi "individer"@nl "af"@nl "mod"@nl "individer"@uk "af"@uk "mod"@uk "individer"@cy "af"@cy "mod"@cy "individer"@pt "af"@pt "mod"@pt "individer"@zh "af"@zh "mod"@zh "individer"@br "af"@br "mod"@br "individer"@bg "af"@bg "mod"@bg "individer"@ms "af"@ms "mod"@ms "individer"@tg "af"@tg "mod"@tg "individer"@se "af"@se "mod"@se "individer"@ta "af"@ta "mod"@ta "individer"@non "af"@non "mod"@non "individer"@it "af"@it "mod"@it "individer"@zh-min-nan "af"@zh-min-nan "mod"@zh-min-nan "individer"@nan "af"@nan "mod"@nan "individer"@fi "af"@fi "mod"@fi "individer"@jbo "af"@jbo "mod"@jbo "individer"@ml "af"@ml "mod"@ml "individer"@ja "af"@ja "mod"@ja "individer"@ku "af"@ku "mod"@ku "individer"@bn "af"@bn "mod"@bn "individer"@ar "af"@ar "mod"@ar "individer"@nb "af"@nb "mod"@nb "individer"@es "af"@es "mod"@es "individer"@pl "af"@pl "mod"@pl "individer"@nn "af"@nn "mod"@nn "individer"@sk "af"@sk "mod"@sk "individer"@da "af"@da "mod"@da "individer"@de "af"@de "mod"@de "individer"@cs "af"@cs "mod"@cs "individer"@fr }
  ?lexeme dct:language ?language ;
          ontolex:lexicalForm / ontolex:representation ?word .
}
GROUP BY ?language
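
The second step can also be sketched in client code: wrap the generated literals in the counting query, then rank the returned languages by count. The names `build_count_query` and `rank_languages` are hypothetical, the row shape `(language, count)` is an assumption about how results are parsed, and the `dct:` and `ontolex:` prefixes are assumed to be predefined by the endpoint. The GROUP_CONCAT column is left out for brevity.

```python
def build_count_query(tagged_values):
    """Return a SPARQL query counting matching lexeme forms per language.

    tagged_values: space-separated literals such as '"af"@da "mod"@da'.
    """
    return (
        "SELECT (COUNT(?lexeme) AS ?count) ?language {\n"
        "  VALUES ?word { " + tagged_values + " }\n"
        "  ?lexeme dct:language ?language ;\n"
        "          ontolex:lexicalForm / ontolex:representation ?word .\n"
        "}\n"
        "GROUP BY ?language"
    )


def rank_languages(rows):
    """Sort (language, count) rows so the best-matching language comes first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    print(build_count_query('"af"@da "mod"@da "individer"@da'))
    # Illustrative rows only, not real query results.
    print(rank_languages([("language-a", 1), ("language-b", 3)]))
```

Taking the top-ranked row as the detected language is one simple choice; normalizing the counts by lexicon size per language would be another.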

fnielsen commented 3 years ago

For the problem of matching literals regardless of their language tag without running into a timeout, see also https://stackoverflow.com/questions/40246175/sparql-matching-literals-with-any-language-tags-without-run-into-timeout/63414341#63414341

fnielsen commented 3 years ago

This is now running at https://ordia.toolforge.org/text-to-languages