acoli-repo / acoli-dicts

3000+ machine-readable open source dictionaries distributed by the Applied Computational Linguistics lab at the University of Augsburg, Germany, and by the research group Linked Open Dictionaries (LiODi, funded 2015-2020 by BMBF at Goethe University Frankfurt, Germany). All data provided in OntoLex-Lemon and TIAD-TSV.
Apache License 2.0

Apertium RDF - duplicated entries #8

Open jogracia opened 3 years ago

jogracia commented 3 years ago

The following query retrieves information about "abrupt"@en. It returns (wrongly, I guess) two different URIs for the corresponding lexical entry:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX vartrans: <http://www.w3.org/ns/lemon/vartrans#>
PREFIX lime: <http://www.w3.org/ns/lemon/lime#>
PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?lex_entry ?pos ?source ?lexicon ?language ?written_rep
FROM <http://linguistic.linkeddata.es/id/apertium-lexinfo/>
WHERE {
   ?lex_entry ontolex:lexicalForm ?lemon_form ;
      lexinfo:partOfSpeech ?pos ;
      dc:source ?source .

   ?lemon_form ontolex:writtenRep "abrupt"@en ;
      ontolex:writtenRep ?written_rep .

   ?lexicon lime:entry ?lex_entry ;
      lime:language ?language  .

}

Result:

| # | lex_entry | pos | source | lexicon | language | written_rep |
|---|-----------|-----|--------|---------|----------|-------------|
| 1 | http://linguistic.linkeddata.es/id/apertium/lexiconEN/abrupt-en | lexinfo:adjective | https://github.com/apertium/apertium-trunk.git | http://linguistic.linkeddata.es/id/apertium/lexiconEN | "en" | "abrupt"@en |
| 2 | http://linguistic.linkeddata.es/id/apertium/lexiconEN/abrupt-adj-en | lexinfo:adjective | https://github.com/apertium/apertium-trunk.git | http://linguistic.linkeddata.es/id/apertium/lexiconEN | "en" | "abrupt"@en |

Note that two URIs represent the same entity:

http://linguistic.linkeddata.es/id/apertium/lexiconEN/abrupt-en
http://linguistic.linkeddata.es/id/apertium/lexiconEN/abrupt-adj-en
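To check whether other entries are affected in the same way, the result rows can be grouped by written representation, part of speech and lexicon, flagging any group with more than one entry URI. This is only a minimal sketch in Python: the `rows` list is hand-copied from the result above, and `find_duplicates` is a hypothetical helper, not part of any Apertium RDF tooling.

```python
from collections import defaultdict

LEXICON = "http://linguistic.linkeddata.es/id/apertium/lexiconEN"

# Rows copied by hand from the query result in this issue.
rows = [
    {"lex_entry": LEXICON + "/abrupt-en", "pos": "lexinfo:adjective",
     "lexicon": LEXICON, "written_rep": '"abrupt"@en'},
    {"lex_entry": LEXICON + "/abrupt-adj-en", "pos": "lexinfo:adjective",
     "lexicon": LEXICON, "written_rep": '"abrupt"@en'},
]


def find_duplicates(rows):
    """Return groups of entry URIs that share written form, POS and lexicon."""
    groups = defaultdict(set)
    for row in rows:
        key = (row["written_rep"], row["pos"], row["lexicon"])
        groups[key].add(row["lex_entry"])
    # Keep only the groups that contain more than one distinct entry URI.
    return {key: sorted(uris) for key, uris in groups.items() if len(uris) > 1}


duplicates = find_duplicates(rows)
for key, uris in duplicates.items():
    print(key[0], key[1], "->", len(uris), "URIs")
    for uri in uris:
        print("   ", uri)
```

Running the same grouping over a full export of the graph would show how widespread the pattern is.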

Why is this happening?

jogracia commented 3 years ago

Adding here a preliminary answer by Max Ionov (22/5/20):

This is not so much about duplicate URIs as about the same word coming from different lexicons. More specifically, the first “abrupt” comes from LexiconEN as built from either the EN-ES or the EN-KK dictionary. The second one comes from EN-CA, and the difference is connected to that dictionary's strange tagset.

This problem actually brings up several issues (which were already known): (a) converting the part-of-speech component of URIs to UD or another standardised tagset, and (b) providing metadata about the dictionary from which each lexicon comes.
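To make points (a) and (b) concrete, here is a rough sketch in Python of what such a normalisation could look like. Everything in it is an assumption for illustration: the `lemma-TAG-lang` URI pattern, the `LEXINFO_TO_UD` mapping, the `canonical_uri` helper and the way the dictionary labels are recorded are hypothetical, and the actual conversion pipeline may work quite differently. The dictionary labels simply repeat what is said above (the first entry is from EN-ES or EN-KK, the second from EN-CA).

```python
BASE = "http://linguistic.linkeddata.es/id/apertium/lexiconEN/"

# Hypothetical mapping from lexinfo part-of-speech names to UD tags.
LEXINFO_TO_UD = {
    "adjective": "ADJ",
    "noun": "NOUN",
    "verb": "VERB",
    "adverb": "ADV",
}


def canonical_uri(lemma, lexinfo_pos, lang, base=BASE):
    """Build one URI per (lemma, POS, language), using a UD tag (assumed pattern)."""
    return f"{base}{lemma}-{LEXINFO_TO_UD[lexinfo_pos]}-{lang}"


# (current URI, lemma, lexinfo POS, language, source dictionary label)
entries = [
    (BASE + "abrupt-en", "abrupt", "adjective", "en", "EN-ES or EN-KK"),
    (BASE + "abrupt-adj-en", "abrupt", "adjective", "en", "EN-CA"),
]

merged = {}
for old_uri, lemma, pos, lang, dictionary in entries:
    record = merged.setdefault(
        canonical_uri(lemma, pos, lang),
        {"old_uris": [], "dictionaries": []},
    )
    record["old_uris"].append(old_uri)          # (a) both old URIs map to one entry
    record["dictionaries"].append(dictionary)   # (b) provenance is kept as metadata

for uri, record in merged.items():
    print(uri, record)
```

With the tag taken from the entry's part of speech rather than from each dictionary's own tagset, both rows collapse into a single entry while the originating dictionaries stay visible.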