acoli-repo / acoli-dicts

3000+ machine-readable open source dictionaries distributed by the Applied Computational Linguistics lab at the University of Augsburg, Germany, and by the research group Linked Open Dictionaries (LiODi, funded 2015-2020 by BMBF at Goethe University Frankfurt, Germany). All data provided in OntoLex-Lemon and TIAD-TSV.
Apache License 2.0

Apertium RDF - duplicated entries #8

Open jogracia opened 3 years ago

jogracia commented 3 years ago

The following query retrieves information about "abrupt"@en. It returns (wrongly, I guess) two different URIs for the corresponding lexical entry:

PREFIX ontolex: <>
PREFIX dc: <>
PREFIX vartrans: <>
PREFIX lime: <>
PREFIX lexinfo: <>
PREFIX rdfs: <>

SELECT DISTINCT ?lex_entry ?pos ?source ?lexicon ?language ?written_rep
WHERE {
   ?lex_entry ontolex:lexicalForm ?lemon_form ;
      lexinfo:partOfSpeech ?pos ;
      dc:source ?source .

   ?lemon_form ontolex:writtenRep "abrupt"@en ;
      ontolex:writtenRep ?written_rep .

   ?lexicon lime:entry ?lex_entry ;
      lime:language ?language .
}


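|   | lex_entry | pos | source | lexicon | language | written_rep |
|---|-----------|-----|--------|---------|----------|-------------|
| 1 |           | lexinfo:adjective |  |  | "en" | "abrupt"@en |
| 2 |           | lexinfo:adjective |  |  | "en" | "abrupt"@en |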

Note the two different URIs that represent the same entity:

Why is this happening?

jogracia commented 3 years ago

Adding here a preliminary answer by Max Ionov (22/5/20):

This is not so much about duplicate URIs as about the same word appearing in different lexicons. More specifically, the first “abrupt” comes from LexiconEN from either the EN-ES or EN-KK dictionaries. The second one comes from EN-CA, and it is connected to that dictionary's strange tagset.
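To make the distinction concrete, here is a minimal Python sketch of how one might check whether the rows are real duplicates or just the same word in different lexicons. The `example.org` URIs, the tag suffixes and the helper names are all invented for illustration; they are not the actual Apertium RDF identifiers.

```python
from collections import defaultdict

# Hypothetical rows shaped like the query result: (entry URI, lexicon URI, written form).
# The example.org URIs and the tag suffixes are made up for illustration only.
ROWS = [
    ("http://example.org/lexiconEN-a/abrupt-adj-en", "http://example.org/lexiconEN-a", "abrupt"),
    ("http://example.org/lexiconEN-b/abrupt-xadj-en", "http://example.org/lexiconEN-b", "abrupt"),
]


def entries_by_form(rows):
    """Map each written form to the set of (lexicon, entry) pairs that carry it."""
    grouped = defaultdict(set)
    for entry, lexicon, form in rows:
        grouped[form].add((lexicon, entry))
    return grouped


def is_true_duplicate(pairs):
    """Treat a form as duplicated only if a single lexicon holds several entries for it."""
    lexicons = [lexicon for lexicon, _ in pairs]
    return len(lexicons) != len(set(lexicons))


grouped = entries_by_form(ROWS)
print(is_true_duplicate(grouped["abrupt"]))
```

Under these assumptions, each lexicon contributes one entry, so the two rows would not count as a duplicate within a single lexicon.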

This problem actually unearths several issues (which were already known): (a) converting the part-of-speech part of URIs to UD or another standardized tagset, and (b) providing metadata about the dictionary from which each lexicon comes.
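As a rough illustration of point (a), the sketch below rewrites the part-of-speech segment of an entry URI to a UD tag. It assumes a URI layout of `<base>/<lemma>-<tag>-<lang>` and a hand-written tag mapping. Both are assumptions made for the example, as are the source tags `adj` and `xadj`; only the target `ADJ` is an actual UD tag.

```python
# Hypothetical mapping from dictionary-specific tags to UD tags.
# The source tags are invented; "ADJ" is the real UD tag for adjectives.
TAG_TO_UD = {"adj": "ADJ", "xadj": "ADJ"}


def normalize_entry_uri(uri):
    """Rewrite '<base>/<lemma>-<tag>-<lang>' so the tag segment uses a UD tag.

    URIs that do not match the assumed layout, or whose tag is unknown,
    are returned unchanged.
    """
    base, _, local = uri.rpartition("/")
    parts = local.rsplit("-", 2)
    if len(parts) != 3:
        return uri
    lemma, tag, lang = parts
    ud_tag = TAG_TO_UD.get(tag)
    if ud_tag is None:
        return uri
    return f"{base}/{lemma}-{ud_tag}-{lang}"


print(normalize_entry_uri("http://example.org/lexiconEN-b/abrupt-xadj-en"))
```

Even with normalized tags, the two entries would still sit in different lexicons, which is why point (b), metadata about the source dictionary, is needed as well.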