EBISPOT / efo

Github repo for the Experimental Factor Ontology (EFO)
https://www.ebi.ac.uk/efo/

Some EFO Terms have multiple rdfs:label values #871

Closed: dhimmel closed this issue 3 years ago

dhimmel commented 4 years ago

A small number of EFO terms have multiple rdfs:label values:

Query

# SPARQL query for EFO terms with multiple rdfs:label values
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?efo_uri ?n_labels ?label_str ?language
WHERE {
  {
    SELECT ?efo_uri (COUNT(*) AS ?n_labels)
    WHERE {
      ?efo_uri rdf:type owl:Class.
      OPTIONAL {?efo_uri rdfs:label ?label}
    }
    GROUP BY ?efo_uri
    HAVING (?n_labels > 1)
  }
  ?efo_uri rdfs:label ?label .
  BIND(STR(?label) AS ?label_str)
  BIND(LANG(?label) AS ?language)
}
ORDER BY ?efo_uri ?label_str ?language

Query output on EFO v3.22.0

| ?efo_uri | ?n_labels | ?label_str | ?language |
|---|---|---|---|
| http://dbpedia.org/resource/China | 2 | China | |
| http://dbpedia.org/resource/China | 2 | China | en |
| http://dbpedia.org/resource/India | 2 | India | |
| http://dbpedia.org/resource/India | 2 | India | en |
| http://dbpedia.org/resource/Japan | 2 | Japan | |
| http://dbpedia.org/resource/Japan | 2 | Japan | en |
| http://dbpedia.org/resource/Philippines | 2 | Philippines | |
| http://dbpedia.org/resource/Philippines | 2 | Philippines | en |
| http://dbpedia.org/resource/Republic_of_Ireland | 2 | Republic of Ireland | |
| http://dbpedia.org/resource/Republic_of_Ireland | 2 | Republic of Ireland | en |
| http://purl.obolibrary.org/obo/CHEBI_17929 | 2 | N(omega),N(omega)-dimethyl-L-arginine | |
| http://purl.obolibrary.org/obo/CHEBI_17929 | 2 | asymmetric dimethylarginine | |
| http://purl.obolibrary.org/obo/CHEBI_25682 | 2 | N(omega),N'(omega)-dimethyl-L-arginine | |
| http://purl.obolibrary.org/obo/CHEBI_25682 | 2 | symmetric dimethylarginine | |
| http://purl.obolibrary.org/obo/CL_0000000 | 2 | cell | |
| http://purl.obolibrary.org/obo/CL_0000000 | 2 | cell | en |
| http://purl.obolibrary.org/obo/CL_0000540 | 2 | neuron | |
| http://purl.obolibrary.org/obo/CL_0000540 | 2 | neuron | en |
| http://purl.obolibrary.org/obo/GO_0008283 | 2 | cell population proliferation | |
| http://purl.obolibrary.org/obo/GO_0008283 | 2 | cell proliferation | |
| http://www.ebi.ac.uk/efo/EFO_1001870 | 2 | late-onset Alzheimer's disease | |
| http://www.ebi.ac.uk/efo/EFO_1001870 | 2 | late-onset Alzheimers disease | |
| http://www.orpha.net/ORDO/Orphanet_1020 | 2 | Early-onset autosomal dominant Alzheimer disease | |
| http://www.orpha.net/ORDO/Orphanet_1020 | 2 | Early-onset autosomal dominant Alzheimer's disease | |
| http://www.orpha.net/ORDO/Orphanet_137754 | 2 | Aminoacylase 1 deficiency | |
| http://www.orpha.net/ORDO/Orphanet_137754 | 2 | Neurological conditions associated with aminoacylase 1 deficiency | |
| http://www.orpha.net/ORDO/Orphanet_654 | 2 | Nephroblastoma | |
| http://www.orpha.net/ORDO/Orphanet_654 | 2 | Wilms' tumor | |

Problem?

As a result, SPARQL queries that show a label for each EFO term are prone to returning duplicate rows per term. Although perhaps users should always account for this possibility? From https://www.w3.org/2004/12/q/doc/rdf-labels.html:

RDF provides a mechanism for these short names by using the rdfs:label property. A component can have any number of rdfs:label property values, although it is STRONGLY recommended that they should be distinguished from each other using an xml:lang attribute and that there should be only one label per language.

As the output above shows, some duplicates go away once label and language are considered together. But for http://www.ebi.ac.uk/efo/EFO_1001870 and some others, there are two labels, both without a language specified.
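
To separate the two cases, a variant of the query above can group by term and language, so that only terms with more than one label in the same language (including the empty language) are reported. This is a sketch rather than a tested query against EFO; it assumes the same graph and the same owl:Class restriction as the original query.

```sparql
# Terms that have more than one rdfs:label within a single language tag.
# LANG() returns "" for labels without a language tag, so untagged labels
# are grouped together.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?efo_uri ?language (COUNT(DISTINCT ?label) AS ?n_labels)
WHERE {
  ?efo_uri a owl:Class ;
           rdfs:label ?label .
}
GROUP BY ?efo_uri (LANG(?label) AS ?language)
HAVING (COUNT(DISTINCT ?label) > 1)
ORDER BY ?efo_uri ?language
```

Going by the output above, this should drop terms such as http://dbpedia.org/resource/China (one untagged label, one "en" label) and keep terms such as http://www.ebi.ac.uk/efo/EFO_1001870. For queries that simply need one row per term, a possible workaround is to aggregate the labels. MIN is used here only to make the choice deterministic; it does not pick the "best" label.

```sparql
# One label string per term, chosen as the lexically smallest value.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?efo_uri (MIN(STR(?label)) AS ?label_str)
WHERE {
  ?efo_uri a owl:Class ;
           rdfs:label ?label .
}
GROUP BY ?efo_uri
```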

zoependlington commented 4 years ago

Thanks for finding this! We'll look into a way to increase our checks and limit duplicate labels.

zoependlington commented 3 years ago

Note to self: the CL duplications are coming from CL itself.

zoependlington commented 3 years ago

I have manually edited all of the terms that I could. It seems many of the terms in this list have an rdfs:label and a 'preferred label', which is coming up as a duplicate from your query. I wasn't able to replicate the results of the query, but I have run a ROBOT report and our QC tests, which did not highlight any further duplications.

dhimmel commented 3 years ago

Thanks @zoependlington for the work in https://github.com/EBISPOT/efo/commit/dedd1a0146fd0eff8f099b3195d2b51d2e4485d6 and https://github.com/obophenotype/cell-ontology/issues/841.

It seems many of the terms in this list have an rdfs:label and a 'preferred label', which is coming up as a duplicate from your query.

Interesting. The query matches rdfs:label. So does that predicate also match "preferred label" triples? How do you tell with SPARQL whether a label is an rdfs:label or a "preferred label"?
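
One way to check is to leave the predicate unbound and list every literal-valued property of a single term. Without inference, a triple pattern with rdfs:label matches only triples whose predicate is exactly rdfs:label; an endpoint that applies RDFS reasoning could also return values of its subproperties. The sketch below uses EFO_1001870 from the output above as an example. Which predicate EFO uses for the "preferred label" is not stated in this thread, so the query makes no assumption about it.

```sparql
# List each literal value attached to one term together with the predicate
# that carries it, so rdfs:label values can be told apart from other
# label-like annotations.
SELECT ?predicate ?value (LANG(?value) AS ?language)
WHERE {
  <http://www.ebi.ac.uk/efo/EFO_1001870> ?predicate ?value .
  FILTER(isLiteral(?value))
}
ORDER BY ?predicate ?value
```

If the two label strings appear under different predicates here, the duplicates in the original query would have to come from something other than the raw rdfs:label triples, such as inference or a differently built file.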