geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

taxon IDs and names #3158

Closed: gocentral closed 9 years ago

gocentral commented 19 years ago

E-mail from user Waclaw Kusnierczyk:

1) There is a mistake in the taxon id (NCBI taxon id) in terms associated with the taxon Tracheophyta (terms GO:0010051 and GO:0010087). The taxon id is 8023, should be 58023.

Here is an SQL patch:

update term_definition set term_definition := replace(term_definition, 'ncbi_taxonomy_id:8023', 'ncbi_taxonomy_id:58023') where term_definition regexp 'Tracheophyta';

(note that 'term_definition' refers once to a table, and once - three times, actually - to a column in that table, but it is clear for the sql interpreter. however, it should be seen as unfortunate in a database schema).

-- Waclaw Kusnierczyk
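The SQL `replace` above works on plain substrings, so on its own it would also rewrite a longer ID that happens to begin with the same digits. As a minimal sketch of the same correction with that case guarded against, the Python below matches the ID only when no further digit follows. The function name `fix_taxon_id` and the sample string are assumptions made for illustration; they are not part of the GO database or its tooling.

```python
import re


def fix_taxon_id(definition: str, wrong: str = "8023", right: str = "58023") -> str:
    """Replace one NCBI taxon ID inside a definition string.

    Hypothetical helper: the ID is matched only as a whole number, so a
    longer ID that merely starts with the same digits is left alone.
    """
    pattern = re.compile(r"(ncbi_taxonomy_id:)" + re.escape(wrong) + r"(?!\d)")
    # A function is used as the replacement so the digits of `right`
    # cannot be misread as part of a group reference.
    return pattern.sub(lambda m: m.group(1) + right, definition)


# Illustrative input only, modelled on the definition text quoted in this ticket.
before = "as in the vascular plants (Tracheophyta, ncbi_taxonomy_id:8023)."
after = fix_taxon_id(before)
print(after)
```

Running the function a second time on its own output changes nothing, because the corrected ID no longer matches the pattern.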

2)

There are two improper taxons used within GO term names/definitions that perhaps should be repaired.

'Gram-positive Bacteria' is in fact a GenBank common name for the taxon with NCBI id 1239 that has a scientific name 'Firmicutes'. For consistency, the scientific name should be used as in other cases of taxons in GO.

'Gram-negative Bacteria' is a non-existent taxon, presumably created by an editor/curator; though there are taxons such as 'Gram-negative Bacterium K3' in the species taxonomy, there is no taxon 'Gram-negative Bacteria' that is common to them. The closest are 'unclassified bacteria (miscellaneous)' (49928) and 'unclassified bacteria' (2323), though these are not necessarily equivalent to 'gram-negative'.

It might be discussed whether it is GO or the NCBI taxonomy to be modified, and whether such 'negative' terms (= taxons here) should be considered at all; however, for now, terms with '(sensu Gram-negative Bacteria)' in the name should somehow be corrected.

I'm going to check with Michelle Gwinn-Giglio before changing the gram positive/negative terms.

Thanks,

Jen

Reported by: jenclark

Original Ticket: "geneontology/ontology-requests/3169":https://sourceforge.net/p/geneontology/ontology-requests/3169

gocentral commented 19 years ago

Logged In: YES user_id=735846

Hi Waclaw,

It's a bit weird this. When I look at GO:0010051 and GO:0010087 the taxon IDs in the term defs appear to be right.

'The process that gives rise to the patterning of the conducting tissues, as in, but not restricted to, the vascular plants (Tracheophyta, ncbi_taxonomy_id:58023).'

Is there any way that you could have lost a digit by accident? I'm not aware that those terms have been changed recently.

Are you using the latest version of the ontology, and are you using the obo format version of the file, or another format?

Thanks,

Jen

Original comment by: jenclark

gocentral commented 19 years ago

Logged In: YES user_id=615849

I use a MySQL dump from December 22; it may be that this has been corrected since that date. However, to confirm, I looked at the OBO format file yesterday (07.01.06), and it has 8023, not 58023. Please check this. There is some inconsistency on your side, then.

I could accept an explanation that 'somehow' the digit 5 disappeared, but since it did twice, and twice again in a separately downloaded file, the chance this explanation is correct is rather like lim/n->0 (n).

(Just checked: today's OBO file also does include the mistake.)
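A check like this can be repeated against any copy of the file. The sketch below is one possible way to list the terms whose text carries a given taxon ID; it assumes stanzas separated by blank lines with an `id:` line each, and it runs on a made-up two-stanza excerpt rather than the real ontology file. The function name `terms_with_taxon_id`, the `name:` values and the second term ID are hypothetical.

```python
import re
from typing import List

# Made-up excerpt in OBO style; names and the second ID are placeholders.
SAMPLE_OBO = """\
[Term]
id: GO:0010051
name: placeholder name
def: "As in the vascular plants (Tracheophyta, ncbi_taxonomy_id:8023)."

[Term]
id: GO:9999999
name: another placeholder
def: "A definition with no taxon reference."
"""


def terms_with_taxon_id(obo_text: str, taxon_id: str) -> List[str]:
    """Return the IDs of stanzas that mention exactly this taxon ID."""
    wanted = re.compile(r"ncbi_taxonomy_id:" + re.escape(taxon_id) + r"(?!\d)")
    hits = []
    for stanza in obo_text.split("\n\n"):
        id_match = re.search(r"^id: (\S+)$", stanza, flags=re.MULTILINE)
        if id_match and wanted.search(stanza):
            hits.append(id_match.group(1))
    return hits


print(terms_with_taxon_id(SAMPLE_OBO, "8023"))
```

Searching for the wrong ID and for the right one side by side shows which state a given download is in.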

vQ

Original comment by: waku

gocentral commented 19 years ago

Logged In: YES user_id=735846

From e-mail:

Hi Jen,

Thanks for alerting me to this. Gram positives are technically called Firmicutes, so that change makes sense, although not everyone will be familiar with it. As for the Gram negatives, the problem is that they are not a distinct taxonomic class. In fact, many, many bacteria cannot be classified with the Gram stain at all. The thing we were trying to capture is the nature of structures in prokaryotes with two membranes, and those come in many different taxonomic groups. A big group of Gram negatives is the Proteobacteria. We could use that. But I fear people will think it's restricted to just them; even though the documentation says it's not, I am afraid people might hesitate to use the term.

I'm not sure what to do. Could we use synonyms?

Michelle

Hi Michelle,

We had the same problem with the invertebrate group. We ended up listing the different formal taxa involved.

docs:

invertebrates No corresponding NCBI taxon There is no formal taxon to cover this group so either 'Protostomia' or 'Nematoda and Protostomia' should be used instead, depending on context.

We could have:

Gram positive

x term (sensu Firmicutes)

def: as in but not restricted to the gram positive bacteria (Firmicutes, ncbi_taxonomy_id:xxxxx).

Gram negative

x term (sensu Proteobacteria)

def: as in but not restricted to the gram negative bacteria (Proteobacteria, ncbi_taxonomy_id:xxxxx).

exact synonym: x term (sensu Gram negative bacteria)

Does that seem okay? I know it's a little sketchy.

Do you mind if I post your comments in the sourceforge item so the submitter can contribute? https://sourceforge.net/tracker/?group_id=36855&atid=440764&func=detail&aid=1427439

Thanks,

Jen

Hi Jen,

I think this sounds fine.

Thanks,

Michelle

Original comment by: jenclark

gocentral commented 19 years ago

Logged In: YES user_id=436423

According to the CVS log, Mike Cherry fixed them last night. The gene_ontology.obo file thus has 58023 in the current revision and the immediately preceding one (3.1360 and 3.1359); revision 3.1358 and earlier had 8023.

m

Original comment by: mah11

gocentral commented 19 years ago

Logged In: YES user_id=735846

Mike has fixed point number 1.

Jen

Original comment by: jenclark

gocentral commented 19 years ago

Logged In: YES user_id=615849

Jen, your solution seems quite good. The issue is that both human users and our automated friends should have the information represented understandably and unambiguously.

The ncbi taxonomy id is the ultimate piece of information I would use in any automated attempt to reason over GO-and-species, because taxon names are not unique (see our previous correspondence, not posted here) and a taxon name may lead to several distinct taxons. This may not affect the taxon names currently within GO, but it could affect those added in the future. That's another reason GO term-taxon associations should really have a fair status, that is, be represented explicitly as relations, not name/definition modifications.
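To illustrate what an explicit representation could look like, here is a rough sketch that pulls the taxon references out of definition text into (term, taxon ID, taxon name) tuples. It assumes the `(Name, ncbi_taxonomy_id:NNNN)` pattern seen in the definitions in this ticket; the term IDs `EX:0000001` and `EX:0000002` and the function name `taxon_relations` are made up for the example.

```python
import re
from typing import Dict, List, Tuple

# Matches "(Name, ncbi_taxonomy_id:NNNN)" as it appears in definition text.
TAXON_REF = re.compile(r"\((?P<name>[^(),]+), ncbi_taxonomy_id:(?P<id>\d+)\)")


def taxon_relations(definitions: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Turn taxon mentions in definitions into explicit relation tuples."""
    relations = []
    for term_id, definition in definitions.items():
        for match in TAXON_REF.finditer(definition):
            relations.append((term_id, int(match.group("id")), match.group("name")))
    return relations


# Hypothetical term IDs; the definition wording follows this ticket.
definitions = {
    "EX:0000001": "As in, but not restricted to, the Gram negative bacteria "
                  "(Proteobacteria, ncbi_taxonomy_id:1224).",
    "EX:0000002": "As in, but not restricted to, the Gram positive bacteria "
                  "(Firmicutes, ncbi_taxonomy_id:1239).",
}

relations = taxon_relations(definitions)
for relation in relations:
    print(relation)
```

Keeping the numeric ID as the key and the name as a label means a later name clash would not change which taxon a term points to.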

vQ

Original comment by: waku

gocentral commented 19 years ago

Logged In: YES user_id=735846

Hi Waclaw,

Thanks. :-) This is a bit tricky for the biologists, because the taxon designations in these sensu terms mean 'in the sense of', and sometimes it's not easy to remember that. A designation of 'gametogenesis sensu Magnoliophyta' means we are talking about the process of 'gametogenesis' as it is understood by researchers who work on flowering plants. However, the term can still be used for annotation of gene products that are encoded in other species that have that same process but that are not flowering plants. That's why we say 'sensu', meaning "in the sense of". It's also why the def says 'as in, but not restricted to'.

So there are several things we need to take care of in the gram negative terms.

1) the taxon we choose must have that process occurring in the same way as the way that we intend the definition to indicate.

2) We must include a common usage version of the taxon name if possible so that people are not racking their brains to try to remember what complicated or obscure latin names mean.

3) The common usage name should cover exactly the same group as the latin taxon name. This is difficult for the gram negative terms and I'm not sure if we're going to foul things up by having a common name that isn't exactly the same as the latin name.

As you say the issue is that both human users and our automated friends should have the information represented understandably and unambiguously.

We also need internal consistency within the GO.

I'd be interested to get advice on this from other GO Consortium members.

I will write to a few people to get their advice.

Thanks,

Jen

Original comment by: jenclark

gocentral commented 19 years ago

Logged In: YES user_id=735846

Hi,

Michael says that looks fine so I've implemented it plus synonyms.

Jen

(I'm keeping this item open 'till I get feedback from Michelle on whether she wants hyphens in 'Gram negative'.)

(sensu Proteobacteria)

As in, but not restricted to, the Gram negative bacteria (Proteobacteria, ncbi_taxonomy_id:1224).

(sensu Firmicutes)

as in but not restricted to the Gram positive bacteria (Firmicutes, ncbi_taxonomy_id:1239).

Original comment by: jenclark

gocentral commented 19 years ago

Logged In: YES user_id=735846

The hyphenation has been standardised to non-hyphen now.

Jen

Original comment by: jenclark

gocentral commented 19 years ago

Original comment by: jenclark