geneontology / go-ontology

Source ontology files for the Gene Ontology
http://geneontology.org/page/download-ontology
Creative Commons Attribution 4.0 International

Loading COG2GO #17493

Closed pgaudet closed 5 years ago

pgaudet commented 5 years ago

Moving discussion about COG2GO from https://github.com/geneontology/go-ontology/issues/16989

Jakob Russel developed a COG2GO mapping http://mibi.galaxy.bio.ku.dk/russel/mappings/cog2go

@cmungall proposes to load that in the go-edit file.

I tried to download the file, but the copy I got seems partial: the last line, COG:COG0787, has no mapping. The file is more than 600K lines long with lots of duplicates; when I remove the duplicates I get 35,000 lines.
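The duplicate check described above can be sketched in a few lines of Python. This is only an illustration: it assumes one `COG:... > GO:...` mapping per line (the format quoted in this thread), and the function name `dedupe_lines` is made up for the example.

```python
def dedupe_lines(lines):
    """Return the unique non-empty lines, keeping first-seen order."""
    seen = set()
    unique = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and anything already emitted.
        if not line or line in seen:
            continue
        seen.add(line)
        unique.append(line)
    return unique


# Tiny in-memory sample standing in for the downloaded file.
sample = [
    "COG:COG0156 > GO:0003870",
    "COG:COG0156 > GO:0030170",
    "COG:COG0156 > GO:0003870",
    "",
]
unique_sample = dedupe_lines(sample)
print(len(sample), "->", len(unique_sample))
```

For the real file, the same function can be fed an open file handle instead of the sample list.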

That still seems like a lot. We should have a look at the mappings before integrating the file.

For example, COG:COG0156 (7-keto-8-aminopelargonate synthetase or related enzyme) has these mappings:
COG:COG0156 > GO:0003870 5-aminolevulinate synthase activity
COG:COG0156 > GO:0030170 pyridoxal phosphate binding
COG:COG0156 > GO:0033014 tetrapyrrole biosynthetic process

The COG label and the mapped activity seem different.

Russel88 commented 5 years ago

I'm the contributor of the cog2go mapping file.

I forgot to remove duplicates. I have updated the file now and it's much smaller. I also included a md5 checksum.
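A published md5 checksum makes it possible to tell a partial download from a complete one. A minimal sketch of the check with Python's standard `hashlib`; the helper name `md5_of_file` and the throwaway demo file are assumptions for illustration, not part of the mapping distribution.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary file; a real check would point at the downloaded
# mapping file and compare against the checksum published next to it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"COG:COG0156 > GO:0030170\n")
    demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

If the computed digest differs from the published one, the download is incomplete or corrupted and should be repeated.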

I think the problem with the mappings is that they are based on all UniProt entries that have both a COG and a GO annotation. Some of these cross-mappings could be wrong or misleading. One solution is to keep only the mappings that are found in several proteins. E.g. for the example you give (COG0156), these are the six mappings found in the most UniProt entries (the number is the count of entries):
3028 COG:COG0156 > GO:0030170
1953 COG:COG0156 > GO:0009058
867 COG:COG0156 > GO:0008710
631 COG:COG0156 > GO:0008890
490 COG:COG0156 > GO:0009102
467 COG:COG0156 > GO:0003870
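Keeping only mappings supported by several proteins amounts to an absolute support threshold. A rough sketch using the COG0156 counts quoted in this comment; the function name and the threshold of 500 entries are arbitrary choices for illustration, not a recommended value.

```python
def filter_by_support(counts, min_entries):
    """Keep (cog, go) pairs seen in at least `min_entries` UniProt entries."""
    return {pair: n for pair, n in counts.items() if n >= min_entries}


# Counts for COG0156 as quoted in this thread.
cog0156_counts = {
    ("COG0156", "GO:0030170"): 3028,
    ("COG0156", "GO:0009058"): 1953,
    ("COG0156", "GO:0008710"): 867,
    ("COG0156", "GO:0008890"): 631,
    ("COG0156", "GO:0009102"): 490,
    ("COG0156", "GO:0003870"): 467,
}

# The threshold of 500 is an arbitrary illustration.
supported = filter_by_support(cog0156_counts, min_entries=500)
for (cog, go), n in sorted(supported.items(), key=lambda item: -item[1]):
    print(n, cog, ">", go)
```

A fixed threshold like this ignores how many proteins each COG has in UniProt, so a value that suits a large COG may remove every mapping from a small one.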

pgaudet commented 5 years ago

From @Russel88

Dear all,

I am the contributor of the cog2go mapping file. @pgaudet I forgot to remove duplicates. I have updated the file now and it's much smaller. I also included a md5 checksum.

Cheers, Jakob

pgaudet commented 5 years ago

Great! Thanks. We'll have another look at the file.

pgaudet commented 5 years ago

Hi @Russel88

Can you explain your process a little more? For example, COG0001 (Glutamate-1-semialdehyde aminotransferase) has about 50 GO mappings:

annotation_class,annotation_class_label
GO:0019354,siroheme biosynthetic process
GO:0031177,phosphopantetheine binding
GO:0048046,apoplast
GO:0009058,biosynthetic process
GO:0009236,cobalamin biosynthetic process
GO:0009570,chloroplast stroma
GO:0005524,ATP binding
GO:0005634,nucleus
GO:0005829,cytosol
GO:0005886,plasma membrane
GO:0005737,cytoplasm
GO:0004672,protein kinase activity
GO:0004523,RNA-DNA hybrid ribonuclease activity
GO:0018580,nitronate monooxygenase activity
GO:0043565,sequence-specific DNA binding
GO:0043115,precorrin-2 dehydrogenase activity
GO:0004314,[acyl-carrier-protein] S-malonyltransferase activity
GO:0042286,"glutamate-1-semialdehyde 2,1-aminomutase activity"
GO:0003676,nucleic acid binding
GO:0042802,identical protein binding
GO:0042803,protein homodimerization activity
GO:0003824,catalytic activity
GO:0003700,DNA-binding transcription factor activity
GO:0030170,pyridoxal phosphate binding
GO:0046872,metal ion binding
GO:0047879,erythronolide synthase activity
GO:0009507,chloroplast
GO:0009941,chloroplast envelope
GO:0016869,"intramolecular transferase activity, transferring amino groups"
GO:0016874,ligase activity
GO:0016740,transferase activity
GO:0016747,"transferase activity, transferring acyl groups other than amino-acyl groups"
GO:0016779,nucleotidyltransferase activity
GO:0016787,hydrolase activity
GO:0016788,"hydrolase activity, acting on ester bonds"
GO:0016705,"oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen"
GO:0016021,integral component of membrane
GO:0016491,oxidoreductase activity
GO:0008483,transaminase activity
GO:0047462,phenylalanine racemase (ATP-hydrolyzing) activity
GO:0047689,aspartate racemase activity
GO:0033014,tetrapyrrole biosynthetic process
GO:0006779,porphyrin-containing compound biosynthetic process
GO:0006782,protoporphyrinogen IX biosynthetic process
GO:0050157,ornithine racemase activity
GO:0015995,chlorophyll biosynthetic process
GO:0051266,sirohydrochlorin ferrochelatase activity
GO:0008881,glutamate racemase activity
GO:0008781,N-acylneuraminate cytidylyltransferase activity

  1. Ideally you would map only to the most granular term. Otherwise some high-level terms will be mapped to all COGs, which I don't think is desirable.
  2. Most top-level enzymatic classes are present, which seems unlikely for a single COG:
     GO:0016491,oxidoreductase activity
     GO:0016874,ligase activity
     GO:0016740,transferase activity
     GO:0016787,hydrolase activity
  3. There seem to be many false positives, plus many widely different enzymatic activities:
     GO:0004672,protein kinase activity
     GO:0004523,RNA-DNA hybrid ribonuclease activity
     GO:0003700,DNA-binding transcription factor activity
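Point 1 above (map only to the most granular term) can be sketched as dropping every mapped term that is an ancestor of another mapped term. The parent map below is a hand-written, simplified toy that skips intermediate terms; it is not loaded from the GO files, and a real implementation would read the relationships from the ontology itself. The function names are made up for the example.

```python
def ancestors(term, parents):
    """Collect every term reachable from `term` by following parent links."""
    found = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def most_specific(terms, parents):
    """Drop any term that is an ancestor of another term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return terms - implied


# Simplified toy hierarchy (assumption for illustration only).
toy_parents = {
    "GO:0008483": {"GO:0016740"},  # transaminase -> transferase activity
    "GO:0016740": {"GO:0003824"},  # transferase -> catalytic activity
    "GO:0016788": {"GO:0016787"},  # hydrolase on ester bonds -> hydrolase
    "GO:0016787": {"GO:0003824"},  # hydrolase -> catalytic activity
}

mapped = {"GO:0003824", "GO:0016740", "GO:0008483", "GO:0016787", "GO:0016788"}
specific = most_specific(mapped, toy_parents)
print(sorted(specific))
```

This removes redundant high-level terms but does nothing about the false positives in point 3: an unrelated specific term has no descendant in the set and therefore survives.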

Is this correct?

Thanks, Pascale

Russel88 commented 5 years ago

Hi @pgaudet

No, I don't think it's correct. There are too many false positives. The way I did it was to find all UniProt protein entries that have both a COG and GO annotation. I then used this information to make the cross-mappings.
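As a rough sketch of the procedure just described, the cross-mapping can be built by counting, for every protein, each pairing of its COG ids with its GO ids. The record layout `(accession, cog_ids, go_ids)` and the toy accessions `P1` to `P4` are assumptions for illustration; they are not real UniProt entries and this is not the script actually used.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(entries):
    """Count how many proteins carry each (COG, GO) combination.

    `entries` is an iterable of (accession, cog_ids, go_ids) tuples.
    """
    counts = Counter()
    for _accession, cog_ids, go_ids in entries:
        # Use sets so a pair is counted at most once per protein.
        for pair in product(set(cog_ids), set(go_ids)):
            counts[pair] += 1
    return counts


# Toy records, not real UniProt data.
toy_entries = [
    ("P1", ["COG0001"], ["GO:0008483", "GO:0030170"]),
    ("P2", ["COG0001"], ["GO:0008483", "GO:0004672"]),
    ("P3", ["COG0001", "COG0156"], ["GO:0030170"]),
    ("P4", [], ["GO:0005737"]),  # no COG, so it contributes nothing
]
pair_counts = count_cooccurrences(toy_entries)
for (cog, go), n in sorted(pair_counts.items()):
    print(n, cog, ">", go)
```

Because every GO term on a protein is paired with every COG on that protein, a single entry with an unrelated annotation is enough to create a mapping, which may be one source of the false positives discussed here.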

One way to fix it could be to keep only the COG->GO mappings found in many proteins. E.g. the five most frequent COG->GO mappings for COG0001 are (numbers are the number of proteins in UniProt with this mapping):
1235 COG:COG0001 > GO:0005737, cytoplasm
1235 COG:COG0001 > GO:0006782, protoporphyrinogen IX biosynthetic process
1344 COG:COG0001 > GO:0042286, glutamate-1-semialdehyde 2,1-aminomutase activity
1680 COG:COG0001 > GO:0030170, pyridoxal phosphate binding
1682 COG:COG0001 > GO:0008483, transaminase activity

These all seem correct: https://biocyc.org/gene?orgid=ECOLI&id=GSAAMINOTRANS-MONOMER#tab=GO

How to set the cutoff then becomes a challenge though.
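One possible way to frame the cutoff is relative to the best-supported mapping of each COG rather than as a fixed count. The sketch below uses the COG0001 counts quoted above; the function name and the fractions 0.5 and 0.75 are arbitrary values chosen only to show how sensitive the result is.

```python
def relative_cutoff(go_counts, fraction):
    """Keep GO terms whose count is at least `fraction` of the top count."""
    if not go_counts:
        return {}
    threshold = fraction * max(go_counts.values())
    return {go: n for go, n in go_counts.items() if n >= threshold}


# Counts for COG0001 as quoted in this thread.
cog0001 = {
    "GO:0005737": 1235,  # cytoplasm
    "GO:0006782": 1235,  # protoporphyrinogen IX biosynthetic process
    "GO:0042286": 1344,  # glutamate-1-semialdehyde 2,1-aminomutase activity
    "GO:0030170": 1680,  # pyridoxal phosphate binding
    "GO:0008483": 1682,  # transaminase activity
}

kept_half = relative_cutoff(cog0001, 0.5)     # threshold 841.0
kept_strict = relative_cutoff(cog0001, 0.75)  # threshold 1261.5
print(len(kept_half), len(kept_strict))
```

With a fraction of 0.5 all five mappings are kept, while 0.75 already drops the two mappings at 1235 even though they look correct, which illustrates why picking the cutoff is hard.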

Cheers,

pgaudet commented 5 years ago

Thanks @Russel88. Can you give me an example of a UniProt entry that has a GOC reference?

Russel88 commented 5 years ago

https://www.uniprot.org/uniprot/A8J7H3