Closed (pgaudet closed this issue 5 years ago)
I'm the contributor of the cog2go mapping file.
I forgot to remove duplicates. I have now updated the file, and it is much smaller. I also included an MD5 checksum.
I think the problem with the mappings is that they are based on all UniProt entries that have both a COG and a GO annotation. Some of these cross-mappings could be wrong or misleading. One solution is to keep only mappings that are found in several proteins. E.g., for the example you give (COG0156), these are the six mappings found in the most UniProt entries (the number is the count of entries with that mapping):
3028 COG:COG0156 > GO:0030170
1953 COG:COG0156 > GO:0009058
867 COG:COG0156 > GO:0008710
631 COG:COG0156 > GO:0008890
490 COG:COG0156 > GO:0009102
467 COG:COG0156 > GO:0003870
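To make the "keep mappings found in several proteins" idea concrete, here is a minimal sketch that parses the counted COG0156 lines quoted above and drops mappings below an absolute count. The helper names and the threshold of 500 are assumptions chosen for illustration, not values anyone in this thread has proposed.

```python
def parse_counted_mappings(text):
    """Parse lines of the form '<count> COG:<id> > GO:<id>' into tuples."""
    rows = []
    for line in text.strip().splitlines():
        count, cog, _arrow, go = line.split()[:4]
        rows.append((int(count), cog, go))
    return rows


def keep_supported(rows, min_count):
    """Keep only (COG, GO) pairs seen in at least min_count entries."""
    return [(cog, go) for count, cog, go in rows if count >= min_count]


# Counts copied from the COG0156 example in this comment.
counted = """
3028 COG:COG0156 > GO:0030170
1953 COG:COG0156 > GO:0009058
867 COG:COG0156 > GO:0008710
631 COG:COG0156 > GO:0008890
490 COG:COG0156 > GO:0009102
467 COG:COG0156 > GO:0003870
"""

rows = parse_counted_mappings(counted)
# min_count=500 is a hypothetical threshold, used only to show the effect.
kept = keep_supported(rows, min_count=500)
```

With this arbitrary threshold, the four most frequent mappings survive and the two with 490 and 467 entries are dropped. Whether 500 is a sensible value is exactly the open question.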
From @Russel88
Dear all,
I am the contributor of the cog2go mapping file. @pgaudet, I forgot to remove duplicates. I have now updated the file, and it is much smaller. I also included an MD5 checksum.
Cheers, Jakob
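For anyone re-downloading the file, a minimal sketch of checking it against the published MD5 checksum, which would also catch a partial download. The function names are made up for this example, and the path and expected digest in the usage comment are placeholders, not the real values.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 with the expected digest (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Usage (placeholder path and digest):
#   matches_checksum("cog2go", "<digest from the .md5 file>")
```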
Great! Thanks. We'll have another look at the file.
Hi @Russel88
Can you explain your process a little more? For example, COG0001 (Glutamate-1-semialdehyde aminotransferase) has about 50 GO mappings:
annotation_class,annotation_class_label
GO:0019354,siroheme biosynthetic process
GO:0031177,phosphopantetheine binding
GO:0048046,apoplast
GO:0009058,biosynthetic process
GO:0009236,cobalamin biosynthetic process
GO:0009570,chloroplast stroma
GO:0005524,ATP binding
GO:0005634,nucleus
GO:0005829,cytosol
GO:0005886,plasma membrane
GO:0005737,cytoplasm
GO:0004672,protein kinase activity
GO:0004523,RNA-DNA hybrid ribonuclease activity
GO:0018580,nitronate monooxygenase activity
GO:0043565,sequence-specific DNA binding
GO:0043115,precorrin-2 dehydrogenase activity
GO:0004314,[acyl-carrier-protein] S-malonyltransferase activity
GO:0042286,"glutamate-1-semialdehyde 2,1-aminomutase activity"
GO:0003676,nucleic acid binding
GO:0042802,identical protein binding
GO:0042803,protein homodimerization activity
GO:0003824,catalytic activity
GO:0003700,DNA-binding transcription factor activity
GO:0030170,pyridoxal phosphate binding
GO:0046872,metal ion binding
GO:0047879,erythronolide synthase activity
GO:0009507,chloroplast
GO:0009941,chloroplast envelope
GO:0016869,"intramolecular transferase activity, transferring amino groups"
GO:0016874,ligase activity
GO:0016740,transferase activity
GO:0016747,"transferase activity, transferring acyl groups other than amino-acyl groups"
GO:0016779,nucleotidyltransferase activity
GO:0016787,hydrolase activity
GO:0016788,"hydrolase activity, acting on ester bonds"
GO:0016705,"oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen"
GO:0016021,integral component of membrane
GO:0016491,oxidoreductase activity
GO:0008483,transaminase activity
GO:0047462,phenylalanine racemase (ATP-hydrolyzing) activity
GO:0047689,aspartate racemase activity
GO:0033014,tetrapyrrole biosynthetic process
GO:0006779,porphyrin-containing compound biosynthetic process
GO:0006782,protoporphyrinogen IX biosynthetic process
GO:0050157,ornithine racemase activity
GO:0015995,chlorophyll biosynthetic process
GO:0051266,sirohydrochlorin ferrochelatase activity
GO:0008881,glutamate racemase activity
GO:0008781,N-acylneuraminate cytidylyltransferase activity
Is this correct?
Thanks, Pascale
Hi @pgaudet
No, I don't think it's correct. There are too many false positives. The way I did it was to find all UniProt protein entries that have both a COG and GO annotation. I then used this information to make the cross-mappings.
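Roughly, the procedure amounts to counting how often each COG and GO term appear on the same entry. The sketch below is a simplified reconstruction under assumptions, not the actual script: the entry tuples, the accessions and COG9999 are hypothetical toy data, and it assumes the UniProt cross-references have already been parsed into lists.

```python
from collections import Counter


def count_cog_go_pairs(entries):
    """Count entries in which a COG and a GO term co-occur.

    entries: iterable of (accession, cog_ids, go_ids) tuples.
    """
    pair_counts = Counter()
    for _accession, cog_ids, go_ids in entries:
        # Every COG on an entry is paired with every GO term on it,
        # whether or not the two annotations refer to the same function.
        for cog in set(cog_ids):
            for go in set(go_ids):
                pair_counts[(cog, go)] += 1
    return pair_counts


# Hypothetical toy entries; ENTRY_C carries two COGs.
entries = [
    ("ENTRY_A", ["COG0001"], ["GO:0008483", "GO:0030170"]),
    ("ENTRY_B", ["COG0001"], ["GO:0008483"]),
    ("ENTRY_C", ["COG0001", "COG9999"], ["GO:0005634"]),
]
pair_counts = count_cog_go_pairs(entries)
```

In the toy data, ENTRY_C links GO:0005634 to both of its COGs. This may be one way unrelated terms end up in the mapping, although I have not verified that it explains the real false positives.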
One way to fix it could be to keep only the COG->GO mappings found in many proteins. E.g., the five most frequent COG->GO mappings for COG0001 are (numbers are the # of proteins in UniProt with this mapping):
1235 COG:COG0001 > GO:0005737, cytoplasm
1235 COG:COG0001 > GO:0006782, protoporphyrinogen IX biosynthetic process
1344 COG:COG0001 > GO:0042286, glutamate-1-semialdehyde 2,1-aminomutase activity
1680 COG:COG0001 > GO:0030170, pyridoxal phosphate binding
1682 COG:COG0001 > GO:0008483, transaminase activity
These all seem correct: https://biocyc.org/gene?orgid=ECOLI&id=GSAAMINOTRANS-MONOMER#tab=GO
How to set the cutoff then becomes a challenge though.
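One option, sketched here as an assumption rather than a tested proposal, is a relative cutoff: keep a GO term only if its count is at least some fraction of the most frequent term for that COG. The function name and the fraction 0.5 are made up for the example, and the GO:0005634 count is hypothetical; the other five counts are the ones listed above.

```python
def filter_by_relative_support(go_counts, min_fraction):
    """Keep GO terms whose count is >= min_fraction of the top count."""
    if not go_counts:
        return {}
    top = max(go_counts.values())
    return {go: n for go, n in go_counts.items() if n / top >= min_fraction}


cog0001 = {
    "GO:0005737": 1235,
    "GO:0006782": 1235,
    "GO:0042286": 1344,
    "GO:0030170": 1680,
    "GO:0008483": 1682,
    "GO:0005634": 3,  # hypothetical count, for illustration only
}
# min_fraction=0.5 is an arbitrary choice, not a recommended value.
kept = filter_by_relative_support(cog0001, min_fraction=0.5)
```

This uses the top count as a stand-in for the number of proteins carrying the COG, which is itself an assumption. A relative cutoff at least scales with COG size, which a single absolute count would not.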
Cheers,
Thanks @Russel88. Can you give me an example of a UniProt entry that has a GOC reference?
Moving discussion about COG2GO from https://github.com/geneontology/go-ontology/issues/16989
Jakob Russel developed a COG2GO mapping: http://mibi.galaxy.bio.ku.dk/russel/mappings/cog2go
@cmungall proposes to load that in the go-edit file.
I tried to download the file, but what I got seems to be partial (the last line, COG:COG0787, has no mapping). The file is >600K lines long, with lots of duplicates; when I remove duplicates I get 35,000 lines.
That seems like a lot. We should have a look at the mappings before integrating them.
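For reference, a minimal sketch of the check described above: drop duplicate lines and flag any line that has no mapping after the COG identifier. It assumes each line looks like `COG:<id> > GO:<id>`, as in the examples in this thread; the real file may carry extra fields, and the function name is made up here.

```python
def clean_mapping_lines(lines):
    """Return (unique mappings, incomplete lines) from raw mapping lines."""
    seen = set()
    unique = []
    incomplete = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        left, sep, right = line.partition(">")
        # A line without '>' or without text on both sides has no mapping.
        if not sep or not left.strip() or not right.strip():
            incomplete.append(line)
            continue
        key = (left.strip(), right.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique, incomplete


sample = [
    "COG:COG0156 > GO:0030170",
    "COG:COG0156 > GO:0030170",
    "COG:COG0156 > GO:0003870",
    "COG:COG0787",
]
unique, incomplete = clean_mapping_lines(sample)
```

On this sample, the repeated line is collapsed and `COG:COG0787` is reported as incomplete, which is the symptom that suggested a truncated download.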
For example, COG:COG0156 (7-keto-8-aminopelargonate synthetase or related enzyme) has these mappings:
COG:COG0156 > GO:0003870 5-aminolevulinate synthase activity
COG:COG0156 > GO:0030170 pyridoxal phosphate binding
COG:COG0156 > GO:0033014 tetrapyrrole biosynthetic process
The COG label and the mapped activity seem different.