geneontology / go-releases

Tasks and notes for monthly GO releases

QC E coli files for new GOA-GOC data exchange pipeline #101

Open pgaudet opened 4 days ago

pgaudet commented 4 days ago

From Lisa Moore:

I looked over the three files and think they look okay. I checked a few genes and made sure the names matched the BioCyc ID, and that the GO terms that were already in EcoCyc looked the same for some genes. So, we think it is okay from our end. Cheers, Lisa

pgaudet commented 4 days ago

Compare GPIs:

GOA provides more proteins; it also contains complexes and RNAs, which are missing from GOC.

There are 51 entries in GOC that are not in GOA, i.e. they are in the E. coli reference proteome but have no EcoCyc mapping:

Entry

UniProtKB:A0A385XJ53 UniProtKB:A0A385XJL4 UniProtKB:P06710 UniProtKB:P07363 UniProtKB:P0ACL0 UniProtKB:P0DP21 UniProtKB:P0DP22 UniProtKB:P0DP70 UniProtKB:P0DP89 UniProtKB:P0DUM3 UniProtKB:P0DW56 UniProtKB:P21420 UniProtKB:P30192 UniProtKB:P31450 UniProtKB:P33369 UniProtKB:P33666 UniProtKB:P36667 UniProtKB:P36930 UniProtKB:P36943 UniProtKB:P37655 UniProtKB:P39347 UniProtKB:P39349 UniProtKB:P39355 UniProtKB:P39901 UniProtKB:P42905 UniProtKB:P45766 UniProtKB:P69831 UniProtKB:P75741 UniProtKB:P75901 UniProtKB:P75960 UniProtKB:P76168 UniProtKB:P76323 UniProtKB:P76335 UniProtKB:P76359 UniProtKB:P76464 UniProtKB:P76611 UniProtKB:P76616 UniProtKB:P76655 UniProtKB:P77184 UniProtKB:P77196 UniProtKB:P77286 UniProtKB:P77481 UniProtKB:P77528 UniProtKB:P77601 UniProtKB:Q46790 UniProtKB:Q47153 UniProtKB:Q47154 UniProtKB:Q47718 UniProtKB:Q59385 UniProtKB:Q7DFU6 UniProtKB:Q7DFV4
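For reference, a comparison like this can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the list above: it assumes GPI 1.2-style tab-separated files with the database prefix in column 1 and the object ID in column 2 (for other layouts, adjust `db_col` and `id_col`), and the function names are made up for this example.

```python
def read_gpi_ids(lines, db_col=0, id_col=1):
    """Collect 'DB:ID' identifiers from the lines of a GPI file.

    Assumes a GPI 1.2-style layout (tab-separated, DB in column 1,
    object ID in column 2). Lines starting with '!' are header/comment
    lines and are skipped.
    """
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        if len(cols) <= max(db_col, id_col):
            continue  # malformed row: too few columns
        ids.add(f"{cols[db_col]}:{cols[id_col]}")
    return ids


def only_in_first(first_ids, second_ids):
    """Return the sorted identifiers present in the first set only."""
    return sorted(first_ids - second_ids)
```

With real files this would be called as `only_in_first(read_gpi_ids(open(goc_path)), read_gpi_ids(open(goa_path)))`, where `goc_path` and `goa_path` are placeholders for the two GPI files being compared.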

pgaudet commented 4 days ago

Emailed Lisa: I am also doing some QC on all the files, and it would be great if you could improve the following points in the GPI (which would align the GAFs better):

You have 30 ‘gene product’ objects and 85 ‘transcripts’. Ideally these could have other, more precise types (see the suggestions in the attached Excel file for a few examples).

Protein complexes should be assigned the type ‘protein complex’. Right now they are assigned ‘protein’ (see the EcoCyc entries with no UniProt mappings).

It would be great if you could get the RNAcentral mappings for RNAs (and extract the ‘entity type’ from there) and the Complex Portal IDs from Complex Portal. Right now, GOA is integrating this data, but they don’t have any mappings to EcoCyc. This way, any annotation that EcoCyc makes to these objects will be in GOA & GO central, and EcoCyc could also load the annotations coming from these sources.
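A quick way to get type counts like the ones above is to tally the object-type column of the GPI. The sketch below is illustrative only: it assumes a GPI 1.2-style layout with the object type in column 6 (set `type_col` for other layouts), and `count_object_types` is a made-up helper name.

```python
from collections import Counter


def count_object_types(lines, type_col=5):
    """Tally the object-type column of a GPI file.

    Assumes GPI 1.2-style rows (tab-separated, object type in
    column 6, i.e. index 5). '!' header lines, blank lines and rows
    with too few columns are ignored.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        if len(cols) > type_col:
            counts[cols[type_col]] += 1
    return counts
```

Running this over the file and printing `counts.most_common()` shows at a glance how many entries still carry generic types such as ‘gene product’ or ‘transcript’.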