geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Human non-redundant gene sets: topic for GOC meeting (QforO reference proteome set) #816

Closed ValWood closed 5 years ago

ValWood commented 6 years ago

I am using the FASTA file from https://www.ebi.ac.uk/reference_proteomes because I think this is the dataset that GO uses.

I downloaded the human file (created April 2017; 21042 entries).
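For anyone wanting to reproduce the entry count, here is a minimal sketch that counts records in a FASTA file by counting header lines (those starting with ">"). The in-memory demo text and its placeholder sequences are assumptions for illustration only; they are not taken from the real download.

```python
import io


def count_fasta_entries(handle):
    """Count FASTA records: one record per line that starts with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory stand-in for the downloaded file.
# The header lines come from this thread; the sequences are placeholders.
demo = io.StringIO(
    ">sp|Q5TGZ0|MIC10_HUMAN MICOS complex subunit MIC10 "
    "OS=Homo sapiens OX=9606 GN=MINOS1 PE=1 SV=1\n"
    "AAAAAAAAAA\n"
    ">tr|R4GNA1|R4GNA1_HUMAN MICOS complex subunit MIC10 (Fragment) "
    "OS=Homo sapiens OX=9606 GN=MINOS1-NBL1 PE=3 SV=1\n"
    "AAAAAA\n"
)
n_entries = count_fasta_entries(demo)
print(n_entries)
```

For the real file, the same function can be called on an opened file handle, for example `with open(path) as fh: count_fasta_entries(fh)`, where `path` is wherever the download was saved.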

ValWood commented 6 years ago

Question 3

ValWood commented 6 years ago

Question 4

Why is the fragment from TrEMBL sometimes included? See this example for MIC10. The full-length protein is in UniProt, and the TrEMBL fragment is also included in the reference proteome.

sp|Q5TGZ0|MIC10_HUMAN MICOS complex subunit MIC10 OS=Homo sapiens OX=9606 GN=MINOS1 PE=1 SV=1
tr|R4GNA1|R4GNA1_HUMAN MICOS complex subunit MIC10 (Fragment) OS=Homo sapiens OX=9606 GN=MINOS1-NBL1 PE=3 SV=1
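A rough way to find more cases like this one is to parse the FASTA header lines and flag TrEMBL entries marked "(Fragment)" whose protein name otherwise matches a reviewed ("sp") entry. This is only a sketch of one possible heuristic: the function names are made up for illustration, and matching on protein name is an assumption (in the MIC10 example the GN values differ, so matching on gene name would miss it).

```python
import re

# Header layout as seen in this thread:
# db|accession|entry_name description OS=... OX=... [GN=...] [PE=...] [SV=...]
HEADER_RE = re.compile(
    r"^>?(?P<db>sp|tr)\|(?P<acc>[^|]+)\|(?P<entry>\S+)\s+"
    r"(?P<name>.*?)\s+OS=(?P<os>.*?)\s+OX=(?P<ox>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?(?:\s+PE=(?P<pe>\d+))?(?:\s+SV=(?P<sv>\d+))?\s*$"
)


def parse_header(line):
    """Split one header line into its fields; 'gene' is None if GN= is absent."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    return match.groupdict()


def fragments_with_reviewed_match(headers):
    """Return accessions of TrEMBL fragments whose name matches an sp entry."""
    records = [parse_header(h) for h in headers]
    reviewed_names = {r["name"] for r in records if r["db"] == "sp"}
    suffix = "(Fragment)"
    hits = []
    for r in records:
        if r["db"] == "tr" and r["name"].endswith(suffix):
            base_name = r["name"][: -len(suffix)].strip()
            if base_name in reviewed_names:
                hits.append(r["acc"])
    return hits


headers = [
    "sp|Q5TGZ0|MIC10_HUMAN MICOS complex subunit MIC10 "
    "OS=Homo sapiens OX=9606 GN=MINOS1 PE=1 SV=1",
    "tr|R4GNA1|R4GNA1_HUMAN MICOS complex subunit MIC10 (Fragment) "
    "OS=Homo sapiens OX=9606 GN=MINOS1-NBL1 PE=3 SV=1",
]
flagged = fragments_with_reviewed_match(headers)
print(flagged)
```

Name matching will miss fragments whose description differs from the reviewed entry, so the result should be read as a lower bound rather than a complete list.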

ValWood commented 6 years ago

@selewis you might know something about this from PAINT work?

ValWood commented 6 years ago

Who should we ask?

ValWood commented 6 years ago

Question 5

Why are isoforms from TrEMBL included? For example:

tr|A0A024R1R8|A0A024R1R8_HUMAN HCG2014768, isoform CRA_a OS=Homo sapiens OX=9606 GN=hCG_2014768 PE=4 SV=1

is https://www.uniprot.org/uniprot/Q9Y2S6 (it doesn't seem to be an isoform; it's the same locus).

The same goes for these TrEMBL entries:

tr|A0A024R1R8|A0A024R1R8_HUMAN HCG2014768, isoform CRA_a OS=Homo sapiens OX=9606 GN=hCG_2014768 PE=4 SV=1
tr|A0A0A6YYC7|A0A0A6YYC7_HUMAN HCG2042749, isoform CRA_d OS=Homo sapiens OX=9606 GN=ZFP91-CNTF PE=4 SV=1
tr|A0A0B4J2F2|A0A0B4J2F2_HUMAN SNF1-like kinase, isoform CRA_a OS=Homo sapiens OX=9606 GN=LOC102724428 PE=4 SV=1
tr|A0A1B0GTB2|A0A1B0GTB2_HUMAN HCG2038094, isoform CRA_a OS=Homo sapiens OX=9606 GN=TUNAR PE=4 SV=2
tr|A0A1B0GVL6|A0A1B0GVL6_HUMAN HCG1800530, isoform CRA_b OS=Homo sapiens OX=9606 GN=TMEM238L PE=4 SV=1
tr|A0A1W2PN81|A0A1W2PN81_HUMAN Neuronal acetylcholine receptor subunit alpha-7 isoform 2 precursor OS=Homo sapiens OX=9606 PE=3 SV=1
tr|A0A1W2PR95|A0A1W2PR95_HUMAN HCG1642624, isoform CRA_a OS=Homo sapiens OX=9606 GN=hCG_1642624 PE=4 SV=1
tr|A8KAH6|A8KAH6_HUMAN Heat shock 27kDa protein 2, isoform CRA_a OS=Homo sapiens OX=9606 GN=HSPB2-C11orf52 PE=2 SV=1
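A list like this can be pulled out of the header lines with a short filter. The sketch below keeps TrEMBL ("tr") entries whose description contains the word "isoform" and reports the accession together with the GN value, if any. The function name is hypothetical, and treating the word "isoform" in the description as the selection criterion is an assumption.

```python
import re


def trembl_isoform_entries(headers):
    """Return (accession, gene or None) for tr entries described as isoforms."""
    hits = []
    for line in headers:
        line = line.lstrip(">").strip()
        db, acc, rest = line.split("|", 2)
        # Description is the text between the entry name and the OS= field.
        description = rest.split(" ", 1)[1].split(" OS=", 1)[0]
        if db == "tr" and re.search(r"\bisoform\b", description, re.IGNORECASE):
            gene = re.search(r"\bGN=(\S+)", line)
            hits.append((acc, gene.group(1) if gene else None))
    return hits


headers = [
    "sp|Q5TGZ0|MIC10_HUMAN MICOS complex subunit MIC10 "
    "OS=Homo sapiens OX=9606 GN=MINOS1 PE=1 SV=1",
    "tr|A0A024R1R8|A0A024R1R8_HUMAN HCG2014768, isoform CRA_a "
    "OS=Homo sapiens OX=9606 GN=hCG_2014768 PE=4 SV=1",
    "tr|A0A1W2PN81|A0A1W2PN81_HUMAN Neuronal acetylcholine receptor subunit "
    "alpha-7 isoform 2 precursor OS=Homo sapiens OX=9606 PE=3 SV=1",
]
isoform_hits = trembl_isoform_entries(headers)
for acc, gene in isoform_hits:
    print(acc, gene)
```

Entries reported with no gene (such as A0A1W2PN81 here) cannot be tied to a locus from the header alone.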

ValWood commented 6 years ago

Question 6

There are lots of TrEMBL duplicates (417).

Most look like duplicates and have no locus information?

tr|M0QZK8|M0QZK8_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0QZQ0|M0QZQ0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0QZU9|M0QZU9_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R036|M0R036_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R129|M0R129_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R143|M0R143_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R2N4|M0R2N4_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R2T5|M0R2T5_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1

I checked a few of these; they have no locus info and look like duplicate fragments?
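Entries without locus information can be listed by looking for TrEMBL headers that lack a GN= field. The sketch below does only that; the function name is hypothetical, and treating a missing GN= field as "no locus info" is an assumption. Confirming that the entries really are duplicates would additionally require comparing their sequences, which this sketch does not do.

```python
def trembl_without_gene(headers):
    """Return accessions of tr entries whose header has no GN= field."""
    no_gene = []
    for line in headers:
        line = line.lstrip(">").strip()
        if line.startswith("tr|") and " GN=" not in line:
            no_gene.append(line.split("|")[1])
    return no_gene


headers = [
    "tr|M0QZK8|M0QZK8_HUMAN Uncharacterized protein "
    "OS=Homo sapiens OX=9606 PE=4 SV=1",
    "tr|M0QZQ0|M0QZQ0_HUMAN Uncharacterized protein "
    "OS=Homo sapiens OX=9606 PE=4 SV=1",
    "tr|A8KAH6|A8KAH6_HUMAN Heat shock 27kDa protein 2, isoform CRA_a "
    "OS=Homo sapiens OX=9606 GN=HSPB2-C11orf52 PE=2 SV=1",
]
missing_gene = trembl_without_gene(headers)
print(len(missing_gene), missing_gene)
```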

ValWood commented 6 years ago

Added to "Representing complete proteomes in GO": http://wiki.geneontology.org/index.php/2018_Montreal_GOC_Meeting_Agenda#Representing_complete_proteomes_in_GO_.28added_by_Val.29