ValWood closed 5 years ago
Question 3
[ ] 48 entries are described as non-functional fragments. Should these be in the reference proteome?
tr|A0A075B6V9|A0A075B6V9_HUMAN T cell receptor alpha joining 59 (non-functional) (Fragment) OS=Homo sapiens OX=9606 GN=TRAJ59 PE=4 SV=1
tr|A0A075B6Y2|A0A075B6Y2_HUMAN T cell receptor alpha joining 35 (non-functional) (Fragment) OS=Homo sapiens OX=9606 GN=TRAJ35 PE=4 SV=1
tr|A0A075B6Y4|A0A075B6Y4_HUMAN T cell receptor alpha joining 19 (non-functional) (Fragment) OS=Homo sapiens OX=9606 GN=TRAJ19 PE=4 SV=1
[ ] A further 344 are described as "fragment" (mostly T cell receptor alpha joining fragments). Are these supposed to be included here?
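For anyone who wants to reproduce counts like these, here is a minimal sketch that classifies header lines by the "(Fragment)" and "(non-functional)" tags in the protein name. The function names are hypothetical, and it assumes the headers follow the layout shown above, with the protein name ending at " OS=". It is not necessarily how the numbers in this issue were obtained.

```python
from collections import Counter


def classify_header(header: str) -> str:
    """Label one FASTA header as 'non-functional fragment', 'fragment', or 'other'."""
    # Assumption: the protein name is everything before the first ' OS=' field.
    description = header.split(" OS=", 1)[0]
    is_fragment = "(Fragment)" in description
    if is_fragment and "(non-functional)" in description:
        return "non-functional fragment"
    if is_fragment:
        return "fragment"
    return "other"


def count_fragment_classes(headers) -> Counter:
    """Count how many headers fall into each class."""
    return Counter(classify_header(h) for h in headers)
```

Passing the header lines of the downloaded file (the lines starting with ">") to `count_fragment_classes` should give per-class totals that can be compared with the numbers quoted here.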
Question 4
Why is the fragment from TrEMBL sometimes included? See this example for MIC10: the full-length protein is in UniProt, and the TrEMBL fragment is also included in the reference proteome.
sp|Q5TGZ0|MIC10_HUMAN MICOS complex subunit MIC10 OS=Homo sapiens OX=9606 GN=MINOS1 PE=1 SV=1
tr|R4GNA1|R4GNA1_HUMAN MICOS complex subunit MIC10 (Fragment) OS=Homo sapiens OX=9606 GN=MINOS1-NBL1 PE=3 SV=1
@selewis you might know something about this from PAINT work?
Who should we ask?
Question 5
Why are isoforms from TrEMBL included? For example:
tr|A0A024R1R8|A0A024R1R8_HUMAN HCG2014768, isoform CRA_a OS=Homo sapiens OX=9606 GN=hCG_2014768 PE=4 SV=1
is https://www.uniprot.org/uniprot/Q9Y2S6 (it doesn't seem to be an isoform; it's the same locus).
The same applies to these TrEMBL entries:
tr|A0A024R1R8|A0A024R1R8_HUMAN HCG2014768, isoform CRA_a OS=Homo sapiens OX=9606 GN=hCG_2014768 PE=4 SV=1
tr|A0A0A6YYC7|A0A0A6YYC7_HUMAN HCG2042749, isoform CRA_d OS=Homo sapiens OX=9606 GN=ZFP91-CNTF PE=4 SV=1
tr|A0A0B4J2F2|A0A0B4J2F2_HUMAN SNF1-like kinase, isoform CRA_a OS=Homo sapiens OX=9606 GN=LOC102724428 PE=4 SV=1
tr|A0A1B0GTB2|A0A1B0GTB2_HUMAN HCG2038094, isoform CRA_a OS=Homo sapiens OX=9606 GN=TUNAR PE=4 SV=2
tr|A0A1B0GVL6|A0A1B0GVL6_HUMAN HCG1800530, isoform CRA_b OS=Homo sapiens OX=9606 GN=TMEM238L PE=4 SV=1
tr|A0A1W2PN81|A0A1W2PN81_HUMAN Neuronal acetylcholine receptor subunit alpha-7 isoform 2 precursor OS=Homo sapiens OX=9606 PE=3 SV=1
tr|A0A1W2PR95|A0A1W2PR95_HUMAN HCG1642624, isoform CRA_a OS=Homo sapiens OX=9606 GN=hCG_1642624 PE=4 SV=1
tr|A8KAH6|A8KAH6_HUMAN Heat shock 27kDa protein 2, isoform CRA_a OS=Homo sapiens OX=9606 GN=HSPB2-C11orf52 PE=2 SV=1
Question 6
Lots of TrEMBL duplicates (417).
Most look like duplicates and have no locus information?
tr|M0QZK8|M0QZK8_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0QZQ0|M0QZQ0_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0QZU9|M0QZU9_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R036|M0R036_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R129|M0R129_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R143|M0R143_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R2N4|M0R2N4_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
tr|M0R2T5|M0R2T5_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
I checked a few of these and they have no locus info and look like duplicate fragments?
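One way to list every entry without locus information, rather than spot-checking, is to look for a missing GN= field in the header. This is a rough sketch with hypothetical function names. It assumes the gene name, when present, appears as a single whitespace-free token after "GN=", which holds for the headers quoted in this issue but may not hold for every entry.

```python
import re

# Assumption: gene names contain no whitespace, as in the headers quoted above.
GN_FIELD = re.compile(r"\sGN=(\S+)")


def gene_name(header: str):
    """Return the GN= value of a FASTA header, or None if the field is absent."""
    match = GN_FIELD.search(header)
    return match.group(1) if match else None


def headers_without_gene(headers):
    """Return the headers that carry no GN= field (no locus information)."""
    return [h for h in headers if gene_name(h) is None]
```

Note that this only finds entries lacking a gene name; deciding whether they are true duplicates would still need a sequence comparison.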
Added to Representing complete proteomes in GO http://wiki.geneontology.org/index.php/2018_Montreal_GOC_Meeting_Agenda#Representing_complete_proteomes_in_GO_.28added_by_Val.29
I am using the FASTA file from https://www.ebi.ac.uk/reference_proteomes because I think this is the dataset that GO uses.
I downloaded the human file, created April 2017, which has 21042 entries.
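To double-check the entry count, and to see how it splits between reviewed (sp) and TrEMBL (tr) entries, something like the following could be run over the file. It is only a sketch: the function name is hypothetical, and it assumes every entry starts with a ">db|accession|name" header line.

```python
from collections import Counter


def count_entries_by_db(lines) -> Counter:
    """Count FASTA entries per database prefix (e.g. 'sp' or 'tr')."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            # The database prefix is the text between '>' and the first '|'.
            db = line[1:].split("|", 1)[0]
            counts[db] += 1
    return counts
```

An open file handle can be passed directly as `lines`, and `sum(counts.values())` gives the total number of entries.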
[ ] Question 2. Why does this file contain 16 proteins flagged as pseudogenes? Is that intentional?
[ ] Why are these features present in multiple?