bigbio / pgt-pangenome

Proteogenomics analysis based on pangenome references
BSD 2-Clause "Simplified" License

Discussion about Uniprot extended proteomes. #7

Closed: ypriverol closed this issue 4 months ago

ypriverol commented 10 months ago

@husensofteng @DongdongdongW :

I was checking some of the novel peptides. For example, FYPQSLQLTWLENGNVCQR maps to protein https://www.uniprot.org/uniprotkb/A8K9N0/entry. That protein is not in the UniProt proteome (the 82k proteins you have there) but is in a UniProt file that contains 229959 proteins. From talking to the UniProt team, those proteins are an extended version of UniProt that includes, for example, predicted proteins and entries with a low annotation score. I'm wondering what to do with those peptides. How do we want to analyze them?

I will talk to the UniProt team to find out whether we can determine programmatically where these extended proteins come from.

ypriverol commented 10 months ago

Actually, an update on this topic: the full database of all human proteins in UniProt, including predicted ones etc., can be downloaded from here: https://rest.uniprot.org/uniprotkb/stream?compressed=false&format=fasta&includeIsoform=true&query=%28Human%29+AND+%28model_organism%3A9606%29
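For anyone who wants to script the download, here is a minimal sketch using only the Python standard library. It rebuilds the same URL from its query parameters so the query is easier to read and change. The function names and the output file name are hypothetical, not something from the repo.

```python
import shutil
import urllib.request
from urllib.parse import urlencode

# Endpoint and parameters are taken from the URL in the comment above.
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def build_uniprot_fasta_url(query="(Human) AND (model_organism:9606)"):
    """Return the streaming URL for an uncompressed FASTA with isoforms."""
    params = {
        "compressed": "false",
        "format": "fasta",
        "includeIsoform": "true",
        "query": query,
    }
    # urlencode percent-encodes the parentheses and colon, and turns spaces into '+'.
    return f"{UNIPROT_STREAM}?{urlencode(params)}"


def download_fasta(path="uniprot_human_full.fasta"):
    """Stream the FASTA to disk (hypothetical file name; requires network access)."""
    with urllib.request.urlopen(build_uniprot_fasta_url()) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path
```

Calling `download_fasta()` would write the full file locally; nothing is fetched until that function is called.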

We should search our GCA peptides against that database. In the cases where we find a protein containing a GCA peptide, we should extract the PE= field from the description. This is the protein existence (PE) level in UniProt (https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/pe_criteria.txt), which indicates, for example, transcript-level evidence or a predicted protein. We should produce a plot of the distribution, similar to the PeptideAtlas one, with the following categories:

No mapping to UniProt, PE=1, PE=2, PE=3, PE=4, PE=5. The values will be the number of peptides that map to proteins within each category. For example, if a peptide does not map to any protein in the full human proteome, count it under "No mapping to UniProt"; if the peptide maps to a protein with PE=1, count it under PE=1; and so on.
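A rough sketch of how the counting could work, assuming exact substring matching of each peptide against the protein sequences and assuming the FASTA headers carry a `PE=<digit>` token. One thing the description above leaves open is what to do when a peptide maps to several proteins with different PE values; this sketch assigns the peptide to the best-supported (lowest) PE level, which is my assumption and can be changed. All function names here are hypothetical.

```python
import re
from collections import Counter

# Matches the protein existence token in a UniProt FASTA header, e.g. "PE=2".
PE_PATTERN = re.compile(r"\bPE=([1-5])\b")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def protein_existence(header):
    """Return the PE level (1-5) from a header, or None if it is absent."""
    match = PE_PATTERN.search(header)
    return int(match.group(1)) if match else None


def pe_distribution(peptides, fasta_lines):
    """Count peptides per category: 'No mapping to UniProt' or 'PE=1'..'PE=5'."""
    proteins = [(protein_existence(h), seq) for h, seq in read_fasta(fasta_lines)]
    counts = Counter()
    for peptide in peptides:
        # Exact substring match; no I/L equivalence or enzyme rules (assumption).
        levels = [pe for pe, seq in proteins if pe is not None and peptide in seq]
        # Assumption: a peptide hitting several proteins gets the lowest PE level.
        category = f"PE={min(levels)}" if levels else "No mapping to UniProt"
        counts[category] += 1
    return counts
```

With the downloaded FASTA opened as a file object, `pe_distribution(gca_peptides, fasta_file)` returns a `Counter` that can be fed straight into a bar plot. The linear scan over all proteins per peptide is slow for large peptide lists, so an index would be worth adding for the real analysis.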

@DongdongdongW we can add this analysis to the PeptideAtlas + GPMDB notebook.

ypriverol commented 10 months ago

I had a discussion with the UniProt team. This extended proteome consists of proteins translated from RNA submissions to ENA. Basically, we will be able to find some of the GCA proteins there.