bigbio / pgt-pangenome

Proteogenomics analysis based on Pangenome references
BSD 2-Clause "Simplified" License

Validation notebook for GCA proteins + canonical proteins. #5

Closed: ypriverol closed 3 months ago

ypriverol commented 9 months ago

The validation notebook for GCA with canonical proteins (GCA proteins + GRCh38) is under development here: https://github.com/bigbio/pgt-pangenome/blob/main/gca_canonical_validation.ipynb. The notebook should perform the following tasks:

ypriverol commented 9 months ago

Check if peptides are included in PeptideAtlas and GPMDB:

Some of the peptides we have identified as GCA novel peptides, which are not included in the UniProt and Ensembl databases, are present as identified peptides in resources such as PeptideAtlas or GPMDB. For example, GASDVLLQVETIAQEHSTLSQQVDEK is identified in PeptideAtlas: https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptide?atlas_build_id=550&searchWithinThis=Peptide+Name&searchForThis=PAp03147156&action=QUERY

Taking this into account, I suggest adding the following steps to the notebook:

Check the peptides in PeptideAtlas:

All peptides from PeptideAtlas can be downloaded from: https://peptideatlas.org/tmp/PeptideAtlasInput_concat.PAidentlist.peptideSummary.gz. The file contains the following columns:

PeptideAccession, # observations, best score, peptide sequence

For every GCA peptide that is found in the PeptideAtlas build, we should retrieve this information. Note: not every PSM needs to be searched, only the unique peptide sequences, without PTMs.
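The lookup could be sketched roughly as follows. This is only a minimal sketch under several assumptions that I have not verified against the file: that it is tab-separated, that the four columns appear in the order listed above, and that modifications in our GCA peptide strings are written in square brackets or parentheses. The function names (`strip_modifications`, `read_peptideatlas`, `lookup_in_peptideatlas`) are hypothetical helpers, not existing notebook code.

```python
import csv
import gzip
import re


def strip_modifications(peptide):
    """Return the bare residue sequence.

    Assumption: PTMs are annotated in [...] or (...); anything that is not
    an uppercase letter after removing those annotations is dropped too.
    """
    without_annotations = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", peptide)
    return re.sub(r"[^A-Z]", "", without_annotations)


def read_peptideatlas(path):
    """Yield rows from the gzipped summary file (assumed tab-separated)."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def lookup_in_peptideatlas(rows, gca_peptides):
    """Map each unique, PTM-free GCA sequence found in PeptideAtlas to its record.

    `rows` is any iterable of [accession, n_observations, best_score, sequence].
    Values are kept as strings because the exact column formats are not confirmed.
    """
    wanted = {strip_modifications(p) for p in gca_peptides}
    hits = {}
    for row in rows:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        accession, n_observations, best_score, sequence = row[:4]
        if sequence in wanted:
            hits[sequence] = {
                "accession": accession,
                "n_observations": n_observations,
                "best_score": best_score,
            }
    return hits


# Example call (the local file name is whatever the download was saved as):
# hits = lookup_in_peptideatlas(
#     read_peptideatlas("PeptideAtlasInput_concat.PAidentlist.peptideSummary.gz"),
#     gca_peptides,
# )
```

Streaming the file row by row and keeping only the matching sequences avoids loading the whole PeptideAtlas list into memory.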

GPMDB search peptides

Similar to PeptideAtlas, we should search for the peptides in GPMDB. In this case we can use the following API: https://rest.thegpm.org/1/peptide/count/seq=GASDVLLQVETIAQEHSTLSQQVDEK. The JSON response will be the number of observations.
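A possible way to query this endpoint for each unique sequence is sketched below. It uses only the URL pattern given above; the exact shape of the JSON body is an assumption (the parser accepts either a bare number or a one-element list), and the helper names are hypothetical.

```python
import json
import urllib.request

# URL pattern taken from the example above.
GPMDB_COUNT_URL = "https://rest.thegpm.org/1/peptide/count/seq={sequence}"


def build_gpmdb_url(sequence):
    """Build the count URL for one PTM-free peptide sequence."""
    return GPMDB_COUNT_URL.format(sequence=sequence)


def parse_count(body):
    """Parse the JSON body into an int.

    Assumption: the body is either a number or a list whose first item
    is the number of observations; an empty list is treated as 0.
    """
    value = json.loads(body)
    if isinstance(value, list):
        value = value[0] if value else 0
    return int(value)


def fetch_text(url, timeout=30):
    """Download a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def gpmdb_observation_counts(sequences, fetch=fetch_text):
    """Return {sequence: observation count}, querying each unique sequence once."""
    counts = {}
    for sequence in sorted(set(sequences)):
        counts[sequence] = parse_count(fetch(build_gpmdb_url(sequence)))
    return counts


# Example call (performs one HTTP request per unique sequence):
# counts = gpmdb_observation_counts(["GASDVLLQVETIAQEHSTLSQQVDEK"])
```

The `fetch` argument is there so the notebook can swap in a cached or rate-limited downloader without changing the rest of the code.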

Final thoughts

Please feel free to add your thoughts to the discussion.