Validation notebook for GCA proteins Canonical proteins.

The notebook for GCA with canonical (GCA proteins + GRCh38) is under development at https://github.com/bigbio/pgt-pangenome/blob/main/gca_canonical_validation.ipynb. The notebook should perform the following tasks:

[x] Load the parquet files in use from the PRIDE FTP. Don't download a file if a local copy is available.
[x] Re-map peptides to UniProt ENSEMBL+Swiss-Prot and ENSEMBL proteins, and mark proteins as novel GCA and GRCh.
[x] Plot the score distributions for canonical, GCA and decoys in each category.
[x] Convert the GCA and GRCh lists for analysis with other notebooks and offline tools like DeepLC. Note: the notebook should not contain the code that runs DeepLC or similar tools.
[ ] Run DeepLC outside the notebook, but plot the results in the notebook. Issue -> #6
[ ] Check whether the unique GCA peptide sequences are found in PeptideAtlas or GPMDB. Please read the issue, in particular the following comment 👇 https://github.com/bigbio/pgt-pangenome/issues/5#issuecomment-1801414273
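The "don't download if a local version is available" rule from the first task could be sketched as below. This is only an illustration: `fetch_if_missing` and the `data` directory are hypothetical names, and the actual PRIDE FTP paths are not given in this issue, so no URL is filled in.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url, local_dir="data", download=urlretrieve):
    """Return a local path for `url`, downloading only when no local copy exists.

    `fetch_if_missing` and `local_dir` are hypothetical names for this sketch.
    """
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    local_path = local_dir / url.rstrip("/").split("/")[-1]
    if not local_path.exists():
        # No local copy yet: fetch the file once and reuse it afterwards.
        download(url, str(local_path))
    return local_path
```

The returned path can then be passed to `pandas.read_parquet`, assuming pandas and a parquet engine are installed in the notebook environment.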

Check if peptides are included in PeptideAtlas and GPMDB:

Some of the peptides we have identified as novel GCA peptides (not included in the UniProt and ENSEMBL databases) are already reported as identified peptides in databases such as PeptideAtlas or GPMDB. For example, GASDVLLQVETIAQEHSTLSQQVDEK is identified in PeptideAtlas: https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptide?atlas_build_id=550&searchWithinThis=Peptide+Name&searchForThis=PAp03147156&action=QUERY

Taking this into account, I suggest adding the following steps to the notebook:

Check the peptides in PeptideAtlas:

All peptides from PeptideAtlas can be downloaded from: https://peptideatlas.org/tmp/PeptideAtlasInput_concat.PAidentlist.peptideSummary.gz. The file contains the following columns:

PeptideAccession, # observations, best score, peptide sequence

For every GCA peptide found in the PeptideAtlas build, we should retrieve this information. Note: not all PSMs need to be searched, only the unique peptide sequences without PTMs.
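A minimal sketch of this lookup follows. It assumes the summary file is gzip-compressed, tab-separated text with the four columns in the order listed above; the delimiter and the function name `peptideatlas_hits` are assumptions, not something confirmed by PeptideAtlas documentation.

```python
import csv
import gzip


def peptideatlas_hits(summary_path, gca_sequences):
    """Map each GCA sequence found in the PeptideAtlas summary to its record.

    Hypothetical helper: the tab delimiter and column order are assumptions.
    """
    wanted = set(gca_sequences)  # unique, unmodified sequences only
    hits = {}
    with gzip.open(summary_path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Assumed column order: accession, n_observations, best_score, sequence.
            # Short rows and any header line are skipped by this check.
            if len(row) < 4 or row[3] not in wanted:
                continue
            accession, n_obs, best_score, sequence = row[:4]
            hits[sequence] = {
                "accession": accession,
                "n_observations": int(n_obs),
                "best_score": float(best_score),
            }
    return hits
```

Streaming the file row by row avoids loading the whole PeptideAtlas list into memory; only the GCA peptides that match are kept.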

GPMDB search peptides

As with PeptideAtlas, we should search for the peptides in GPMDB. In this case we can use the following API: https://rest.thegpm.org/1/peptide/count/seq=GASDVLLQVETIAQEHSTLSQQVDEK. The JSON response is the number of observations.
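The query could be wrapped as in the sketch below. The endpoint is the one quoted above; the exact shape of the JSON payload is an assumption (the code accepts either a bare number or a one-element list), and the helper names `parse_count` and `gpmdb_observations` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

GPMDB_COUNT_URL = "https://rest.thegpm.org/1/peptide/count/seq={sequence}"


def _http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def parse_count(body):
    """Extract the observation count from the JSON body."""
    value = json.loads(body)
    # Assumption: the payload is a bare number or a one-element list.
    if isinstance(value, list):
        value = value[0] if value else 0
    return int(value)


def gpmdb_observations(sequence, fetch=_http_get):
    """Return the number of GPMDB observations for an unmodified sequence."""
    return parse_count(fetch(GPMDB_COUNT_URL.format(sequence=quote(sequence))))
```

Passing `fetch` as a parameter keeps the network call replaceable, which makes it easy to cache responses or to add a delay between requests when querying many peptides.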

Final thoughts

The most interesting peptides are the ones not previously found in PeptideAtlas or GPMDB.
For the ones found in PeptideAtlas/GPMDB with a low number of observations (fewer than 10), we should explore them and identify the source (proteins) of the peptides.
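These two rules could be expressed as a small triage function, sketched below. How to combine the two databases is not specified in this issue; taking the larger of the two counts is an assumption, and the function name and category labels are hypothetical.

```python
LOW_OBSERVATION_THRESHOLD = 10  # "fewer than 10" observations counts as low


def triage(peptideatlas_obs, gpmdb_obs, threshold=LOW_OBSERVATION_THRESHOLD):
    """Classify a peptide from its observation counts (0 means not found)."""
    if peptideatlas_obs == 0 and gpmdb_obs == 0:
        # Not seen in either database: the most interesting candidates.
        return "not_previously_observed"
    # Assumption: use the larger of the two counts as the evidence level.
    if max(peptideatlas_obs, gpmdb_obs) < threshold:
        return "low_evidence"
    return "previously_observed"
```

For example, a peptide with 3 observations in PeptideAtlas and none in GPMDB would land in `low_evidence` and be a candidate for checking its source proteins.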

Please feel free to add your thoughts to the discussion.
