Closed ypriverol closed 3 months ago
Some of the peptides we have identified as novel GCA peptides (not present in the UniProt and Ensembl databases) have already been identified in databases such as PeptideAtlas or GPMDB. For example, GASDVLLQVETIAQEHSTLSQQVDEK is identified in PeptideAtlas: https://db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptide?atlas_build_id=550&searchWithinThis=Peptide+Name&searchForThis=PAp03147156&action=QUERY
Taking this into account, I suggest adding the following steps to the notebook:
All PeptideAtlas peptides can be downloaded from: https://peptideatlas.org/tmp/PeptideAtlasInput_concat.PAidentlist.peptideSummary.gz
The file contains the following columns: `PeptideAccession`, `# observations`, `best score`, `peptide sequence`.
For every GCA peptide that is found in the PeptideAtlas build, we should retrieve this information. Note: not every PSM needs to be searched, only the unique peptide sequences, without PTMs.
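A minimal sketch of the PeptideAtlas lookup, assuming the downloaded file is tab-separated with the four columns in the order listed above, and that modified GCA peptides are written with bracketed or parenthesised annotations (for example `M[147]` or `M(Oxidation)`). The function names are hypothetical and are not part of the existing notebook.

```python
import csv
import gzip
import re

# Assumption: modifications appear as [...] or (...) annotations in the sequence.
_MOD_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def strip_ptms(peptide):
    """Return the bare sequence: drop annotations, keep upper-case residues only."""
    bare = _MOD_PATTERN.sub("", peptide)
    return re.sub(r"[^A-Z]", "", bare)


def load_peptideatlas(lines):
    """Map peptide sequence -> (accession, n_observations, best_score).

    `lines` is any iterable of text lines. Assumed column order:
    PeptideAccession, # observations, best score, peptide sequence.
    """
    atlas = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or truncated rows
        accession, n_obs, best_score, sequence = row[:4]
        try:
            atlas[sequence] = (accession, int(n_obs), float(best_score))
        except ValueError:
            continue  # skip a header line or a malformed row
    return atlas


def load_peptideatlas_gz(path):
    """Read the gzipped PeptideAtlas summary file from disk."""
    with gzip.open(path, "rt") as handle:
        return load_peptideatlas(handle)


def annotate_gca_peptides(gca_peptides, atlas):
    """Look up only the unique, unmodified GCA sequences found in PeptideAtlas."""
    unique = {strip_ptms(p) for p in gca_peptides}
    return {seq: atlas[seq] for seq in sorted(unique) if seq in atlas}
```

Collapsing the PSMs to a set of bare sequences before the lookup keeps the search to one dictionary access per unique peptide rather than one per PSM.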
As with PeptideAtlas, we should also search for the peptides in GPMDB. In this case we can use the following API: https://rest.thegpm.org/1/peptide/count/seq=GASDVLLQVETIAQEHSTLSQQVDEK
The JSON response is the number of observations.
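A possible sketch of the GPMDB query, using only the Python standard library. The URL pattern is the one given above; the exact shape of the JSON payload is an assumption (either a bare number or a one-element array), so `parse_count` accepts both. The function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# URL pattern taken from the example above.
GPMDB_COUNT_URL = "https://rest.thegpm.org/1/peptide/count/seq={seq}"


def gpmdb_count_url(sequence):
    """Build the count query URL for one unmodified peptide sequence."""
    return GPMDB_COUNT_URL.format(seq=quote(sequence))


def parse_count(payload):
    """Extract the observation count from the JSON text.

    Assumption: the payload is a bare number or a one-element array;
    an empty array is treated as zero observations.
    """
    data = json.loads(payload)
    if isinstance(data, list):
        data = data[0] if data else 0
    return int(data)


def fetch_gpmdb_count(sequence, timeout=30):
    """Query GPMDB for the number of observations of `sequence`."""
    with urlopen(gpmdb_count_url(sequence), timeout=timeout) as response:
        return parse_count(response.read().decode("utf-8"))
```

Since each peptide is a separate HTTP request, it probably makes sense to call `fetch_gpmdb_count` only on the unique unmodified sequences and to cache the results.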
Please feel free to add your thoughts to the discussion.
The notebook for GCA with canonical (GCA proteins + GRCh38) is under development here: https://github.com/bigbio/pgt-pangenome/blob/main/gca_canonical_validation.ipynb
The notebook should perform the following tasks:
- [...] `GCA` and `GRCH`.
- [...] `GCA` and decoys for each category.
- [...] `gca` and `grch` lists for the analysis with other notebooks and outline tools like DeepLC. Note: the notebook should not contain code to run DeepLC or other similar tools.