compomics / DeepLC

DeepLC: Retention time prediction for (modified) peptides using Deep Learning.
https://iomics.ugent.be/deeplc
Apache License 2.0

How to configure support for selenocysteine in its carbamidomethylated form? #62

Closed KarlClauser closed 1 year ago

KarlClauser commented 1 year ago

Selenocysteine (symbol Sec or U) is encoded in the human proteome in 72 instances. Selenocysteine is an analogue of the more common cysteine, with selenium in place of the sulfur. Sec reacts like Cys under common reduction/alkylation chemistries using iodoacetamide, yielding the carbamidomethyl form. In a list of HLA-I peptides from an immunopeptidomics experiment submitted to DeepLC I had one, IEHCTSURVY (SELENOH, selenoprotein H, ENSP00000373509.4), with no modification specified. It caused a Python call to DeepLC to crash with error 1 below. I think this means that selenium is not in the DeepLC list of elements, although the elemental composition for amino acid U is present.

So I looked into the DeepLC code and modified the following files to support amino acid U and element Se:

C:\ProgramData\Anaconda3\Lib\site-packages\deeplc\feat_extractor.py
C:\ProgramData\Anaconda3\Lib\site-packages\deeplc\aa_comp_rel.csv

This led to error 2 below. The 6 columns the model expects are for the original elements C, H, N, O, S and P; the 7th column came from my addition of Se. I think this means that the element Se cannot be supported without retraining the model on suitable training data. But given the rarity of selenocysteine, it seems unlikely that there will ever be enough examples to enable training.
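Both errors can be reproduced with a toy atom-count encoder. The sketch below is not DeepLC's actual code: `ELEMENTS`, `COMPOSITION` and this `encode_atoms` are illustrative stand-ins, and only the length of 60 and the six-element list are taken from the error messages in this issue. It shows that an unknown element raises `KeyError: 'Se'`, while adding Se as a seventh element changes the per-peptide matrix from 60 x 6 to 60 x 7, which a model trained on six columns will reject.

```python
# Toy stand-in for an atom-count encoder; NOT DeepLC's implementation.
ELEMENTS = ["C", "H", "N", "O", "S", "P"]  # the six elements named above

# Residue compositions (amino acid minus H2O) for the two residues used here.
COMPOSITION = {
    "C": {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1},   # cysteine
    "U": {"C": 3, "H": 5, "N": 1, "O": 1, "Se": 1},  # selenocysteine
}


def encode_atoms(seq, elements, max_len=60):
    """Return a max_len x len(elements) matrix of atom counts per position."""
    index = {el: i for i, el in enumerate(elements)}
    matrix = [[0] * len(elements) for _ in range(max_len)]
    for pos, aa in enumerate(seq):
        for atom, count in COMPOSITION[aa].items():
            # Raises KeyError if the atom has no column (as in error 1).
            matrix[pos][index[atom]] = count
    return matrix


# With six elements, "U" fails on the missing Se column.
try:
    encode_atoms("CU", ELEMENTS)
except KeyError as err:
    print("unsupported element:", err.args[0])

# With Se added, encoding works but the shape no longer matches a
# model trained on six columns (as in error 2).
wide = encode_atoms("CU", ELEMENTS + ["Se"])
print(len(wide), len(wide[0]))
```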

So to avoid crashes I went back to the input sequence and changed the U to a C, giving IEHCTSCRV with the C's specified as Carbamidomethyl, as chemically that was the closest I could get. Do you have any other suggestions?
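The substitution described above could be scripted along these lines. This is a minimal sketch: `substitute_sec` is a hypothetical helper, and the `position|name` pipe-separated modification string with 1-based positions is an assumption about DeepLC's input format that should be checked against the DeepLC documentation before use.

```python
def substitute_sec(seq, mod_name="Carbamidomethyl"):
    """Replace U with C and mark every C as modified.

    Returns the new sequence and a modification string in an assumed
    'position|name|position|name' format with 1-based positions.
    """
    new_seq = seq.replace("U", "C")
    mods = "|".join(
        f"{pos}|{mod_name}"
        for pos, aa in enumerate(new_seq, start=1)
        if aa == "C"
    )
    return new_seq, mods


new_seq, mods = substitute_sec("IEHCTSURVY")
print(new_seq, mods)
```

Note that this marks all cysteines, including those that were already C in the input, which matches a workflow where every Cys is alkylated.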

Error 1
  File "C:\ProgramData\Anaconda3\lib\site-packages\deeplc\feat_extractor.py", line 525, in encode_atoms
    matrix_pos[pn, dict_index_pos[atom]] = val
KeyError: 'Se'

Error 2
    File "C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\input_spec.py", line 295, in assert_input_compatibility
        raise ValueError(
    ValueError: Input 0 of layer "model_179" is incompatible with the layer: expected shape=(None, 60, 6), found shape=(None, 60, 7)

RobbinBouwmeester commented 1 year ago

Dear Karl,

Currently this is unsupported. The issue would (I think) also be getting sufficient examples with Se.

If you want to add Se to the elements it would probably be easiest to adjust the code in the retrainer (https://github.com/RobbinBouwmeester/DeepLCRetrainer/tree/main/deeplcretrainer). I can give you some pointers if you would like to make changes there. It might not be the best time investment, though, given the small number of examples with Se.

What I will do is catch the error, so that an amino acid containing Se no longer ends the execution of DeepLC. This is already done for modifications, but not for amino acids (with or without modifications).
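As an illustration of the idea only, and not the change that was made in DeepLC, unsupported peptides could be set aside before encoding instead of aborting the whole run. The names `SUPPORTED`, `COMPOSITION` and `split_supported` are hypothetical.

```python
# Hypothetical pre-filter; not DeepLC's actual error handling.
SUPPORTED = {"C", "H", "N", "O", "S", "P"}

# Residue compositions for the two residues used in this example.
COMPOSITION = {
    "C": {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1},   # cysteine
    "U": {"C": 3, "H": 5, "N": 1, "O": 1, "Se": 1},  # selenocysteine
}


def split_supported(peptides):
    """Separate peptides that can be encoded from those that cannot.

    Returns (ok, skipped); skipped holds (sequence, offending key) pairs,
    where the key is an unsupported element or an unknown amino acid.
    """
    ok, skipped = [], []
    for seq in peptides:
        try:
            for aa in seq:
                for atom in COMPOSITION[aa]:  # KeyError on unknown residue
                    if atom not in SUPPORTED:
                        raise KeyError(atom)
            ok.append(seq)
        except KeyError as err:
            # Record the problem and continue with the next peptide.
            skipped.append((seq, err.args[0]))
    return ok, skipped


ok, skipped = split_supported(["CC", "CU"])
print(ok, skipped)
```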

Kind regards,

Robbin

KarlClauser commented 1 year ago

I think catching the error is the only practical solution. I have only the one U-containing peptide so far from immunopeptidomics work. There are a few more in tryptic proteome datasets. I'll check how close the RT predictions are when Cys is virtually substituted for Sec. I expect they will typically be outliers.