compomics / DeepLC

DeepLC: Retention time prediction for (modified) peptides using Deep Learning.
https://iomics.ugent.be/deeplc
Apache License 2.0

Error when using unmodified peptide to train and using modified peptides to test #48

Closed WeiqiangChen closed 2 years ago

WeiqiangChen commented 2 years ago

I am using the DeepLC application installed on Windows. Training with unmodified peptides from MaxQuant evidence.txt and testing with unmodified peptides gives good results. However, training with unmodified peptides and testing with modified peptides produces the following error:

Traceback (most recent call last):
  File "pandas\core\indexes\base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
    cpdef get_loc(self, object val):
  File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
    return self.mapping.get_item(val)
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'modifications'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "deeplc\gui.py", line 38, in <module>
    start_gui()
  File "gooey\python_bindings\gooey_decorator.py", line 134, in <lambda>
    return lambda *args, **kwargs: func(*args, **kwargs)
  File "deeplc\gui.py", line 35, in start_gui
    main(gui=True)
  File "deeplc\__main__.py", line 65, in main
    run(**vars(argu))
  File "deeplc\__main__.py", line 155, in run
    preds = dlc.make_preds(seq_df=df_pred)
  File "deeplc\deeplc.py", line 862, in make_preds
    temp_preds = self.make_preds_core(
  File "deeplc\deeplc.py", line 455, in make_preds_core
    seq_df["idents"] = seq_df["seq"] + "|" + seq_df["modifications"]
  File "pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'modifications'

My training CSV:

seq,modifications,tr
ISDAGEVVAIAR,,4013.04
ATMQNLNDR,,1882.4399999999998
TTTTTTTVVTQK,,1673.8799999999999
.......

The modified-peptide CSV:

seq,modification,tr
AARPLVTVYDEK,1|Acetyl,4367.64
ADFDTNPTSLYSIK,1|Acetyl,7029
AHIVQTHK,1|Acetyl,1314.48

Another modified-peptide CSV produced the same error:

seq,modification,tr
AAAESIQMR,8|Oxidation,1353
AASVGPTMR,8|Oxidation,1264.26
ADLEMQIESLK,5|Oxidation,5267.34

Is it possible to train DeepLC using unmodified peptides and test with modified peptides?

RobbinBouwmeester commented 2 years ago

Dear WeiqiangChen,

That is definitely possible! However, the column names in the example you posted are wrong: the column "modification" should be "modifications". Otherwise DeepLC looks for a column that does not exist.
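The rename described above can be sketched with pandas before handing the file to DeepLC. This is a minimal illustration, not part of DeepLC: the helper name `fix_deeplc_columns` is hypothetical, and the required column list is inferred only from the traceback and CSV headers in this thread.

```python
import io

import pandas as pd

# Columns read in the traceback line
#   seq_df["idents"] = seq_df["seq"] + "|" + seq_df["modifications"]
# (assumption: other columns such as "tr" are checked elsewhere).
REQUIRED_COLUMNS = ["seq", "modifications"]


def fix_deeplc_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rename 'modification' and check required columns."""
    df = df.rename(columns={"modification": "modifications"})
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required column(s): {missing}")
    # Empty fields are read as NaN; use "" so string concatenation stays a string.
    df["modifications"] = df["modifications"].fillna("")
    return df


# The modified-peptide CSV from this thread, with the misspelled header.
csv_text = """seq,modification,tr
AARPLVTVYDEK,1|Acetyl,4367.64
ADFDTNPTSLYSIK,1|Acetyl,7029
AHIVQTHK,1|Acetyl,1314.48
"""

df = fix_deeplc_columns(pd.read_csv(io.StringIO(csv_text)))
print(list(df.columns))
```

Running a check like this on both the training and the test file catches the mismatch before the GUI raises a KeyError.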

If you are interested in retraining a model, also keep an eye on this repo: https://github.com/RobbinBouwmeester/DeepLCRetrainer

Soon I will launch that code (with GUI) that should enable retraining/transfer learning in an easier way.

Hope that helped!

Kind regards,

Robbin

WeiqiangChen commented 2 years ago

Dear Robbin,

Thanks for the reply. DeepLC works now. I trained it with unmodified peptides from MaxQuant evidence.txt. To get the apex retention time for each modified_sequence, I grouped by modified_sequence and kept the row with the highest intensity (group_by modified_sequence, slice_max(order_by = intensity, n = 1)). For the oxidized peptides below, the average (predicted_tr - tr) is 7.9 min.

seq,modifications,tr
AAAESIQMR,8|Oxidation,1353
AASVGPTMR,8|Oxidation,1264.26
ADLEMQIESLK,5|Oxidation,5267.34

[image]
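For readers working in Python rather than R, the apex selection described above (one row per modified sequence, chosen by maximum intensity) can be sketched with pandas. The column names "Modified sequence", "Intensity" and "Retention time" are assumed to match the evidence.txt headers, the function name is hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd


def apex_retention_times(evidence: pd.DataFrame) -> pd.DataFrame:
    """Keep, for each modified sequence, the row with the highest intensity."""
    # Rows without an intensity cannot be ranked, so drop them first.
    evidence = evidence.dropna(subset=["Intensity"])
    apex_idx = evidence.groupby("Modified sequence")["Intensity"].idxmax()
    return evidence.loc[apex_idx].reset_index(drop=True)


# Toy rows standing in for evidence.txt (values are illustrative only).
evidence = pd.DataFrame(
    {
        "Modified sequence": ["AAAESIQMR", "AAAESIQMR", "AASVGPTMR"],
        "Intensity": [1.0e6, 5.0e6, 2.0e6],
        "Retention time": [22.4, 22.6, 21.1],
    }
)

apex = apex_retention_times(evidence)
print(apex)
```

Ties in intensity resolve to the first matching row, which is worth keeping in mind if the same precursor appears in several raw files.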

I also tested acetylated peptides. There, the average (predicted_tr - tr) is -20 min.

seq,modifications,tr
AARPLVTVYDEK,1|Acetyl,4367.64
ADFDTNPTSLYSIK,1|Acetyl,7029
AHIVQTHK,1|Acetyl,1314.48
AQHPLVQR,1|Acetyl,1989.42

[image]

Did I make a mistake somewhere?
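The average (predicted_tr - tr) quoted above is a signed mean, so it mainly shows a systematic shift; positive and negative errors can cancel. A minimal sketch that reports the signed mean next to the mean absolute error is given below. The function name and the numbers are hypothetical, and results are in whatever unit the inputs use.

```python
from statistics import mean


def error_summary(observed, predicted):
    """Return (signed mean error, mean absolute error) of predicted - observed."""
    errors = [p - o for o, p in zip(observed, predicted)]
    return mean(errors), mean(abs(e) for e in errors)


# Illustrative retention times, not taken from the thread.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 19.0, 33.0]

signed, mae = error_summary(observed, predicted)
print(f"signed mean error: {signed:.2f}, mean absolute error: {mae:.2f}")
```

A signed mean far from zero combined with a similar absolute error points to a consistent offset for that modification rather than random scatter.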

RobbinBouwmeester commented 2 years ago

It could be that the current models you use are not able to extrapolate to your modifications. Could you try these models:

https://github.com/RobbinBouwmeester/DeepLCModels/blob/main/full_hc_mod_deeplc_train_filtered_1fd8363d9af9dcad3be7553c39396960.hdf5
https://github.com/RobbinBouwmeester/DeepLCModels/blob/main/full_hc_mod_deeplc_train_filtered_8c22d89667368f2f02ad996469ba157e.hdf5
https://github.com/RobbinBouwmeester/DeepLCModels/blob/main/full_hc_mod_deeplc_train_filtered_cb975cfdd4105f97efa0b3afffe075cc.hdf5

WeiqiangChen commented 2 years ago

Model 2 (the file ending in 469ba157e.hdf5) gave the best predictions for oxidized peptides when training with unmodified peptides.

[image]

All three models gave poor predictions for acetylated peptides when training with unmodified peptides.

[image]

RobbinBouwmeester commented 2 years ago

I see... Are these N-terminally acetylated peptides? Officially, we do not support terminal modifications. You can include them at position 0 or 1 (the modification will default to the rest group of the first AA), but this is likely to be suboptimal...
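To see how many test peptides are affected, rows carrying an acetyl group at position 0 or 1 can be flagged before prediction. The sketch below assumes the "modifications" field is a pipe-separated sequence of position and name pairs, as in the CSV rows in this thread; both function names are hypothetical and not part of DeepLC.

```python
def parse_modifications(field: str):
    """Split 'pos|name|pos|name' into [(pos, name), ...]; '' gives []."""
    if not field:
        return []
    parts = field.split("|")
    if len(parts) % 2 != 0:
        raise ValueError(f"Expected position|name pairs, got: {field!r}")
    return [(int(parts[i]), parts[i + 1]) for i in range(0, len(parts), 2)]


def has_terminal_acetyl(field: str, name: str = "Acetyl") -> bool:
    """True if the named modification sits at position 0 or 1."""
    return any(
        pos in (0, 1) and mod == name for pos, mod in parse_modifications(field)
    )


# Modification strings as they appear in the CSV rows above.
for field in ["1|Acetyl", "8|Oxidation", ""]:
    print(repr(field), has_terminal_acetyl(field))
```

Note that position 1 alone cannot distinguish an N-terminal acetylation from a modification on the first residue's side chain, so the flag is only a screening aid.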

WeiqiangChen commented 2 years ago

Yes, these are acetylations (protein N-term).

RobbinBouwmeester commented 2 years ago

OK, that is likely to be the problem. In that case, I would recommend retraining a model with many acetylated termini, so it will "force fit" them into the current DeepLC. Feel free to contact me via e-mail (robbin.bouwmeester[at]ugent.be) about the details. Can I close this issue now?