Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

fine-tune the massivekb-pretrained model with additional modifications #267

Closed irleader closed 10 months ago

irleader commented 10 months ago

Hi,

The new release, v3.5.0, says that "Specifying custom residues to retrain Casanovo is now possible." I would therefore like to fine-tune/retrain the massivekb-pretrained model. Is that possible?

My new training dataset contains two extra modifications, so I added them to config.yaml, which increases the total amino acid vocabulary size from 29 to 31.
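For reference, the additions look roughly like this (the residue names and masses below are illustrative placeholders, not my actual modifications):

```yaml
# Excerpt from config.yaml: the residues section maps each amino acid
# token (optionally carrying a modification) to its monoisotopic mass.
residues:
  "M+15.995": 147.035400   # oxidized Met (an existing default entry)
  # Two example additions (replace with your own modifications):
  "S+79.966": 166.998359   # phospho-Ser
  "T+79.966": 181.014010   # phospho-Thr
```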

Usually the decoder and final softmax layer need to be resized accordingly. How can this be done in Casanovo v3.5.0?

If that is not possible, can I instead replace two existing modifications with the two extra modifications I want (which keeps the AA vocabulary size unchanged at 29) and retrain the massivekb-pretrained model?

I am looking forward to your reply.

Best regards

bittremieux commented 10 months ago

Unfortunately, modifying the amino acid alphabet, including adding or changing modifications, requires training a new model from scratch.

As you correctly remarked, the vocabulary size needs to match the model dimensions, and this can't be changed after a model has been trained. Moreover, changing which modifications the config file defines won't change the patterns Casanovo has learned for the previous modifications, so although reusing the old weights is technically feasible, it won't give the desired results.
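For intuition, here is a minimal PyTorch sketch (the dimensions and layer structure are assumptions for illustration, not Casanovo's actual internals) of why a checkpoint trained with a 29-token alphabet cannot simply be loaded into a 31-token model:

```python
import torch.nn as nn

d_model = 512    # assumed decoder hidden dimension
old_vocab = 29   # alphabet size the pretrained model was trained with
new_vocab = 31   # alphabet size after adding two modifications

# The final layer maps each hidden state to one logit per vocabulary
# token, so its weight matrix has shape (vocab_size, d_model).
old_head = nn.Linear(d_model, old_vocab)
new_head = nn.Linear(d_model, new_vocab)

# Loading the pretrained 29-token weights into a 31-token head fails
# with a size-mismatch error:
try:
    new_head.load_state_dict(old_head.state_dict())
except RuntimeError as err:
    print("size mismatch:", err)
```

The same constraint applies to the embedding layer, which is why the vocabulary is effectively baked into the trained weights.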

Instead, the new modifications should be added to the config file and a new model trained from scratch. Of course, this requires that the training data include annotated spectra containing those modifications.

The release notes for v3.5 refer to a bug that would always overwrite custom modifications in the config file with the default settings. This has now been fixed, so training a custom model is now possible.