Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

Download non-enzyme data; apply to top-down data #277

Closed wsnoble closed 10 months ago

wsnoble commented 11 months ago

N.B.: Copied from an email

I recently came across your article titled "Sequence-to-sequence translation from mass spectra to peptides with a transformer model". I am very interested in your algorithm and would like to run it on my own computer. However, I have encountered some issues, and hope you can help me with them.

(1) I would like to know if there are download links available for the non-enzymatic peptide datasets from MassIVE-KB and PROSPECT mentioned in your article?

(2) I am planning to use Casanovo for de novo sequence prediction in top-down proteomics. However, I am unsure whether I need to retrain Casanovo. I have attached my top-down protein data. Could you please take a look?

bittremieux commented 11 months ago
  1. The original MassIVE-KB and PROSPECT datasets are available online. Our extracted training data is currently not separately available, but you can derive it from these data sources.
  2. The current Casanovo model will not give meaningful results for top-down data. It saw no similar data during training (e.g. high charge states and very complex spectra), so it cannot handle such data at inference time. To use Casanovo for top-down data, you'd need to train it from scratch on a sufficiently large, high-quality dataset (1M+ spectra). However, given the complexity of the spectra and the massive search space of entire proteins, I'm not sure de novo sequencing for top-down data is even realistically possible.
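
As a rough way to see how far a dataset sits from typical bottom-up data, one could tally the precursor charge states in an MGF file before attempting inference. The sketch below is not part of Casanovo: the helper names `charge_histogram` and `fraction_above` and the `max_charge` cutoff are hypothetical, and the charge range the released model was actually trained on is not stated in this thread, so the cutoff has to be chosen by the user.

```python
import re
from collections import Counter

# Matches MGF precursor charge lines such as "CHARGE=2+" or "CHARGE=15+".
# If a line lists several charges (e.g. "CHARGE=2+ and 3+"), only the first is used.
_CHARGE_RE = re.compile(r"^CHARGE=(\d+)", re.IGNORECASE)


def charge_histogram(mgf_lines):
    """Count precursor charge states found in an iterable of MGF text lines."""
    counts = Counter()
    for line in mgf_lines:
        match = _CHARGE_RE.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


def fraction_above(counts, max_charge):
    """Fraction of charge-annotated spectra whose charge exceeds max_charge.

    max_charge is a user-chosen, hypothetical cutoff, not a Casanovo setting.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    high = sum(n for charge, n in counts.items() if charge > max_charge)
    return high / total


# Tiny in-memory example: one peptide-like spectrum and one highly charged one.
example = """BEGIN IONS
TITLE=a
PEPMASS=500.25
CHARGE=2+
100.0 1.0
END IONS
BEGIN IONS
TITLE=b
PEPMASS=1200.7
CHARGE=15+
100.0 1.0
END IONS
""".splitlines()

counts = charge_histogram(example)
high_fraction = fraction_above(counts, max_charge=4)
```

For a real file, passing an open file handle (`with open(path) as f: charge_histogram(f)`) works the same way, since the function only iterates over lines. A large fraction of highly charged precursors would be one indication that the data differ from what the model was trained on, in line with the answer above.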