I recently came across your article titled "Sequence-to-sequence translation from mass spectra to peptides with a transformer model". I am very interested in your algorithm and would like to run it on my own computer. However, I have encountered some issues, and I hope you can help me with them.
(1) Are download links available for the non-enzymatic peptide datasets from MassIVE-KB and PROSPECT mentioned in your article?
(2) I am planning to use Casanovo for de novo sequence prediction in top-down proteomics. However, I am unsure whether I need to retrain Casanovo. I have attached my top-down protein data. Could you please take a look?
The original MassIVE-KB and PROSPECT datasets are available online. Our extracted training data is currently not separately available, but you can derive it from these data sources.
The current Casanovo model will not be able to give relevant results for top-down data. It has not seen any similar data during training (e.g. high charge states or very complex spectra) and thus will not be able to handle such data during inference. To use Casanovo for top-down, you'd need to train it from scratch on a sufficiently large and high-quality dataset (1M+ spectra). However, given the complexity of the spectra and the massive search space of entire proteins, I'm not sure de novo sequencing for top-down data is even realistically possible.
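To get a rough sense of how far a dataset sits from typical bottom-up data, one option is to tally the precursor charge states in an MGF file. The following is only a minimal sketch: it assumes the spectra are stored as MGF text with standard `CHARGE=` lines, the helper names `charge_histogram` and `fraction_above` are hypothetical, and the cutoff of 4 is an arbitrary illustrative value, not a documented Casanovo limit.

```python
import re
from collections import Counter

# Matches MGF precursor charge lines such as "CHARGE=2+" (first integer only).
_CHARGE = re.compile(r"^CHARGE=(\d+)", re.IGNORECASE)


def charge_histogram(mgf_lines):
    """Count spectra per precursor charge state from the lines of an MGF file."""
    hist = Counter()
    for line in mgf_lines:
        match = _CHARGE.match(line.strip())
        if match:
            hist[int(match.group(1))] += 1
    return hist


def fraction_above(hist, max_charge):
    """Fraction of charge-annotated spectra with a charge above max_charge."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    high = sum(count for charge, count in hist.items() if charge > max_charge)
    return high / total


# Tiny in-memory MGF example with made-up values, for illustration only.
example_lines = """BEGIN IONS
TITLE=spectrum_1
PEPMASS=500.25
CHARGE=2+
100.0 10.0
END IONS
BEGIN IONS
TITLE=spectrum_2
PEPMASS=820.10
CHARGE=12+
200.0 5.0
END IONS
""".splitlines()

hist = charge_histogram(example_lines)
# Cutoff of 4 is an assumed, illustrative threshold.
print(dict(hist), fraction_above(hist, max_charge=4))
```

For a real file, the lines could be read with `open(path).read().splitlines()`. A large fraction of highly charged precursors would be one indication that the data differs substantially from what the model saw during training.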
N.B.: Copied from an email