Spectralis

Spectralis is a new method for de novo peptide sequencing that builds upon a new modeling task, bin reclassification. Bin reclassification uses amino acid-gapped convolutional layers to assign ion series to discretized m/z values, even where no peak is present.
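To make the notion of "discretized m/z values" concrete, here is a minimal sketch of how a peak list can be mapped onto fixed-width m/z bins. The function names and the bin width of 0.1 are illustrative assumptions, not the settings or code used by Spectralis.

```python
import math


def mz_to_bin(mz, bin_width=0.1):
    """Return the index of the fixed-width bin that contains `mz`.

    `bin_width` is an illustrative assumption, not a Spectralis setting.
    """
    if mz < 0 or bin_width <= 0:
        raise ValueError("mz must be non-negative and bin_width positive")
    return int(math.floor(mz / bin_width))


def occupied_bins(peak_mzs, bin_width=0.1):
    """Return the sorted, de-duplicated bin indices hit by a list of peak m/z values."""
    return sorted({mz_to_bin(mz, bin_width) for mz in peak_mzs})


# Two nearby peaks fall into the same bin; every bin not listed here is "empty".
print(occupied_bins([100.01, 100.04, 250.33]))
```

A model working on such a grid can assign a label to any bin, including bins that no observed peak falls into, which is what allows an ion series to be placed where a peak is missing.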

Spectralis can rescore any peptide-spectrum match (PSM) with the Spectralis-score. Rescoring can be used as a post-processing step for any existing de novo sequencing tool or to combine results from multiple de novo sequencing tools. Furthermore, Spectralis can fine-tune peptide-spectrum matches with an evolutionary algorithm (Spectralis-EA).
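As a rough illustration of combining results from several tools, the sketch below keeps one candidate per spectrum once every candidate has been rescored on a common scale. The record keys (`spectrum_id`, `tool`, `peptide`, `score`), the tool names, and the score values are hypothetical. Whether a higher or lower score is better is left as a parameter, because it depends on how the score is defined.

```python
def best_psm_per_spectrum(psms, higher_is_better=True):
    """Keep the best-scoring candidate for each spectrum.

    `psms` is an iterable of dicts with the hypothetical keys
    'spectrum_id', 'tool', 'peptide' and 'score'.
    """
    best = {}
    for psm in psms:
        current = best.get(psm["spectrum_id"])
        if current is None:
            best[psm["spectrum_id"]] = psm
            continue
        if higher_is_better:
            improved = psm["score"] > current["score"]
        else:
            improved = psm["score"] < current["score"]
        if improved:
            best[psm["spectrum_id"]] = psm
    return best


# Toy candidates from two hypothetical tools, already rescored on one scale.
candidates = [
    {"spectrum_id": "scan_1", "tool": "tool_a", "peptide": "PEPTIDEK", "score": -2.0},
    {"spectrum_id": "scan_1", "tool": "tool_b", "peptide": "PEPTLDEK", "score": -0.5},
    {"spectrum_id": "scan_2", "tool": "tool_a", "peptide": "SAMPLER", "score": -1.0},
]
best = best_psm_per_spectrum(candidates)
for spectrum_id, psm in sorted(best.items()):
    print(spectrum_id, psm["tool"], psm["peptide"], psm["score"])
```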

For more information see:

Installation

Prerequisites

Spectralis was trained and tested using Python 3.7 on a Linux system with a GPU. The list of required packages for running Spectralis can be found in the file requirements.txt.

Using pip and conda environments

We recommend installing and running Spectralis in a dedicated conda environment. To create and activate the conda environment, run the following commands:

conda create --name spectralis_env python=3.7
conda activate spectralis_env

More information on conda environments can be found in Conda's user guide.

Spectralis requires a PyTorch installation, as indicated in the requirements file requirements.txt. However, to run Spectralis on a GPU, we recommend installing PyTorch manually to ensure that the installation is compatible with your GPU. For this, check PyTorch's installation guide.
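After installing PyTorch, a quick check like the following can confirm whether it sees a GPU. This is a generic PyTorch check rather than part of Spectralis, and the helper name `describe_torch_device` is made up for this example.

```python
import importlib.util


def describe_torch_device():
    """Report whether PyTorch is installed and whether it can use a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch  # imported lazily so the check also works without PyTorch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        return f"PyTorch {torch.__version__} with GPU: {name}"
    return f"PyTorch {torch.__version__}, CPU only"


print(describe_torch_device())
```

If the output reports "CPU only" on a machine with a GPU, the installed PyTorch build probably does not match the local CUDA setup.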

To install Spectralis run the following command inside the root directory:

pip install .

Getting started

Trained models and example files can be found in the following Zenodo repository: zenodo.8393846.

Configuration

First, create a new configuration file or use the existing file spectralis_config.yaml, which contains the following features and settings:

Settings needed only for Spectralis-score and Spectralis-EA:

Settings needed only for Spectralis-EA:

The following setting should be changed only when training a bin reclassification model from scratch. Leave the settings unchanged when using the model from the Zenodo repository.

Input files

Running Spectralis

Spectralis can be run either from the command line or in a Python script.

Running Spectralis from the command line

Start by testing the Spectralis installation with:

spectralis --help

Spectralis-score

To obtain Spectralis-scores for PSMs in an .mgf file, run the following command, selecting the rescoring mode (--mode="rescoring"):

spectralis --mode="rescoring" --input_path="example_mgf/example.mgf" --output_path="output_spectralis_rescoring.csv" --config="spectralis_config.yaml"

The scores computed for the input file (--input_path="<file_name>.mgf") are stored in the specified output file (--output_path="<file_name>.csv"). If no configuration file is specified, the default file spectralis_config.yaml is used.

Spectralis-EA

Similarly, to fine-tune initial PSMs from an .mgf file with Spectralis-EA, run the following command, selecting the fine-tuning mode (--mode="ea"):

spectralis --mode="ea" --input_path="example_mgf/example.mgf" --output_path="output_spectralis_ea.csv" --config="spectralis_config.yaml"

The fine-tuned sequences, together with their Spectralis-scores, are stored in the specified output file (--output_path="<file_name>.csv").

Bin reclassification

To get predictions from the bin reclassification model for an input .mgf file, run the following command, selecting the bin reclassification mode (--mode="bin_reclassification"):

spectralis --config="spectralis_config.yaml" --mode="bin_reclassification" --input_path="example_mgf/example.mgf" --output_path="output_binreclass.hdf5"

This stores bin probabilities for singly-charged b and y ions with the corresponding m/z bins above the bin probability threshold, as well as the predicted changes and m/z bins for the input sequences in the specified .hdf5 file.
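The three command-line modes above share the same flags, so batch runs over several .mgf files can be scripted. The sketch below only assembles the argument lists. The helper name `build_spectralis_command` and the output naming scheme are assumptions made for illustration. The flags and output file extensions are the ones shown above.

```python
import shlex
from pathlib import Path

# Output file extension per mode, as used in the commands above.
OUTPUT_SUFFIX = {"rescoring": ".csv", "ea": ".csv", "bin_reclassification": ".hdf5"}


def build_spectralis_command(mode, mgf_path, out_dir, config="spectralis_config.yaml"):
    """Return the argument list for one Spectralis run (hypothetical helper)."""
    if mode not in OUTPUT_SUFFIX:
        raise ValueError(f"unknown mode: {mode!r}")
    mgf_path = Path(mgf_path)
    # Hypothetical naming scheme: <input stem>_<mode><suffix> inside out_dir.
    out_path = Path(out_dir) / f"{mgf_path.stem}_{mode}{OUTPUT_SUFFIX[mode]}"
    return [
        "spectralis",
        f"--mode={mode}",
        f"--input_path={mgf_path.as_posix()}",
        f"--output_path={out_path.as_posix()}",
        f"--config={config}",
    ]


for mode in ("rescoring", "ea", "bin_reclassification"):
    cmd = build_spectralis_command(mode, "example_mgf/example.mgf", "out")
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems when paths contain spaces.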

Running Spectralis in a Python script

Start by importing the package and creating a Spectralis object, which takes the configuration file as input:

from spectralis.spectralis_master import Spectralis
spectralis = Spectralis(config_path="spectralis_config.yaml")

Spectralis-score

To obtain Spectralis-scores for PSMs in an .mgf file, run the following command:

spectralis.rescoring_from_mgf(mgf_path="example_mgf/example.mgf", out_path="spectralis_example_out.csv")

The function returns a data frame with Spectralis-scores and spectrum identifiers. The scores can also be stored in the output file specified by the out_path argument.

Spectralis-EA

To fine-tune initial PSMs with Spectralis-EA from an .mgf file, run the following command:

spectralis.evo_algorithm_from_mgf(mgf_path="example_mgf/example.mgf", output_path="spectralis-ea_example_out.csv")

The function returns a data frame with the Spectralis-scores for initial and fine-tuned sequences for each spectrum identifier.

Bin reclassification

Similarly, to get predictions from the bin reclassification model for an input .mgf file, run the following command:

binreclass_out = spectralis.bin_reclassification_from_mgf(mgf_path="example_mgf/example.mgf", out_path="output_binreclass.hdf5")
y_probs, y_mz, b_probs, b_mz, y_changes, y_mz_inputs, b_mz_inputs = binreclass_out

The function returns bin probabilities for singly-charged b and y ions with the corresponding m/z bins above the bin probability threshold, as well as the predicted changes and m/z bins for the input sequences.

Retraining models

With Spectralis, you can train a random forest regressor or an XGBoost model from scratch to estimate the Levenshtein distance between an input peptide and the correct peptide. For this, you can use the following function:

spectralis.train_scorer_from_csvs(train_paths,          # path containing training data stored in csv file
                                  # Column names in the csv file containing peptide, precursor charge and m/z,
                                  #     experimental spectra and Levenshtein distances
                                  peptide_col, precursor_z_col, exp_mzs_col, exp_ints_col, precursor_mz_col, target_col,
                                  original_score_col,   # column in csv file with the original scores from the de novo sequencing tool. Default: None
                                  model_type,           # "xgboost" or "rf"
                                  model_out_path,       # path to store trained model
                                  features_out_dir,     # directory to store feature files
                                  csv_paths_eval        # path to evaluation data
                                  )
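The training target mentioned above, the Levenshtein distance between a candidate and the correct peptide, can be computed with the standard dynamic-programming algorithm. The sketch below is a generic implementation and not code from Spectralis. It assumes one character per residue. For modified residues, lists of residue tokens can be passed instead of strings.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# One substitution (I -> L) separates the two sequences.
print(levenshtein("PEPTIDE", "PEPTLDE"))  # 1
```

Values computed this way could fill the column passed as target_col, with the correct peptide taken from a reference identification of the same spectrum.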

Citation

If you use Spectralis, please cite the following:

References