Make a pull request to add your algorithm to the benchmarking system.

Add your algorithm in the `denovo_benchmarks/algorithms/algorithm_name` folder by providing the following files:
- `container.def`
- `make_predictions.sh`
- `input_mapper.py`
- `output_mapper.py`

Detailed descriptions of each file are given below. Templates for each file implementation can be found in the `algorithms/base/` folder, which also includes the `InputMapperBase` and `OutputMapperBase` base classes for implementing input and output mappers. For examples, see the Casanovo and DeepNovo implementations.
- `container.def` — definition file of the Apptainer container image that creates the environment and installs the dependencies required to run the algorithm.

- `make_predictions.sh` — bash script that runs the de novo algorithm on the input dataset (a folder with MS spectra in `.mgf` files) and generates an output file with per-spectrum peptide predictions.
  - Input: path to a dataset folder containing `.mgf` files with spectra data
  - Output: output file (in the common output format) containing predictions for all spectra in the dataset
  - To configure the model for specific data properties (e.g. non-tryptic data, data from a particular instrument, etc.), use dataset tags. The current set of tags is defined in `DatasetTag` in `dataset_config.py` and includes `nontryptic`, `timstof`, `waters`, and `sciex`. Example usage can be found in `algorithms/base/make_predictions_template.sh`.
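As a rough illustration of the structure described above, here is a minimal sketch of the logic a `make_predictions.sh` could follow, wrapped in a function so it can be exercised directly. The algorithm CLI (`my_algorithm`), the config file names, and the filename-based tag detection are all hypothetical — the real template in `algorithms/base/make_predictions_template.sh` is the authoritative reference.

```shell
#!/bin/bash
# Hypothetical sketch of make_predictions.sh logic (not the real template).
make_predictions() {
    local dataset_dir="$1"

    # Dataset tags (e.g. "nontryptic") can switch model configs; this
    # substring-based detection is illustrative only.
    local config="default.yaml"
    if [[ "$dataset_dir" == *nontryptic* ]]; then
        config="nontryptic.yaml"
    fi

    # Run the (hypothetical) algorithm on every spectrum file in the dataset.
    for mgf in "$dataset_dir"/*.mgf; do
        [ -e "$mgf" ] || continue
        echo "predicting: $mgf (config: $config)"
        # my_algorithm --config "$config" --input "$mgf"   # hypothetical CLI
    done

    # Convert the algorithm output to the common format (hypothetical call):
    # python output_mapper.py --output_path outputs.csv
}
```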
- `input_mapper.py` — python script to convert input data from its original representation (input format) to the format expected by the algorithm.
  - Input format: `.mgf` files with spectrum parameters `[TITLE, RTINSECONDS, PEPMASS, CHARGE]`
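The kind of transformation an input mapper performs can be sketched as follows. The real implementations subclass `InputMapperBase` from `algorithms/base/`, whose exact API is not shown here; this standalone function only illustrates reading the `TITLE`, `RTINSECONDS`, `PEPMASS`, and `CHARGE` header fields from one `.mgf` spectrum block so they could be rewritten into whatever keys an algorithm expects.

```python
# Illustrative only: the actual mappers subclass InputMapperBase.
def parse_mgf_params(block: str) -> dict:
    """Extract KEY=VALUE header lines from one BEGIN IONS...END IONS block."""
    params = {}
    for line in block.strip().splitlines():
        # Header lines look like KEY=VALUE; peak lines start with a digit.
        if "=" in line and not line[0].isdigit():
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    return params

# Example spectrum block in the benchmark's input format.
spectrum = """BEGIN IONS
TITLE=151009_exo3_1.0
RTINSECONDS=123.4
PEPMASS=435.21
CHARGE=2+
100.1 250.0
END IONS"""

params = parse_mgf_params(spectrum)
print(params["TITLE"], params["CHARGE"])
```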
- `output_mapper.py` — python script to convert the algorithm output to the common output format.
  - Output format: a `.csv` file (with `sep=","`) that must contain the columns:
    - `"sequence"` — predicted peptide sequence, written in the predefined output sequence format
    - `"score"` — de novo algorithm "confidence" score for a predicted sequence
    - `"aa_scores"` — per-amino-acid scores, if available. If not available, the whole peptide score will be used as the score for each amino acid.
    - `"spectrum_id"` — information to match each prediction with its ground truth sequence: a `{filename}:{index}` string, where `filename` is the name of the `.mgf` file in a dataset and `index` is the index (0-based) of each spectrum in an `.mgf` file.
  - Output sequence format:
    - the 20 amino acid letters: `G, A, S, P, V, T, C, L, I, N, D, Q, K, E, M, H, F, R, Y, W`
    - modifications annotated with UNIMOD accession codes: `C[UNIMOD:4]` for cysteine carbamidomethylation, `M[UNIMOD:35]` for methionine oxidation, etc.
    - terminal modifications written as `[UNIMOD:xx]-PEPTIDE-[UNIMOD:yy]`
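The common output format described above can be sketched with the standard library `csv` module. This is an illustration, not the repository's `OutputMapperBase` implementation: the helper name, the input tuple layout, and the comma-joined `aa_scores` encoding are assumptions, and using `len(sequence)` as the residue count is a crude estimate that ignores modification annotations.

```python
import csv
import io

# Hypothetical sketch of producing the common output format.
def to_common_format(predictions, out):
    """predictions: iterable of (filename, index, sequence, score, aa_scores or None)."""
    writer = csv.DictWriter(
        out, fieldnames=["sequence", "score", "aa_scores", "spectrum_id"]
    )
    writer.writeheader()
    for filename, index, sequence, score, aa_scores in predictions:
        if aa_scores is None:
            # No per-amino-acid scores: reuse the whole peptide score for
            # each amino acid, as the benchmark requires. Crude length
            # estimate that assumes an unmodified sequence.
            aa_scores = [score] * len(sequence)
        writer.writerow({
            "sequence": sequence,
            "score": score,
            "aa_scores": ",".join(str(s) for s in aa_scores),
            "spectrum_id": f"{filename}:{index}",  # {filename}:{index}
        })

buf = io.StringIO()
to_common_format([("151009_exo3_1.mgf", 0, "PEPTIDEK", 0.93, None)], buf)
print(buf.getvalue())
```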
To run the benchmark locally:

Clone the repository:

```shell
git clone https://github.com/PominovaMS/denovo_benchmarks.git
cd denovo_benchmarks
```

Build containers for algorithms and evaluation: to build all Apptainer images, make sure you have Apptainer installed. Then run:

```shell
chmod +x build_apptainer_images.sh
./build_apptainer_images.sh
```

This will build the Apptainer images for all algorithms and the evaluation Apptainer image. If an Apptainer image already exists, the script will ask whether you want to rebuild it:

```
A .sif image for casanovo already exists. Force rebuild? (y/N)
```

If a container is missing, that algorithm will be skipped during benchmarking. We don't share or store containers publicly yet due to ongoing development and their large size.
Configure paths: to configure the project environment for running the benchmark locally, make a copy of the `.env.template` file and rename it to `.env`. This file contains the environment variables the project needs to run properly. After renaming the file, update the file paths within `.env` to reflect the correct locations on your system.
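For orientation, a filled-in `.env` might look like the fragment below. The variable names here are purely illustrative — use the ones actually defined in `.env.template`, and replace the paths with real locations on your machine.

```shell
# Hypothetical .env contents -- real variable names come from .env.template.
DATA_DIR=/home/user/denovo_benchmarks/datasets
RESULTS_DIR=/home/user/denovo_benchmarks/results
```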
Run benchmark on a dataset: make sure the required packages are installed:

```shell
sudo apt install squashfuse gocryptfs fuse-overlayfs
```

Run the benchmark:

```shell
./run.sh /path/to/dataset/dir
```

Example:

```shell
./run.sh sample_data/9_species_human
```
The benchmark expects input data to follow a specific folder structure:
- spectra in `.mgf` files inside the `mgf/` subfolder of each dataset folder
- a `labels.csv` file within each dataset folder

Below is an example layout for our evaluation datasets stored on the HPC:
```
datasets/
    9_species_human/
        labels.csv
        mgf/
            151009_exo3_1.mgf
            151009_exo3_2.mgf
            151009_exo3_3.mgf
            ...
    9_species_solanum_lycopersicum/
        labels.csv
        mgf/...
    9_species_mus_musculus/
        labels.csv
        mgf/...
    9_species_methanosarcina_mazei/
        labels.csv
        mgf/...
    ...
```
Note that algorithm containers receive as input only the `mgf/` subfolder with spectra files and do not have access to the `labels.csv` file. Only the evaluation container accesses the `labels.csv` file to evaluate algorithm predictions.
To view the Streamlit dashboard for the benchmark locally, run:

```shell
# If Streamlit is not installed
pip install streamlit

streamlit run dashboard.py
```

The dashboard reads the benchmark results stored in the `results/` folder.