PominovaMS / denovo_benchmarks


Benchmarking de novo peptide sequencing algorithms

Adding a new algorithm

Make a pull request to add your algorithm to the benchmarking system.

Add your algorithm in the denovo_benchmarks/algorithms/algorithm_name folder by providing the
container.def, make_predictions.sh, input_mapper.py, and output_mapper.py files.
Detailed descriptions of each file are given below.

Templates for each file implementation can be found in the algorithms/base/ folder.
It also includes the InputMapperBase and OutputMapperBase base classes for implementing input and output mappers.
For examples, you can check the Casanovo and DeepNovo implementations.
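
As an illustration, a minimal output mapper could subclass OutputMapperBase roughly as sketched below. The import path and the format_sequence method name are assumptions made for this example; consult the templates in algorithms/base/ and the Casanovo implementation for the actual interface.

    import re

    # Illustrative import; the real base classes live in algorithms/base/.
    from base import OutputMapperBase


    class OutputMapper(OutputMapperBase):
        """Hypothetical sketch: convert a tool's peptide output to the benchmark format."""

        def format_sequence(self, sequence: str) -> str:
            # Example transformation: strip flanking residues such as "K.PEPTIDE.R",
            # leaving the bare peptide string for evaluation.
            return re.sub(r"^[A-Z-]\.|\.[A-Z-]$", "", sequence)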

Running the benchmark

To run the benchmark locally:

  1. Clone the repository:

    git clone https://github.com/PominovaMS/denovo_benchmarks.git
    cd denovo_benchmarks
  2. Build containers for algorithms and evaluation: To build all Apptainer images, make sure you have Apptainer installed. Then run:

    chmod +x build_apptainer_images.sh
    ./build_apptainer_images.sh

    This will build the Apptainer images for all algorithms, as well as the evaluation image.

    If an apptainer image already exists, the script will ask if you want to rebuild it.

    A .sif image for casanovo already exists. Force rebuild? (y/N) 

    If a container is missing, that algorithm will be skipped during benchmarking. We don't share or store containers publicly yet due to ongoing development and their large size.
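
    If you only need to (re)build the image for a single algorithm, you can also invoke Apptainer directly on that algorithm's definition file. The image name and output location below are illustrative assumptions; build_apptainer_images.sh may use different ones:

    apptainer build algorithms/casanovo/casanovo.sif algorithms/casanovo/container.def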

  3. Configure paths: To configure the project environment for running the benchmark locally, make a copy of the .env.template file and rename it to .env. This file contains the environment variables the project needs to run properly.

    After renaming the file, update the paths in .env to point to the correct locations on your system.
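
    For example (the variable names below are illustrative; use the ones actually defined in .env.template):

    cp .env.template .env
    # then edit .env so that entries like the following point to your local paths
    DATA_DIR=/path/to/datasets
    RESULTS_DIR=/path/to/results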

  4. Run the benchmark on a dataset: Make sure the required system packages are installed:

    sudo apt install squashfuse gocryptfs fuse-overlayfs  

    Run the benchmark:

    ./run.sh /path/to/dataset/dir

    Example:

    ./run.sh sample_data/9_species_human

Input data structure

The benchmark expects input data to follow a specific folder structure.

Below is an example layout for our evaluation datasets stored on the HPC:

datasets/
    9_species_human/
        labels.csv
        mgf/
            151009_exo3_1.mgf
            151009_exo3_2.mgf
            151009_exo3_3.mgf
            ...
    9_species_solanum_lycopersicum/
        labels.csv
        mgf/...
    9_species_mus_musculus/
        labels.csv
        mgf/...
    9_species_methanosarcina_mazei/
        labels.csv
        mgf/...
    ...

Note that algorithm containers receive as input only the mgf/ subfolder with spectra files and do not have access to the labels.csv file. Only the evaluation container reads labels.csv to evaluate algorithm predictions.

Running the Streamlit dashboard locally

To view the Streamlit dashboard for the benchmark locally, run:

# If Streamlit is not installed
pip install streamlit

streamlit run dashboard.py

The dashboard reads the benchmark results stored in the results/ folder.