
Tandem Mass Spectrum Prediction with Graph Transformers
BSD 2-Clause "Simplified" License

MassFormer

This is the original implementation of MassFormer, a graph transformer for small molecule MS/MS prediction. Check out the preprint on arXiv and the full paper in Nature Machine Intelligence.

System Requirements

This software requires a Unix-like operating system (we tested on Ubuntu 18.04 Linux).

For fast training/evaluation, a CUDA 11.3-compatible GPU with at least 12GB of VRAM is recommended.

We tested on an Intel Xeon Silver 4110 system with an Nvidia Tesla T4 (16GB VRAM) or Nvidia Quadro RTX 6000 (24GB VRAM) and 64GB system RAM.

Setting up the Environment

We recommend using conda to create an environment, then installing the packages using pip. The massformer code is configured to work with CUDA 11.3. Other versions of CUDA will likely work, but the install will require some modification. We also provide instructions for setting up the code to run on CPU only.

The total install should take no more than a few minutes.

GPU Environment

Enter the root directory and run the following commands:

cd massformer
conda create -n MF-GPU -y
conda activate MF-GPU
conda install python=3.8 -y

Then, install the dependencies using pip:

pip install -r env/requirements-gpu.txt --extra-index-url https://download.pytorch.org/whl/cu113 -f https://data.pyg.org/whl/torch-1.12.1+cu113.html -f https://data.dgl.ai/wheels/repo.html

Finally, install massformer itself using pip:

pip install -I -e .

CPU Environment

Enter the root directory and run the following commands:

cd massformer
conda create -n MF-CPU -y
conda activate MF-CPU
conda install python=3.8 -y

Then, install the dependencies using pip:

pip install -r env/requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu -f https://data.pyg.org/whl/torch-1.12.1+cpu.html

Finally, install massformer itself using pip:

pip install -I -e .

Downloading Everything

To download everything all at once, run the following script:

bash download_scripts/download_public.sh

The total size of the download is ~2.5GB (~17GB after decompression). If you run this step, you can safely skip all of the later steps whose headings start with "Downloading".

Otherwise, you may simply download files incrementally throughout the setup process.

Downloading Demo Data and Model Checkpoint

To download the data and model checkpoints, run the following command:

bash download_scripts/download_demo.sh

Running the Demo

The parameters for a pretrained MassFormer model (trained on the MoNA dataset, using an InChIKey split, using this config file) are located in checkpoints/demo.pkl. We provide a script that loads these parameters and makes predictions on a heldout subset of the MoNA dataset.

GPU Demo

To run the GPU demo, use the following command:

python scripts/run_train_eval.py -c config/demo/demo_eval.yml -w off -d 0

It should take 1-2 minutes to run.

CPU Demo

To run the CPU demo, use the following command:

python scripts/run_train_eval.py -c config/demo/demo_eval.yml -w off -d -1

It should take 5-10 minutes to run (predictions are significantly slower without a GPU).

Expected Output

The script should produce the following output:

>>> mb_na
> num_spec = 13225, num_mol = 1376, num_group = 1424, num_ce_per_group = 9.287219101123595
>>> mol_embed_params = 48267329, mlp_params = 7800000, total_params = 56067329
> primary
splits: train, val, test, total
spec: 9543, 1097, 2585, 13225
mol: 964, 138, 274, 1376
>>> no checkpoint detected
>> test
> test: 100%|██████████| 104/104 [01:17<00:00,  1.34it/s]
> unmerged metrics: 100%|██████████| 1/1 [00:00<00:00, 116.11it/s]
> merged metrics: 100%|██████████| 1/1 [00:00<00:00, 15.67it/s]
> test, mol_loss_obj_mean = 0.4836

mol_loss_obj_mean is the loss averaged over molecules (instead of individual spectra) on a heldout portion of the MoNA dataset. See the config, the loss definitions, and the runner file for more detailed information about metrics.
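To illustrate the distinction between spectrum-level and molecule-level averaging, here is a minimal sketch (the function names and the plain-Python implementation are illustrative, not the MassFormer codebase's actual loss code): each molecule's spectra are averaged first, so molecules with many spectra do not dominate the metric.

```python
def cosine_loss(pred, target):
    """1 - cosine similarity between two intensity vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = sum(p * p for p in pred) ** 0.5
    norm_t = sum(t * t for t in target) ** 0.5
    return 1.0 - dot / (norm_p * norm_t)

def mol_averaged_loss(losses, mol_ids):
    """Average per-spectrum losses within each molecule first,
    then average across molecules."""
    per_mol = {}
    for loss, mol in zip(losses, mol_ids):
        per_mol.setdefault(mol, []).append(loss)
    return sum(sum(v) / len(v) for v in per_mol.values()) / len(per_mol)
```

For example, losses `[0.2, 0.4, 0.6]` over molecules `["a", "a", "b"]` give a molecule-averaged loss of 0.45 rather than the spectrum-averaged 0.4.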

Note: none of the steps below are required to run the demo, but are helpful for reproducing results from the paper.

Downloading the Raw Spectrum Data

Run the following scripts to download public spectral data: MoNA, CASMI 2016/2022. These scripts will add files to the data/raw directory.

bash download_scripts/download_mona_raw.sh
bash download_scripts/download_casmi_raw.sh

The NIST 2020 MS/MS Library is not available for download directly, but can be purchased from an authorized distributor and exported using the instructions below.

Exporting the NIST Data

Note: this step requires a Windows System or Virtual Machine.

Note: these instructions are for NIST 2020; the NIST 2023 MS/MS Library does not support export with lib2nist.

The spectra and associated compounds can be exported to MSP/MOL format using the free lib2nist software. The resulting export will contain a single MSP file with all of the mass spectra, and multiple MOL files which include the molecular structure information (linked to the spectra by ID). The screenshot below indicates appropriate lib2nist export settings.

[Screenshot: lib2nist export settings]

After exporting the files, create a directory "nist_20" in data/raw on your primary computer and save them there. If done correctly, inside "nist_20" there should be a single .MSP file with all the spectra, data/raw/nist_20/hr_nist_msms.MSP, and a directory of .MOL files, data/raw/nist_20/hr_nist_msms.MOL.

There is a minor bug in the lib2nist export software that sometimes results in errors when parsing the MOL files. To fix this bug, run the following script:

python mol_fix.py --overwrite=True

This will edit the files in data/raw/nist_20/hr_nist_msms.MOL so that they can be parsed correctly.
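If you want to sanity-check the export before preprocessing, a quick scan (a hypothetical helper, not part of the repo) can flag MOL files that are missing the `M  END` terminator required by the V2000 connection-table format — a typical symptom of a broken export:

```python
from pathlib import Path

def find_truncated_mol_files(mol_dir):
    """Return the names of .MOL files in mol_dir whose connection
    table lacks the 'M  END' terminator."""
    bad = []
    for path in Path(mol_dir).glob("*.MOL"):
        text = path.read_text(errors="replace")
        if "M  END" not in text:
            bad.append(path.name)
    return sorted(bad)
```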

Preprocessing the Raw Spectrum Datasets

Core Spectrum Datasets

To parse the core spectrum datasets, NIST and MoNA, run parse_and_export.py using the following arguments:

NIST:

python preproc_scripts/parse_and_export.py --msp_file nist_20/hr_nist_msms.MSP --mol_dir nist_20/hr_nist_msms.MOL --output_name nist_df

MoNA:

python preproc_scripts/parse_and_export.py --msp_file mb_na_msms.msp --output_name mb_na_df

These scripts will create pandas dataframes (in json format) and save them in data/df. Then, run prepare_data.py to convert these json dataframes into pickle dataframes that are used for model training and inference:

python preproc_scripts/prepare_data.py

This will aggregate the NIST and MoNA data, and produce two pickle dataframes in data/proc: spec_df.pkl, which contains the spectrum information, and mol_df.pkl, which contains the structure information.
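Once generated, the two pickle dataframes can be inspected with pandas; this small loader is a sketch (the helper name is ours, and the exact columns depend on the preprocessing version):

```python
from pathlib import Path

import pandas as pd

def load_proc_tables(proc_dir="data/proc"):
    """Load the spectrum and structure dataframes produced by
    prepare_data.py."""
    proc = Path(proc_dir)
    spec_df = pd.read_pickle(proc / "spec_df.pkl")
    mol_df = pd.read_pickle(proc / "mol_df.pkl")
    return spec_df, mol_df
```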

Both of these scripts use multiprocessing to speed up the computations; if your machine does not have many cores, they may be quite slow.

Spectrum Identification Datasets

To parse and preprocess the datasets for the spectrum identification tasks, run the following scripts:

CASMI 2016

python preproc_scripts/prepare_casmi_data.py

CASMI 2022

python preproc_scripts/prepare_casmi22_data.py

NIST20 Outlier (formerly pCASMI)

python preproc_scripts/prepare_pcasmi_data.py

By default, each of these scripts will save the preprocessed data in data/proc.

Note: the preprocessed data for the spectrum identification tasks (CASMI 2016, CASMI 2022, NIST20 Outlier) are also available for download; see the section below.

Downloading the Preprocessed Spectrum Identification Datasets

To obtain the preprocessed data that were used for spectrum identification in the paper, run the corresponding scripts for CASMI 2016, CASMI 2022 and NIST Outlier. These scripts will add directories to the data/proc directory.

bash download_scripts/download_casmi_2016_proc.sh
bash download_scripts/download_casmi_2022_proc.sh
bash download_scripts/download_nist20_outlier_proc.sh

Downloading the CFM Predictions

To reproduce the CFM evaluation (see Training and Evaluating a Model), the CFM predictions for the NIST/MoNA compounds must first be downloaded using the CFM download script:

bash download_scripts/download_cfm_pred.sh

Setting Up Weights and Biases

Our implementation uses Weights and Biases (W&B) for logging, checkpoint management, and visualization. For full functionality, you must set up a free W&B account.

Once you set up an account and a new project, you can edit the entity_name and project_name fields in the template config to apply them globally to all configs in the repo.

Training and Evaluating a Model

The model configs for the experiments are stored in the config directory. They are organized as follows:

Configurations exist for MassFormer (MF) and the baseline methods (FP, WLN, and CFM).

To train and evaluate a model, simply choose a configuration and pass it to the run_train_eval.py script. For example, to run the mona_scaffold_all_MF experiment (train MassFormer on NIST data, and evaluate on MoNA using a scaffold split), you can use the following command:

python scripts/run_train_eval.py -c config/all_prec_type/mona_scaffold_all_MF.yml -w online

The -w argument controls the wandb logging (online, offline, or off). Note that training a model without a GPU will be very time-consuming.

The configurations in casmi perform spectrum identification evaluations. The other configurations only record similarity on a held-out test set (NIST or MoNA, depending on the configuration).
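Conceptually, spectrum identification ranks candidate structures by the similarity between their predicted spectra and the query spectrum; the sketch below (illustrative names, not the repo's API) shows the idea with cosine similarity:

```python
def cosine_sim(a, b):
    """Cosine similarity between two binned intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query_spec, candidate_specs):
    """Return candidate ids sorted by descending similarity to the
    query spectrum; the true structure should rank near the top."""
    scored = [(cosine_sim(query_spec, spec), cid)
              for cid, spec in candidate_specs.items()]
    return [cid for _, cid in sorted(scored, reverse=True)]
```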

Evaluating a Pretrained Model

Assuming you have downloaded the checkpoints, you can evaluate a pretrained model using the run_train_eval.py script. However, you need to modify a few things in the config file first:

Continuing the example from the Training and Evaluating a Model section, if you want to evaluate a pretrained mona_scaffold_all_MF model using the mona_scaffold_all_MF_0.pkl checkpoint, you can modify the config to look like the following:

run_name:
data:
  num_entries: -1
  primary_dset: ["nist"]
  secondary_dset: ["mb_na"]
  ce_key: "nce"
  inst_type: ["FT"]
  frag_mode: ["HCD"]
  pos_prec_type: ['[M+H]+', '[M+H-H2O]+', '[M+H-2H2O]+', '[M+2H]2+', '[M+H-NH3]+', "[M+Na]+"]
  fp_types:   ["morgan","rdkit","maccs"]
  preproc_ce: "normalize"
  spectrum_normalization: "l1"
  gf_algos_v: "algos2"
model:
  embed_types: ["gf_v2"]
  ff_layer_type: "neims"
  ff_h_dim: 1000
  ff_num_layers: 4
  ff_skip: True
  dropout: 0.4
  output_normalization: "l1"
  bidirectional_prediction: True
  gf_model_name: "graphormer_base"
  gf_pretrain_name: "pcqm4mv2_graphormer_base"
  fix_num_pt_layers: 0
  reinit_num_pt_layers: -1
  reinit_layernorm: True
  model_seed: 6666
  checkpoint_name: mona_scaffold_all_MF_0 ### CHANGED
run:
  loss: "cos"
  sim: "cos"
  lr: 0.001
  batch_size: 100
  clip_grad_norm: 5.0
  scheduler: "polynomial"
  scheduler_peak_lr: 0.0002
  weight_decay: 0.001
  pt_weight_decay: 0.0
  dp: False
  dp_num_gpus: 0
  flag: True
  num_epochs: 0 ### CHANGED
  num_workers: 8
  cuda_deterministic: False
  do_test: True
  do_casmi: False
  do_pcasmi: False
  do_casmi22: False
  save_state: True
  save_media: True
  log_auxiliary: True
  train_seed: 5585
  split_seed: 420
  split_key: "scaffold"
  log_tqdm: False
  grad_acc_interval: 2 # for 24GB
  test_sets: ["train","val","mb_na"]
  save_test_sims: True
  test_frac: 0.

This modified example config is saved here. Then, you can run the following command:

python scripts/run_train_eval.py -c config/checkpoint_example/mona_scaffold_all_MF_chkpt.yml -w online

Using a Pretrained Model to Make Predictions

The inference script allows a pretrained model to make predictions on a provided set of compounds:

python scripts/run_inference.py -c config/demo/demo_eval.yml -s predictions/example_smiles.csv -o predictions/example_predictions.csv -d 0

Note: if you are using the MF-CPU environment, replace the -d 0 argument with -d -1.

The smiles file (-s argument, see this file for an example) is a csv file with two columns: the first column is a molecule id, and the second column is the SMILES string.
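An input file with that two-column layout can be generated with the standard csv module; the ids and SMILES below are just placeholders (check the repo's example file for whether a header row is expected):

```python
import csv

# (id, SMILES) pairs -- placeholder values for illustration
molecules = [
    ("mol_001", "CCO"),                     # ethanol
    ("mol_002", "CC(=O)Oc1ccccc1C(=O)O"),   # aspirin
]

with open("my_smiles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for mol_id, smiles in molecules:
        writer.writerow([mol_id, smiles])
```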

The output file (-o argument, see this file for an example) is a csv file with 11 columns, including the predicted spectra and various metadata. Any SMILES strings that cannot be parsed correctly, or are not supported by the model, will be omitted from the output.

The precursor adducts and normalized collision energies can be controlled via command-line arguments --prec_types and --nces (respectively). Intensity transformations and normalizations can also be configured with the --transform and --normalization arguments.
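As background on the --nces argument (this is general mass-spectrometry convention, not code from this repository): on Thermo instruments a normalized collision energy is commonly related to an absolute collision energy by scaling with the precursor m/z and a charge-dependent factor. A rough sketch, with the usual assumed charge factors:

```python
# Commonly cited Thermo charge factors -- treat these values as an
# assumption, not something defined by MassFormer.
CHARGE_FACTORS = {1: 1.0, 2: 0.9, 3: 0.85}

def nce_to_ev(nce, precursor_mz, charge=1):
    """Approximate absolute collision energy (eV) from a normalized
    collision energy (NCE), per the usual Thermo convention."""
    return nce * (precursor_mz / 500.0) * CHARGE_FACTORS.get(charge, 0.85)
```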

Checkpoints for Models from the Manuscript

All models in the manuscript (except CFM) are trained on NIST 2020 MS/MS Library. NIST does not support redistribution of parameters for models trained on this library. As such, we cannot provide model checkpoints.

However, it should be possible to reproduce our models by following the instructions to export and preprocess the data, and training a model using the training script with the appropriate config.

If you are having trouble, feel free to create a GitHub issue or send an email to ayoung [AT] cs [DOT] toronto [DOT] edu.

Note: there are no CFM checkpoints, since we use pre-computed CFM predictions (see previous section) in our experiments. If you want to use a pretrained CFM model on your own data, visit their website.

Training a Model on All of the Data

It may be useful to get a version of MassFormer that is trained on all available data (i.e. both NIST 2020 and MoNA spectra). We have provided config files for such models in the train_both subdirectory. There are four different configs: two for training with all supported precursor adducts (1.0 Da bin size, 0.1 Da bin size) and two for training with [M+H]+ precursor adducts (1.0 Da bin size, 0.1 Da bin size).