This Python code ranks neoantigens according to the probability that they are recognized by CD8 T cells. Large data matrices, consisting of thousands of neoantigens from 131 cancer patients annotated with several feature scores, are used to train machine learning classifiers that rank the neoantigens in a test set. Here we provide the Python code and shell scripts that preprocess these data matrices, perform classifier training and testing, and plot the figures of our paper [1].
Install Python (we used version 3.8.10) with the dependencies listed in the requirements.txt file:
pip install -r requirements.txt
Edit the configure.sh file and set the environment variables NEORANKING_RESOURCE for the data directory and NEORANKING_CODE for the code directory. Source the configure.sh file:
source configure.sh
This will create the data and code directories and their subdirectories if they do not yet exist. Download the Python code from this GitHub repository (https://github.com/bassanilab/NeoRanking.git) and place it into the $NEORANKING_CODE directory. Download the data matrices from the links indicated here or in [1] and place the files Mutation_data_org.txt and Neopep_data_org.txt into the $NEORANKING_RESOURCE/data directory, and HLA_allotypes.txt into the $NEORANKING_RESOURCE/hla directory.
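Before running the pipeline, it can help to confirm that the input files are where the scripts expect them. The following is a minimal sketch, not part of the repository: the function name `missing_input_files` and the constant `EXPECTED_FILES` are hypothetical, and only the file locations named above are checked.

```python
import os
from pathlib import Path

# Relative locations taken from the setup instructions above.
EXPECTED_FILES = [
    Path("data") / "Mutation_data_org.txt",
    Path("data") / "Neopep_data_org.txt",
    Path("hla") / "HLA_allotypes.txt",
]


def missing_input_files(resource_dir):
    """Return the expected input files that are absent under resource_dir."""
    root = Path(resource_dir)
    return [str(rel) for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # NEORANKING_RESOURCE is the variable set in configure.sh.
    resource = os.environ.get("NEORANKING_RESOURCE")
    if resource is None:
        print("NEORANKING_RESOURCE is not set; run 'source configure.sh' first.")
    else:
        missing = missing_input_files(resource)
        print("All input files found." if not missing else f"Missing: {missing}")
```

Running the script after `source configure.sh` either confirms that all three files are present or lists the ones that still need to be downloaded.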
If you wish to recreate the plots for Figures 1B and S2A-C in the paper, you need to download the MmpsTestingSet.txt, MmpsTrainingSet.txt, NmersTestingSet.txt, and NmersTrainingSet.txt files from the figshare links provided by Gartner et al. [2]. These files contain mutations (nmers) and neo-peptides (mmps) together with feature scores and immunogenicity screening annotations used by Gartner et al. If you wish to recreate Figures 3D-F, you need to download the file mmc5.xlsx from the Supplemental Data in Wells et al. [3].
1) Preprocess the original data matrices Mutation_data_org.txt and Neopep_data_org.txt. This step is required and needs to be performed only once, at the start of the analysis. Preprocessing consists of several steps: a) Select the SNV mutations. b) Calculate numerical encoding values for categorical features. c) Impute missing values. d) Transform values of numerical features by quantile normalization. e) Replace categories by encoded numerical values.
bash preprocess_data.sh
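Steps b) to e) above can be illustrated with a small sketch that uses only the standard library. It is not the repository's implementation: the encoding rule (fraction of immunogenic examples per category), the median imputation, and the rank-based quantile transform are assumptions chosen for illustration, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def encode_categories(values, labels):
    """Map each category to the fraction of positive labels seen for it.

    This particular encoding rule is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [positives, total]
    for value, label in zip(values, labels):
        counts[value][0] += int(label)
        counts[value][1] += 1
    return {cat: pos / total for cat, (pos, total) in counts.items()}


def impute_median(values):
    """Replace None entries by the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def quantile_transform(values):
    """Map values into (0, 1) by rank; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n + 1) for r in ranks]


# Toy example: one categorical and one numerical feature.
encoding = encode_categories(["a", "a", "b"], [1, 0, 0])
encoded_column = [encoding[c] for c in ["a", "a", "b"]]
normalized = quantile_transform(impute_median([10.0, None, 20.0]))
```

In this sketch the encoding is computed first and applied last, mirroring the order of steps b) and e), and the numerical column is imputed before it is quantile-transformed.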
2) Train the classifiers. This step is only required if training needs to be done on different data or repeated with different parameters. Otherwise, classifier models for logistic regression and XGBoost trained on NCI-train [1] can be obtained from figshare for neo-peptides and mutations. To run the training:
bash train_classifier.sh
3) Test the classifiers:
bash test_classifier.sh
Precalculated [classifier result files](https://figshare.com/s/9fc6c11691273efe995e) used in [[1](#Citation)] can be downloaded from figshare. Place them in the `classifier_results` directory and its respective subdirectories.
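Conceptually, testing amounts to ranking the neoantigens of a patient by their predicted score and checking where the immunogenic ones land. The sketch below illustrates this idea only. The scores and labels are made up, the function names are hypothetical, and counting immunogenic neoantigens among the top k is an assumed example metric, not necessarily the one computed by the test script.

```python
def rank_neoantigens(scores):
    """Return indices sorted from highest to lowest predicted score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def immunogenic_in_top_k(scores, labels, k):
    """Count immunogenic (label 1) neoantigens among the k top-ranked ones."""
    return sum(labels[i] for i in rank_neoantigens(scores)[:k])


# Hypothetical classifier scores and screening labels for one patient.
scores = [0.10, 0.85, 0.40, 0.92, 0.05]
labels = [0, 1, 0, 0, 1]

print(rank_neoantigens(scores))                   # [3, 1, 2, 0, 4]
print(immunogenic_in_top_k(scores, labels, k=2))  # 1
```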
4) Plot figure X:
bash plot_figure_X.sh
The plots in Figures 3 and 4, and Suppl. Figure 4 require classifier result files (see above). If you want to reproduce the figures from the paper [[1](#Citation)] based on the results presented there, you can run the scripts as they are. If you prefer to train your own classifiers and plot the figures based on these results, please adapt the corresponding paths and regular expressions in the scripts. Some plots may look slightly different from the ones in the paper (very small p-values and Shapley values in particular are subject to variation). If you retrain the classifiers, there will also be differences due to the random sampling of non-immunogenic neo-peptides and the stochastic hyperopt parameter optimization.
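To see why random sampling of non-immunogenic neo-peptides changes results between runs, consider the following sketch. It is an illustration under stated assumptions, not the sampling code of this repository: the function `sample_negatives` and its parameters are hypothetical. It keeps all immunogenic rows and draws a random subset of the others, so the selection is repeatable only when the seed is fixed.

```python
import random


def sample_negatives(labels, n_negatives, seed):
    """Keep all immunogenic rows and a seeded random subset of the others.

    Returns the sorted row indices that would be used for training.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)  # local generator, leaves global state untouched
    kept = rng.sample(negatives, min(n_negatives, len(negatives)))
    return sorted(positives + kept)


labels = [1, 0, 0, 0, 1, 0]
first = sample_negatives(labels, n_negatives=2, seed=0)
second = sample_negatives(labels, n_negatives=2, seed=0)
print(first == second)  # True: the same seed gives the same subset
```

With a different seed, a different subset of non-immunogenic rows may be drawn, which is one source of the run-to-run differences mentioned above.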
Copyright (C) LICR - Ludwig Institute of Cancer Research, Lausanne, Switzerland
For questions regarding code and machine learning methods, please contact Markus Müller (markus.muller@chuv.ch)
For any other questions, please contact Michal Bassani-Sternberg (michal.bassani@chuv.ch)