the origin of the Expression column in the output when not specifying RNA-seq data

MathOnco / NeoPredPipe

Neoantigens prediction pipeline for multi- or single-region vcf files using ANNOVAR and netMHCpan.

GNU Lesser General Public License v3.0


the origin of the Expression column in the output when not specifying RNA-seq data #36

Closed fangfang0906 closed 9 months ago

fangfang0906 commented 1 year ago

Hi, I wanted to express my appreciation for your exceptional work in developing the impressive tool NeoPredPipe. I have a question regarding its functionality: What is the source of the expression column when RNA-seq data is not specified? Is it inferred from WES? Thanks!

elakatos commented 1 year ago

Hi! Thank you very much. The expression column only appears if RNA-seq data is supplied (using the -x option). Otherwise the column is not added to the output, as expression cannot be reliably inferred from DNA-seq alone. If you want expression information but don't have RNA-seq for your own sample, you could still use known cancer gene expression patterns (e.g. from TCGA), but you would need to pre-process these yourself and supply them with -x.

Eszter

fangfang0906 commented 1 year ago

Hi Eszter,

Thank you for the clarification. I realized that I had mistakenly treated the "pos" column as the "expression" column when I didn't input any gene expression data. Could you please provide further explanation about the "pos" column? Thanks

elakatos commented 1 year ago

Pos is taken directly from the output of netMHCpan 4.0; see "Output format" on https://services.healthtech.dtu.dk/services/NetMHCpan-4.0/. It is the position of the k-mer evaluated for binding (listed in the peptide column) within the amino-acid sequence supplied as input (the sequences you would find in the intermediate fasta files). For instance, in the example provided in our Readme, rows 4 and 5 come from the same mutation but have different pos values, and the peptides are clearly related: we supplied the input ..FTHGPSSTPLHPCPF (.. standing for two additional amino acids), and the two peptides FTHGPSSTPL and SSTPLHPCPF are shown in the output table at pos 2 and 7.
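To make the relationship between pos and the supplied sequence concrete, here is a minimal sketch that slides a 10-residue window over the example input. It is only an illustration, not NeoPredPipe or netMHCpan code: the function name `kmers_with_offsets` is hypothetical, "XX" is a placeholder for the two amino acids elided as ".." above, and offsets are counted from 0 because that reproduces the pos values quoted in the example (check the netMHCpan "Output format" page for the exact indexing convention).

```python
def kmers_with_offsets(sequence, k):
    """Return (offset, peptide) pairs for every k-mer in sequence (0-based offsets)."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


# "XX" stands in for the two unspecified leading amino acids (an assumption).
supplied = "XXFTHGPSSTPLHPCPF"
windows = dict(kmers_with_offsets(supplied, 10))

print(windows[2])  # FTHGPSSTPL
print(windows[7])  # SSTPLHPCPF
```

Both peptides come from the same supplied sequence, which is why two output rows for one mutation carry different pos values.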

fangfang0906 commented 1 year ago

Hi, I noticed the pos column in ".neoantigens.txt" ranges from 1 to 15, which makes sense. However, it ranges from 1 to 3478 with a median of 398 in ".neoantigens.Indels.txt". Could you comment on this? Thanks!

elakatos commented 1 year ago

Hi! This is because when an indel mutation occurs, most of the time the complete reading frame shifts, so potentially hundreds of new peptides are generated from the sequence after the mutation. We therefore test all peptides generated up to the predicted stop codon of the new reading frame. For point mutations, only a single amino acid changes, so only peptides that contain it are novel, and we test only those. The pos value simply tells you where the neoantigen peptide's start position lies relative to all peptides tested; we start 8-9 amino acids upstream of the one where the mutation occurred.
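The difference in pos range can be sketched by counting candidate windows in each case. This is a toy illustration under stated assumptions, not the pipeline's actual implementation: the function names are hypothetical, indices are 0-based positions in a made-up protein string, and the frameshift tail is assumed to run 200 residues before the new stop codon.

```python
def windows_point_mutation(protein, mut_index, k):
    """k-mers that contain the single changed residue at mut_index (0-based)."""
    first = max(0, mut_index - k + 1)
    last = min(mut_index, len(protein) - k)
    return [(i, protein[i:i + k]) for i in range(first, last + 1)]


def windows_frameshift(protein, shift_index, k):
    """k-mers that overlap any residue from shift_index up to the new stop."""
    first = max(0, shift_index - k + 1)
    return [(i, protein[i:i + k]) for i in range(first, len(protein) - k + 1)]


# Toy sequences (not real proteins): one changed residue vs. a 200-residue novel tail.
point = windows_point_mutation("A" * 30, mut_index=15, k=9)
shifted = windows_frameshift("A" * 20 + "C" * 200, shift_index=20, k=9)

print(len(point))    # 9
print(len(shifted))  # 200
```

A point mutation yields at most k windows per peptide length, while a frameshift yields roughly one window per novel residue, so start positions in the hundreds or thousands are expected for long shifted frames.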

fangfang0906 commented 1 year ago

Hi, thanks for the reply! It makes sense now. Another question: have you considered including the wild-type peptide and its binding affinity, to ensure the specificity of the predicted neoantigens? It would be great to have these columns in the output file. Thank you!

elakatos commented 1 year ago

Hi! You can access that information with the pipeline now if you also run the NeoRecoPo step of the computation: https://github.com/MathOnco/NeoPredPipe/blob/master/RecognitionPotential.md. The output of this step combines a specificity value and a similarity-to-known-pathogen value, as described in this publication: https://www.nature.com/articles/nature24473. Additionally, when you run this step, an intermediate table called "Neoantigens.WTandMTtable.txt" is produced that explicitly reports the binding affinity (not percentage rank!) of both the mutant and the wild-type peptide, so you can use it to define your own filtering. Just make sure to run NeoRecoPo with the -d option to keep intermediate files.
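As one possible shape for such custom filtering, here is a minimal sketch that keeps peptides whose mutant form binds strongly and clearly better than the wild-type form. Everything specific in it is an assumption: the rows are made-up values rather than real output, the tuple layout does not reflect the actual columns of "Neoantigens.WTandMTtable.txt" (check the file header before adapting it), and the 500 nM cutoff and 5-fold ratio are illustrative thresholds. It relies only on the fact that a lower affinity value in nM means stronger binding.

```python
# Made-up rows: (peptide label, mutant affinity in nM, wild-type affinity in nM).
rows = [
    ("PEPTIDE_A", 120.0, 4500.0),   # strong mutant binder, weak wild type
    ("PEPTIDE_B", 80.0, 95.0),      # both forms bind similarly
    ("PEPTIDE_C", 2500.0, 3000.0),  # mutant form is a weak binder
]


def mutant_specific(rows, max_mt_nm=500.0, min_fold=5.0):
    """Keep peptides with mutant affinity <= max_mt_nm and wild-type/mutant >= min_fold."""
    return [
        peptide
        for peptide, mt_nm, wt_nm in rows
        if mt_nm <= max_mt_nm and wt_nm / mt_nm >= min_fold
    ]


selected = mutant_specific(rows)
print(selected)  # ['PEPTIDE_A']
```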