MathOnco / NeoPredPipe

Neoantigens prediction pipeline for multi- or single-region vcf files using ANNOVAR and netMHCpan.
GNU Lesser General Public License v3.0

Wildtype output field #31

Closed Jprazich2 closed 2 years ago

Jprazich2 commented 2 years ago

Hello, I was wondering whether there is an argument option to output the wild-type epitope that the mutant epitope arose from. I can't back-calculate the wild-type epitope, even though there are output fields for the mutation position and the reference and alternative nucleotides. For example, if the mutated amino acid's codon is AAC and the mutated nucleotide is an A, I don't know which of the two A's replaced the reference nucleotide. Please let me know, thanks!

elakatos commented 2 years ago

Apologies on the slow response, writing commitments got in the way...

Unfortunately there's no option for this in the basic NeoPredPipe pipeline: we only ever evaluate the mutated peptide, so we do not record the wild-type one separately. However, you can use the intermediate files of NeoPredPipe or NeoRecoPo, depending on what exactly you need from the wild-type peptide.

1) If you run the second step of the analysis, NeoRecoPo, we do evaluate the wild-type counterpart and its binding ability in that step. This pipeline produces an intermediate file called Neoantigens.WTandMTtable.txt that contains wild-type and mutated peptide pairs (the amino acid sequences), together with their respective binding affinities. There are also samplename.wildtype.tmp.length.fasta files that contain just the sequences of the WT peptides. One thing to note is that in this step we only consider epitopes that were deemed antigenic, with a binding affinity <=500, in the first neoantigen prediction step by NeoPredPipe. So a few mutated peptides might be filtered out before you'd get the WT information.

2) Alternatively, you can process the intermediate files of NeoPredPipe to retrieve the wild-type sequence. In fastaFiles, the files samplename.fasta and samplename.reformat.fasta contain one entry each for the wild-type and the mutated peptide sequence of the whole gene product, and the header of the mutated entry contains the location of the mutation. Like this:

>line112 NM_001301060 c.G1006T p.G336C protein-altering (position 336 changed from G to C)
MAAAGEGTPSSRGPRRDPPRRPPRNGYGVYVYPNSFFRYEGEWKAGRKHGHGKLLFKDGSYYEGAFVDGEITGEGRRHWAWSGDTFSGQFVLGEPQGYGVMEYKAGGCYEGEVSHGMREGHGFLVDRDGQVYQGSFHDNKRHGPGQMLFQNGDKYDGDWVRDRRQGHGVLRCADGSTYKGQWHSDVFSGLGSMAHCSGVTYYGLWINGHPAEQATRIVILGPEVMEVAQGSPFSVNVQLLQDHGEIAKSESGRVLQISAGVRYVQLSAYSEVNFFKVDRDNQETLIQTPFGFECIPYPVSSPAAGVPGPRAAKGGAEADVPLPRGDLELHLGALHCQEDTPGGLLGSSLF

By parsing this, you could extract the mutated position (336) and then use the wild-type sequence, which is reported in the entry just before it, to get the peptide sequence with N flanking amino acids on each side of that position. (This is something we implement to get the mutated sequence in ExtractSeq in vcf_manipulate.py, to give a starting point for the code.) This method is way more involved, but no epitopes are filtered out.
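As a rough sketch of option 2 (this is not code from the NeoPredPipe repository, and the function names `read_fasta` and `wildtype_windows` are made up for illustration): the snippet below assumes that each mutated entry directly follows its wild-type entry, that the mutated header contains the text "(position N changed from X to Y)" as in the example above, and that only single amino acid substitutions are of interest. The demo records are invented, not real pipeline output.

```python
import re

# Assumed header pattern, taken from the example header in this thread.
POSITION_RE = re.compile(r"\(position (\d+) changed from ([A-Z*]) to ([A-Z*])\)")


def read_fasta(text):
    """Return a list of (header, sequence) tuples in file order."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def wildtype_windows(text, flank=8):
    """Yield (mutated_header, wt_window, mt_window) for each substitution.

    The window spans `flank` residues on each side of the mutated position,
    truncated at the ends of the protein.
    """
    entries = read_fasta(text)
    for i in range(1, len(entries)):
        header, mt_seq = entries[i]
        match = POSITION_RE.search(header)
        if match is None:
            continue  # not a mutated entry in the assumed format
        pos = int(match.group(1))  # 1-based amino acid position
        ref_aa, alt_aa = match.group(2), match.group(3)
        _, wt_seq = entries[i - 1]  # assumption: WT entry is the previous one
        # Skip records that do not agree with the header annotation.
        if pos > len(wt_seq) or wt_seq[pos - 1] != ref_aa:
            continue
        if pos > len(mt_seq) or mt_seq[pos - 1] != alt_aa:
            continue
        start = max(0, pos - 1 - flank)
        end = pos + flank
        yield header, wt_seq[start:end], mt_seq[start:end]


# Invented two-entry example (wild-type entry followed by mutated entry).
DEMO = """>line1 NM_000000 WILDTYPE
MAAAGEGTPSSRGPRRD
>line1 NM_000000 c.G16A p.E6K protein-altering (position 6 changed from E to K)
MAAAGKGTPSSRGPRRD
"""

for _, wt, mt in wildtype_windows(DEMO, flank=3):
    print(wt, mt)
```

With `flank=3` the demo prints the 7-residue windows `AAGEGTP` (wild type) and `AAGKGTP` (mutated). For real use you would read the contents of samplename.fasta or samplename.reformat.fasta into `text` and pick `flank` to match the epitope lengths you predicted (for example, length minus one). Frameshifts and indels need different handling and are simply skipped here by the consistency checks.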

Best, Eszter