MathOnco / NeoPredPipe

Neoantigens prediction pipeline for multi- or single-region vcf files using ANNOVAR and netMHCpan.
GNU Lesser General Public License v3.0

how to set -c option? #33

Closed XiangweiZhai closed 2 years ago

XiangweiZhai commented 2 years ago

Hi, I am confused about the -c option. In the README, -c is described as the column position of the normal sample within a multiregion VCF file. The header of test1.vcf is:

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | N | T1 | T2

I ran the pipeline with the following command:

python NeoPredPipe.py -I ./Example/input_vcfs -H ./Example/HLAtypes/hlatypes.txt -o ./ -n TestRun -c 1 2 -E 8 9 10

There was no error message during execution. The result is:

Sample | R1 | R2 | Line | chr | allelepos | ref | alt
test1 | 0 | 1 | line16 | chr1 | 153914523 | G | C
test1 | 0 | 1 | line16 | chr1 | 153914523 | G | C
test1 | 0 | 1 | line31 | chr2 | 175268887 | C | T

I only get two regions (R1 and R2), but there are three regions (R1, R2, R3) in the example. It seems that the first column is not recognized as normal tissue by default. If I use -c 0 1 2 instead of -c 1 2, where position 0 corresponds to the column of the normal tissue, I get this:

Sample | R1 | R2 | R3 | Line | chr | allelepos | ref | alt
test1 | 0 | 0 | 1 | line16 | chr1 | 153914523 | G | C
test1 | 0 | 0 | 1 | line16 | chr1 | 153914523 | G | C
test1 | 0 | 0 | 1 | line31 | chr2 | 175268887 | C | T
test1 | 0 | 1 | 0 | line45 | chr5 | 140559195 | C | T

This result shows three regions (R1, R2 and R3), and every value in the R1 column is 0, possibly because normal tissue does not carry mutations. However, the example result in README.md shows the R3 column as all 0. I think the result I got seems more reasonable.

In practice, my problem is that the normal tissue column in my VCF file sits in the middle of the other tissue columns, for example:

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | primary1 | primary2 | normal | metastatic1 | metastatic2

Can I set the parameter like this: -c 1 2 0 3 4?

Looking forward to your reply, thanks!

rschenck commented 2 years ago

In summary, the -c option determines which of the genotype fields in the VCF file to use, i.e. the ones that ARE NOT NORMAL.

It has to match the VCF you feed it. You should be able to set these in any order, and you shouldn't give it the index of the normal. If you do, it will attempt to use the normal's information in the neoantigen prediction output. I did this because VCF files from multi-region variant calling or downstream analysis don't always put the normal sample first in the genotype columns.

If you specify columns for the examples other than the ones specified, I'd expect it not to work properly.
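To see which genotype columns a given VCF actually has before choosing -c, something like the following may help. It is only a minimal sketch, not part of NeoPredPipe: the helper name `genotype_columns` is hypothetical, it assumes a standard tab-separated `#CHROM` header line, and the offsets it prints are counted from zero starting at the first column after FORMAT. Whether -c counts columns in exactly this way is an assumption that should be checked against the README example.

```python
def genotype_columns(header_line):
    """Return the sample names that follow the FORMAT column of a VCF header line.

    Hypothetical helper for illustration; it is not part of NeoPredPipe.
    """
    fields = header_line.lstrip("#").rstrip("\n").split("\t")
    format_pos = fields.index("FORMAT")  # raises ValueError if FORMAT is missing
    return fields[format_pos + 1:]


# Header layout taken from the question above (normal sample in the middle).
header = "#" + "\t".join([
    "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
    "primary1", "primary2", "normal", "metastatic1", "metastatic2",
])

# Print each genotype column with its zero-based offset after FORMAT.
for offset, name in enumerate(genotype_columns(header)):
    print(offset, name)
```

With the offsets listed, the normal column is easy to spot and leave out of -c.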

Hope this answers your question. Feel free to reopen the issue with your response if this didn't clarify the -c option.

XiangweiZhai commented 2 years ago

Thank you very much for your reply! I think I see what you mean. In most cases we may only be interested in some of the samples in a multi-region VCF file. So the -c option determines which columns in the VCF are not normal and are of interest to us, right? But this raises some new questions. For test1.vcf:

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | N | T1 | T2

Why is -c 1 2, rather than -c 2 3, used to indicate T1 and T2? I guess there may be two reasons:

  1. You may be using zero-based indexing to organize your data, so 0, 1, 2 indicate N, T1, T2 respectively.
  2. The genotypes of normal tissue are all 0/0, so the normal tissue column is ignored and counting starts from T1.

Which of the two is correct, or is there another reason? It's important for me to set the right -c option.
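For what it's worth, the first guess can be written down as a small sketch. It assumes (this is not confirmed in the thread) that -c takes zero-based offsets of the genotype columns after FORMAT and that the normal column is simply left out. The function name `non_normal_offsets` and the sample lists are hypothetical, chosen only to mirror the headers shown above.

```python
def non_normal_offsets(samples, normal_name):
    """Zero-based offsets of every genotype column except the normal one.

    ASSUMPTION: this is what -c expects; the thread does not confirm it.
    """
    if normal_name not in samples:
        raise ValueError(f"{normal_name!r} is not among the genotype columns")
    return [i for i, name in enumerate(samples) if name != normal_name]


# Genotype columns of test1.vcf and of the layout from the original question.
example_samples = ["N", "T1", "T2"]
my_samples = ["primary1", "primary2", "normal", "metastatic1", "metastatic2"]

print("-c", *non_normal_offsets(example_samples, "N"))   # prints: -c 1 2
print("-c", *non_normal_offsets(my_samples, "normal"))   # prints: -c 0 1 3 4
```

Under this assumption the sketch reproduces the -c 1 2 used in the example command, which is consistent with the first guess but does not prove it; the same reasoning would give -c 0 1 3 4 for the layout with the normal in the middle.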