bioinform / neusomatic

NeuSomatic: Deep convolutional neural networks for accurate somatic mutation detection

Invalid VCF format [postprocess.py] #34

Closed hdetering closed 5 years ago

hdetering commented 5 years ago

Hiya,

this may be just a minor bug, but it throws off tools in downstream analysis (e.g. GATK, vcfR). When postprocess.py generates the final VCF after calling, it outputs a format that is not fully standard-compliant:

##fileformat=VCFv4.2
##NeuSomatic Version=0.2.0
##FORMAT=<ID=SCORE,Number=1,Type=Float,Description="Prediction probability score">
##FILTER=<ID=PASS,Description="Accept as a higher confidence somatic mutation calls with probability score value at least 0.7">
##FILTER=<ID=LowQual,Description="Less confident somatic mutation calls with probability score value at least 0.4">
##FILTER=<ID=REJECT,Description="Rejected as a confident somatic mutation with probability score value below 0.4">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth in the tumor">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count in the tumor">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count in the tumor">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    96593   .       C       T       39.9993 PASS    SCORE=0.9999;DP=48;RO=22;AO=26;AF=0.5417;       GT:DP:RO:AO:AF  0/1:48:22:26:0.5417

I had problems with the following elements:

  1. INFO fields are not declared in the header
  2. trailing semicolon in INFO column
  3. not exactly an error, but why do DP, RO, AO, AF appear both in INFO and FORMAT?
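
For anyone stuck with a file in this state, the first two points can be worked around by patching the VCF text before passing it downstream. The following is only a minimal sketch, not the fix applied in the repository: `patch_vcf_lines` and `ASSUMED_INFO_HEADERS` are hypothetical names, and the INFO definitions are assumptions mirrored from the FORMAT lines in the header above. It assumes tab-separated records, as in a real VCF.

```python
import re

# Assumed INFO declarations, mirrored from the FORMAT/SCORE header lines in the
# file above. They are illustrative, not taken from the NeuSomatic source.
ASSUMED_INFO_HEADERS = {
    "SCORE": '##INFO=<ID=SCORE,Number=1,Type=Float,Description="Prediction probability score">',
    "DP": '##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth in the tumor">',
    "RO": '##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count in the tumor">',
    "AO": '##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count in the tumor">',
    "AF": '##INFO=<ID=AF,Number=1,Type=Float,Description="Allele fractions of alternate alleles in the tumor">',
}


def patch_vcf_lines(lines):
    """Add missing INFO declarations and strip a trailing ';' from the INFO column."""
    lines = [ln.rstrip("\n") for ln in lines]

    # Collect INFO IDs that the header already declares.
    declared = set()
    for ln in lines:
        match = re.match(r"##INFO=<ID=([^,>]+)", ln)
        if match:
            declared.add(match.group(1))

    patched = []
    for ln in lines:
        if ln.startswith("#CHROM"):
            # Insert any missing declarations just before the column header line.
            for key, header in ASSUMED_INFO_HEADERS.items():
                if key not in declared:
                    patched.append(header)
            patched.append(ln)
        elif ln.startswith("#") or not ln.strip():
            patched.append(ln)
        else:
            fields = ln.split("\t")
            if len(fields) > 7:
                # INFO is the 8th column; an empty INFO is written as "."
                fields[7] = fields[7].rstrip(";") or "."
            patched.append("\t".join(fields))
    return patched
```

Usage would be along the lines of reading the file, calling `patch_vcf_lines(open(path))`, and writing the result joined with newlines to a new file.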

Bonus question: What's the difference between QUAL and SCORE metrics?

Cheers. -- Harry

msahraeian commented 5 years ago

@hdetering Thanks for your comment, and sorry for the late reply. I fixed the VCF format in #38 and #39. Please pull master and try again. In some use cases it may be easier to have some information in both the INFO and FORMAT fields. Regarding your question about QUAL: it is the Phred-scaled SCORE. Best, Mohammad
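
To make the QUAL/SCORE relationship concrete, here is a small sketch of Phred scaling. It assumes the usual convention QUAL = -10 * log10(1 - SCORE), which is consistent with the example record in this issue (SCORE=0.9999, QUAL=39.9993), but the exact formula in postprocess.py is not shown in this thread. The function names and the `max_qual` cap are hypothetical.

```python
import math


def score_to_qual(score, max_qual=100.0):
    """Phred-scale a probability score: QUAL = -10 * log10(1 - score).

    max_qual is an assumed cap so that score == 1.0 does not yield infinity.
    """
    error = 1.0 - score
    if error <= 0.0:
        return max_qual
    return min(max_qual, -10.0 * math.log10(error))


def qual_to_score(qual):
    """Invert the Phred scaling: score = 1 - 10 ** (-QUAL / 10)."""
    return 1.0 - 10.0 ** (-qual / 10.0)


if __name__ == "__main__":
    print(score_to_qual(0.9999))
    print(qual_to_score(39.9993))
```

Under this assumption, a SCORE of 0.7 (the PASS threshold in the header) maps to a QUAL of roughly 5.2, and 0.4 (the LowQual threshold) to roughly 2.2.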