Clinical-Genomics / genmod

Annotate models of genetic inheritance patterns in variant files (vcf files)
http://moonso.github.io/genmod/
MIT License
74 stars 18 forks source link

bug in parsing VCF header with filters that use escape character #101

Closed hassanfa closed 3 years ago

hassanfa commented 3 years ago

Trying to do what:

genmod score -r -c cancer_rank_model_-v0.1-.ini \
  SNV.somatic.fineroughy.vardict.all.filtered.pass.vcf.gz \
  -o SNV.somatic.fineroughy.vardict.all.filtered.pass.ranked.vcf.gz

Error:

Traceback (most recent call last):
  File "/opt/conda/envs/annotate/bin/genmod", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/genmod/commands/score_variants.py", line 108, in score
    head.parse_meta_data(line)
  File "/opt/conda/envs/annotate/lib/python3.7/site-packages/genmod/vcf_tools/header_parser.py", line 127, in parse_meta_data
    raise SyntaxError("One of the FILTER lines is malformed: {0}".format(line))
SyntaxError: One of the FILTER lines is malformed: ##FILTER=<ID=balsamic_high_pop_freq,Description="Set if not true: INFO/GNOMADAF_popmax <= 0.005 || INFO/GNOMADAF_popmax == \".\"">

It seems Genmod doesn't support escape characters in the filter description. According to VCF 4.2 it is a valid Description text:

The Description value must be surrounded by double-quotes. Double-quote character can be escaped with backslash \ and backslash as \\
moonso commented 3 years ago

Could explain some more? What would the entry look like? is it balsamic_high_pop_freq=\".\"?

hassanfa commented 3 years ago

We use bcftools to add filter to variants, and "Set if not true: INFO/GNOMADAF_popmax <= 0.005 || INFO/GNOMADAF_popmax == \".\"" string is autogenerated by bcftools.

Things I have tried:

  1. vcftools VCF validator passed the VCF
  2. bcftools view with various filters works fine.
  3. Parsing VCF via cyvcf2 also worked.

So it seems the genmod matches the filter string in a peculiar way that causes above issue.

Any input?

moonso commented 3 years ago

I'm not saying that genmod does it right. I'm just interested in how a entry on variant could look like

hassanfa commented 3 years ago

I see, it looks like this (I removed CSQ):

##FILTER=<ID=balsamic_low_mq,Description="Set if not true: SMPL_MIN(FMT/MQ) >= 40.0">
##bcftools_filterVersion=1.11+htslib-1.11
##bcftools_filterCommand=filter --include 'SMPL_MIN(FMT/MQ) >= 40.0' --soft-filter balsamic_low_mq --mode +; Date=Mon Feb 15 19:42:05 2021
##FILTER=<ID=balsamic_low_tumor_dp,Description="Set if not true: INFO/DP >= 100.0">
##bcftools_filterCommand=filter --include 'INFO/DP >= 100.0' --soft-filter balsamic_low_tumor_dp --mode +; Date=Mon Feb 15 19:42:05 2021
##FILTER=<ID=balsamic_low_tumor_ad,Description="Set if not true: INFO/VD >= 5.0">
##bcftools_filterCommand=filter --include 'INFO/VD >= 5.0' --soft-filter balsamic_low_tumor_ad --mode +; Date=Mon Feb 15 19:42:05 2021
##FILTER=<ID=balsamic_low_af,Description="Set if not true: INFO/AF >= 0.01">
##bcftools_filterCommand=filter --include 'INFO/AF >= 0.01' --soft-filter balsamic_low_af --mode +; Date=Mon Feb 15 19:42:05 2021
##FILTER=<ID=balsamic_af_one,Description="Set if not true: INFO/AF <  1.0">
##bcftools_filterCommand=filter --include 'INFO/AF <  1.0' --soft-filter balsamic_af_one --mode +; Date=Mon Feb 15 19:42:05 2021
##FILTER=<ID=balsamic_high_pop_freq,Description="Set if not true: INFO/GNOMADAF_popmax <= 0.005 || INFO/GNOMADAF_popmax == \".\"">
##bcftools_filterCommand=filter --include 'INFO/GNOMADAF_popmax <= 0.005 || INFO/GNOMADAF_popmax == "."' --soft-filter balsamic_high_pop_freq --mode +; Date=Mon Feb 15 19:42:05 2021
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TUMOR   NORMAL
22      16287232        .       C       T       25      v3;f0.005;balsamic_low_mq;balsamic_low_tumor_dp;balsamic_low_tumor_ad;balsamic_low_af   STATUS=StrongLOH;SAMPLE=fineroughy;TYPE=SNV;DP=91;VD=0;AF=0;SHIFT3=0;MSI=2;MSILEN=3;SSF=0.31758;SOR=0;LSEQ=ATCCTCCCACATCCCACCTC;RSEQ=TCCCAGCCCAGGCCTGGTTA GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM        0/0:91:0:0,0:29,62:91,0:0:2,0:25.9:1:37:0:1:0:16.6:182:1:0:0.3  0/1:118:2:0,2:54,62:116,2:0.0169:2,0:32:1:25:0:0.49935:0:20:4:0.0169:0:1
22  16287365    .   C   T   58  v3;f0.005;pSTD;balsamic_low_tumor_ad;balsamic_low_af;balsamic_high_pop_freq STATUS=LikelyLOH;SAMPLE=fineroughy;TYPE=SNV;DP=273;VD=1;AF=0.0037;SHIFT3=0;MSI=1;MSILEN=1;SSF=0.44716;SOR=0.46618;LSEQ=GGAGCTTGTCCAGATCTTCT;RSEQ=GACGGACGTGGTACCTCGGC;GNOMADAF=0.0043256;GNOMADAF_popmax=0.01463    GT:DP:VD:ALD:RD:AD:AF:BIAS:PMEAN:PSTD:QUAL:QSTD:SBF:ODDRATIO:MQ:SN:HIAF:ADJAF:NM    0/0:273:1:0,1:27,245:272,1:0.0037:2,0:13:0:37:0:1:0:48:2:0.0037:0:1 0/1:383:3:0,3:43,337:380,3:0.0078:2,0:47.3:1:37:0:1:0:24.3:6:0.0079:0:1
moonso commented 3 years ago

Aha now I understand, it is only the line in the header! The actual entry is just balsamic_high_pop_freq or nothing. Then it is easy to fix :)

hassanfa commented 3 years ago

Nice! I didn't know what was the bug referring to. 👏🏽

moonso commented 3 years ago

Can you try this branch parse-new-header-format ?

hassanfa commented 3 years ago

It worked and rankscore added to the variants. Can you please ping when you release it to pypi?

moonso commented 3 years ago

Released on pypi!