Background:
In FILTER, multiple filters should be separated by semicolons. The widely used, but not actively maintained, VarScan2 genomic variant caller uses commas instead. Moreover, VarScan2 does not add ##FILTER metadata for most of its filters. Picard FixVcfHeader can be used to fix missing FILTER metadata. A "fixed" metadata row will look like:
##FILTER=<ID="RefAvgRL,VarAvgRL",Description="Missing description: this FILTER line was added by Picard's FixVCFHeader">
Error:
PyVCF fails with:
```
Traceback (most recent call last):
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 236, in <module>
    main()
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 232, in main
    run(parser.parse_args())
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 166, in run
    df_1 = vcf_to_dataframe(args.vcf_1)
  File "/mnt/hdd/dnanexus/scripts_local/compare_vcfs.py", line 74, in vcf_to_dataframe
    vcf_reader = vcf.Reader(open(vcf_file, "r"))
  File "/home/myourshaw/.venv/dnanexus/lib/python3.10/site-packages/vcf/parser.py", line 300, in __init__
    self._parse_metainfo()
  File "/home/myourshaw/.venv/dnanexus/lib/python3.10/site-packages/vcf/parser.py", line 326, in _parse_metainfo
    key, val = parser.read_filter(line)
  File "/home/myourshaw/.venv/dnanexus/lib/python3.10/site-packages/vcf/parser.py", line 142, in read_filter
    raise SyntaxError(
SyntaxError: One of the FILTER lines is malformed: ##FILTER=<ID="RefAvgRL,VarAvgRL",Description="Missing description: this FILTER line was added by Picard's FixVCFHeader">
```
Issue:
It might be more robust for PyVCF to treat a filter with commas as just one big filter name, as does Picard FixVcfHeader.
Instead of raising an exception, accept metadata with a filter ID inside double quotes and containing commas, e.g., ID="RefAvgRL,VarAvgRL".
Similarly, in the data, treat a FILTER value like RefAvgRL,VarAvgRL as a single entity. I think this solution is consistent with the VCF 4.2 spec for a filter name: String, no whitespace or semicolons permitted.
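To illustrate the "single entity" idea for the data lines, here is a minimal sketch. `split_filter_column` is a hypothetical helper written for this example, not a PyVCF function; it assumes that only semicolons separate filters, so a comma simply stays part of the filter name.

```python
def split_filter_column(value):
    """Split a FILTER column value into a list of filter names.

    Hypothetical helper, not part of PyVCF. Assumes only semicolons
    separate filters, so a comma stays inside the filter name.
    """
    if value == ".":
        return None   # no filtering information recorded
    if value == "PASS":
        return []     # the record passed every filter
    return value.split(";")


print(split_filter_column("RefAvgRL,VarAvgRL"))  # ['RefAvgRL,VarAvgRL']
print(split_filter_column("q10;s50"))            # ['q10', 's50']
```

With this reading, a VarScan2 value such as RefAvgRL,VarAvgRL maps to the one header ID that Picard FixVcfHeader writes, provided the quotes around the header ID are stripped as well.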
Possible pull request:
This hack (changing [^,]+ to .+) worked to get me through an urgent analysis, but it may not be the best solution. At parser.py line 142:
self.filter_pattern = re.compile(r'''\#\#FILTER=< ID=(?P<id>.+),\s* Description="(?P<desc>[^"]*)" >''', re.VERBOSE)
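For anyone who wants to check the behaviour without patching PyVCF, the sketch below compares three patterns on the failing header line using only the standard `re` module. The names `STRICT`, `RELAXED` and `QUOTED` are made up for this example, and the group names `id` and `desc` are an assumption about what the original pattern uses. `QUOTED` is one possible narrower alternative to `.+`: it accepts either a double-quoted ID (which may contain commas) or an unquoted ID without commas.

```python
import re

# The header line from the error message above.
HEADER = '''##FILTER=<ID="RefAvgRL,VarAvgRL",Description="Missing description: this FILTER line was added by Picard's FixVCFHeader">'''

# Assumed shape of the current pattern: the ID may not contain a comma.
STRICT = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>[^,]+),\s*
    Description="(?P<desc>[^"]*)"
    >''', re.VERBOSE)

# The hack from this issue: the ID may be anything.
RELAXED = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>.+),\s*
    Description="(?P<desc>[^"]*)"
    >''', re.VERBOSE)

# Hypothetical alternative: a quoted ID may contain commas, an unquoted one may not.
QUOTED = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>"[^"]*"|[^,]+),\s*
    Description="(?P<desc>[^"]*)"
    >''', re.VERBOSE)

print(STRICT.match(HEADER))                        # None
print(RELAXED.match(HEADER).group("id"))           # "RefAvgRL,VarAvgRL"
print(QUOTED.match(HEADER).group("id").strip('"')) # RefAvgRL,VarAvgRL
```

Note that both relaxed patterns capture the ID with its surrounding double quotes, so the quotes would have to be stripped before comparing the ID with values in the FILTER column.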
=======
I get the same problem. Any update on this issue?
Issue from https://github.com/jamescasbon/PyVCF/issues/337
I hoped switching to PyVCF3 (cf. https://github.com/jamescasbon/PyVCF/issues/335) would solve the issue, but apparently it does not.
My bad: in my case the problem came from a Source tag in a FILTER header line:
##FILTER=<ID=xxx,Description="yyy",Source="zzz">
According to https://samtools.github.io/hts-specs/, Source is an INFO header tag, not a FILTER header tag.
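In case it helps someone with the same kind of header, here is a small sketch of why the extra tag trips a strict pattern, and one way a parser could tolerate it. `STRICT` is an assumed reconstruction of the kind of pattern involved, and `TOLERANT` is a hypothetical variant for illustration; neither is taken from PyVCF or PyVCF3.

```python
import re

LINE = '##FILTER=<ID=xxx,Description="yyy",Source="zzz">'

# Assumed strict form: only ID and Description are allowed.
STRICT = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>[^,]+),\s*
    Description="(?P<desc>[^"]*)"
    >''', re.VERBOSE)

# Hypothetical tolerant form: trailing key="value" pairs are skipped.
TOLERANT = re.compile(r'''\#\#FILTER=<
    ID=(?P<id>[^,]+),\s*
    Description="(?P<desc>[^"]*)"
    (?:,\s*\w+="[^"]*")*
    >''', re.VERBOSE)

print(STRICT.match(LINE))  # None

m = TOLERANT.match(LINE)
print(m.group("id"), m.group("desc"))  # xxx yyy
```

The tolerant form silently drops the extra tags, so removing the Source tag from the header is probably the cleaner fix when the file can be edited.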