effidotpy / mito-variant-calling

Whitespace in the gnomAD INFO field #11

Open effidotpy opened 8 months ago

effidotpy commented 8 months ago

When I try to annotate with Haplogrep2, it fails because the INFO field contains whitespace that appears to come from gnomAD. The source gnomAD VCF already contains it, so I need to fix it in the Dockerfile.

The exception is as follows:

(base) dani@toshiba:/tmp/borrar$ ./haplogrep classify --in annotated_gnomad.vcf --out annotated_haplogrep2.vcf --format vcf --extend-report  
mtDNA Haplogroup Classifiction v2.4.0
https://github.com/seppinho/haplogrep-cmd
(c) Sebastian Schönherr, Hansi Weissensteiner, Lukas Forer, Dominic Pacher
sebastian.schoenherr@i-med.ac.at

[classify, --in, annotated_gnomad.vcf, --out, annotated_haplogrep2.vcf, --format, vcf, --extend-report]
phylotree17_FU1.xml
Parameters:
Input format: vcf
Phylotree version: 17_FU1
Reference: rCRS
Extended report: true
Skip alignment rules: false
Used metric: kulczynski
Chip array data: false
Lineage: 0

Start Classification...
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 240: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AS_FilterStatus=SITE;AS_SB_TABLE=2087,2353|10,14;DP=5571;ECNT=1;MBQ=39,40;MFRL=180,179;MMQ=60,60;MPOS=20;OCM=0;POPAF=2.40;RPA=8,7;RU=A;STR;TLOD=2.72;vep=[-|frameshift_variant|HIGH|MT-ND5|ENSG00000198786|Transcript|ENST00000361567|protein_coding|1/1||ENST00000361567.2:c.89del|ENSP00000354813.2:p.Asn30ThrfsTer7|82|82|28|K/X|Aaa/aa|1||1|deletion||HGNC|HGNC:7461|YES||P1||ENSP00000354813||||1|||ENSP_mappings:5xtc&ENSP_mappings:5xtd&ENSP_mappings:5xth&ENSP_mappings:5xti&ENSP_mappings:5xti&PANTHER:PTHR42829&PANTHER:PTHR42829&TIGRFAM:TIGR01974|7|||||HC|||PERCENTILE:0.0491169977924945, GERP_DIST:-3294.1159532547, BP_DIST:1730, DIST_FROM_LAST_EXON:-81, 50_BP_RULE:FAIL, PHYLOCSF_TOO_SHORT];AN=56299;AC_hom=0;AC_het=4", for input source: file:///tmp/borrar/annotated_gnomad.vcf
    at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:883)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseInfo(AbstractVCFCodec.java:515)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:428)
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:375)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:354)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:315)
    at importer.VcfImporter.load(VcfImporter.java:46)
    at genepi.commands.HaplogrepCommand.call(HaplogrepCommand.java:156)
    at genepi.commands.HaplogrepCommand.call(HaplogrepCommand.java:20)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at genepi.App.main(App.java:59)

effidotpy commented 8 months ago

It turns out the cause is not the gnomAD VCF file but VarNote, which inserts this whitespace. I think that is VarNote's behavior when it finds commas within a field. I prefer not to change the source info, so I am adding another step to the pipeline that replaces each ", " (comma followed by a space) with "," (a bare comma).
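
A minimal sketch of what such a cleanup step could look like, assuming the annotated VCF is plain tab-separated text with INFO as the eighth column. The function names `strip_info_whitespace` and `clean_vcf` are hypothetical and not part of this repository; the sketch only collapses ", " to "," inside INFO and leaves header lines and every other column untouched.

```python
from typing import Iterable, Iterator

INFO_COL = 7  # 0-based index of the INFO column in a VCF data line


def strip_info_whitespace(line: str) -> str:
    """Return a VCF line with ', ' collapsed to ',' in the INFO column only."""
    # Header lines (starting with '#') are passed through unchanged.
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    # Lines with fewer than eight columns are left as they are.
    if len(fields) > INFO_COL:
        fields[INFO_COL] = fields[INFO_COL].replace(", ", ",")
    # The line is always written back with a trailing newline.
    return "\t".join(fields) + "\n"


def clean_vcf(lines: Iterable[str]) -> Iterator[str]:
    """Lazily clean every line of a VCF, so large files are streamed."""
    for line in lines:
        yield strip_info_whitespace(line)
```

In the pipeline this could be applied by opening the VarNote output (for example `annotated_gnomad.vcf`), passing the file object to `clean_vcf`, and writing the result to a new file that is then given to Haplogrep. Writing to a new file keeps the original annotation output unmodified.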