23andMe / yhaplo

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

csv.Error: field larger than field limit (131072) #11

Closed rozaimirazali closed 5 years ago

rozaimirazali commented 5 years ago

My command:

callHaplogroups.py --input QGP_MALE_onlyGT.vcf.gz --ancDerCounts --haplogroupPaths --derSNPs --derSNPsDetail --ancSNPs --ancSNPsDetail

tail of the error message:

Writing trees...

Wrote tree with YCC labels: output/y.tree.ycc.2016.01.04.nwk

Wrote tree with representative-SNP labels: output/y.tree.hgSNP.2016.01.04.nwk

Wrote aligned tree with YCC labels: output/y.tree.aligned.ycc.2016.01.04.nwk

Wrote aligned tree with representative-SNP labels: output/y.tree.aligned.hgSNP.2016.01.04.nwk

Traceback (most recent call last):
  File "/gpfs/home/rmohamadrazali/software/yhaplo-master/callHaplogroups.py", line 45, in <module>
    callHaplogroups()
  File "/gpfs/home/rmohamadrazali/software/yhaplo-master/callHaplogroups.py", line 24, in callHaplogroups
    Sample.callHaplogroups(config, tree)
  File "/gpfs/home/rmohamadrazali/software/yhaplo-master/sample.py", line 297, in callHaplogroups
    Sample.runFromVCF()
  File "/gpfs/home/rmohamadrazali/software/yhaplo-master/sample.py", line 399, in runFromVCF
    Sample.loadDataFromVCF()
  File "/gpfs/home/rmohamadrazali/software/yhaplo-master/sample.py", line 414, in loadDataFromVCF
    Sample.setSampleListFromVCFheader(vcfReader)
  File "/gpfs/home/rmohamadrazali/software/yhaplo-master/sample.py", line 448, in setSampleListFromVCFheader
    for lineList in vcfReader:
_csv.Error: field larger than field limit (131072)

Why does it treat the input file as a CSV file instead of a VCF?

dpoznik commented 5 years ago

Thanks for bringing this to my attention. It looks like you have a very long line in the header of your VCF, and this is breaking csv.reader. Can you confirm that stripping out the header works as a temporary solution?

file_label=QGP_MALE_onlyGT
# Drop the "##" metadata lines; the "#CHROM" column header and the records are kept.
zcat "${file_label}.vcf.gz" | grep -v '^##' > "${file_label}.no_header.vcf"
callHaplogroups.py --input "${file_label}.no_header.vcf" --ancDerCounts --haplogroupPaths --derSNPs --derSNPsDetail --ancSNPs --ancSNPsDetail

Assuming so, I'll patch yhaplo to make it robust to this situation.

Thanks!
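
For anyone wondering why a CSV error shows up for a VCF: the traceback (`for lineList in vcfReader` raising `_csv.Error`) suggests the tab-delimited VCF is being read with Python's `csv` module, which rejects any single field longer than its limit (131072 characters by default). The sketch below reproduces that failure in isolation. The header content and the `first_csv_error` helper are made up for illustration and are not taken from yhaplo or from the reporter's file.

```python
import csv
import io

# Hypothetical VCF text: one oversized "##" metadata line (no tabs, so the whole
# line is a single field), followed by a normal column header line.
long_meta = "##hypotheticalMeta=" + "A" * 200_000
column_header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "sample1"]
)
vcf_text = long_meta + "\n" + column_header + "\n"


def first_csv_error(handle):
    """Return the message of the first csv.Error raised while reading, else None."""
    reader = csv.reader(handle, delimiter="\t")
    try:
        for _ in reader:
            pass
    except csv.Error as exc:
        return str(exc)
    return None


error_message = first_csv_error(io.StringIO(vcf_text))
print(error_message)
```

With the default limit in place, the printed message should match the last line of the traceback above.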

dpoznik commented 5 years ago

I've just pushed a fix to support VCF files with very long metadata lines. Please try running the updated version of yhaplo on your original file, and let me know how it goes.
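
For readers hitting the same error in their own tab-delimited parsers, here is a minimal sketch of one way to make a reader robust to long metadata lines: filter out the `##` lines before they ever reach `csv.reader`. This is an assumption about how such a fix could look, not the actual yhaplo patch; `iter_vcf_rows` and the sample data are hypothetical.

```python
import csv
import io


def iter_vcf_rows(handle):
    """Yield tab-split rows, skipping "##" metadata lines before csv parses them."""
    # csv.reader accepts any iterable of strings, so a filtering generator works.
    data_lines = (line for line in handle if not line.startswith("##"))
    yield from csv.reader(data_lines, delimiter="\t")


# Hypothetical input: an oversized metadata line, the column header, one record.
vcf_text = (
    "##hypotheticalMeta=" + "A" * 200_000 + "\n"
    + "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    + "Y\t2781653\t.\tA\tG\t.\t.\t.\tGT\t1\n"
)

rows = list(iter_vcf_rows(io.StringIO(vcf_text)))
print(rows[0][0], len(rows))
```

Another option is to raise the limit with `csv.field_size_limit(new_limit)`, but skipping the metadata lines avoids parsing text the reader does not need in the first place.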