bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Some operations are locale-aware and can produce broken VCF files #97

Closed by lbeltrame 11 years ago

lbeltrame commented 11 years ago

GATK's CombineVariants, at least in the configuration I used for testing VarScan, is locale-aware: it writes either a decimal point or a comma as the decimal separator, depending on the active locale.

This in turn can produce broken VCF files, which contain commas instead of decimal points.

As the pipeline should not require a specific locale setting, LC_ALL should be overridden to C while the various processes are running.
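One way to do this from Python is to pass a modified environment to each subprocess instead of changing the user's session. This is only a minimal sketch, not bcbio's actual implementation; the helper name `run_with_c_locale` is hypothetical.

```python
import os
import subprocess
import sys


def run_with_c_locale(cmd, **kwargs):
    """Run cmd with LC_ALL=C without touching the caller's own environment.

    Hypothetical helper for illustration; not part of bcbio-nextgen.
    """
    # Copy the environment so the override only applies to the child process.
    env = dict(kwargs.pop("env", None) or os.environ)
    # LC_ALL takes precedence over LANG and the individual LC_* variables.
    env["LC_ALL"] = "C"
    return subprocess.run(cmd, env=env, check=True, **kwargs)


if __name__ == "__main__":
    # Show what the child process sees, whatever locale the parent runs under.
    result = run_with_c_locale(
        [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"],
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())
```

Because the override lives in the child's environment only, the user's interactive session keeps its own locale.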

chapmanb commented 11 years ago

Luca; What LC_ALL setting triggers the issue? I'll try to reproduce here and work on a general fix. Apologies in advance, I'm unfortunately a bit ignorant of all of the locale issues so will have to feel my way around.

lbeltrame commented 11 years ago

On Tuesday 3 September 2013 12:09:54, Brad Chapman wrote:

What LC_ALL setting triggers the issue? I'll try to reproduce here and work

I run the it_IT locale in my local session. In this case, decimal points are converted (correctly) to commas in some results (some percentages in the genotype fields of the VCF file are written as 38,4% rather than 38.4%, for example).

To fix this I have to export LC_ALL=C or unset LANG for the whole session, which is obviously not desirable.
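To check whether an existing VCF file is affected, something like the following could scan for the symptom described above (38,4% rather than 38.4%). It is a rough sketch: the function name and the sample records are made up for illustration, and the pattern only catches comma decimals that are directly followed by a percent sign.

```python
import re

# A digit, a comma, more digits, then a percent sign, e.g. "38,4%".
DECIMAL_COMMA_PERCENT = re.compile(r"\d,\d+%")


def find_decimal_comma_lines(vcf_lines):
    """Return (line_number, line) pairs for data lines with a comma-decimal percentage.

    Hypothetical helper for illustration; header lines (starting with '#') are skipped.
    """
    hits = []
    for lineno, line in enumerate(vcf_lines, start=1):
        if line.startswith("#"):
            continue
        if DECIMAL_COMMA_PERCENT.search(line):
            hits.append((lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up records: the second data line shows the locale problem.
    sample = [
        "##fileformat=VCFv4.1\n",
        "chr1\t100\t.\tA\tG\t.\tPASS\tDP=50\tGT:FREQ\t0/1:38.4%\n",
        "chr1\t200\t.\tC\tT\t.\tPASS\tDP=40\tGT:FREQ\t0/1:38,4%\n",
    ]
    for lineno, line in find_decimal_comma_lines(sample):
        print(lineno, line)
```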

chapmanb commented 11 years ago

Luca; I pushed fixes to the VarScan Java run to use English/US locales so that the VCF output will use US-style decimal formatting. Can you let me know if this fixes the issue? I found a number of previous bugs related to this in VarScan commits, so I think it might be coming from there rather than GATK. Let me know if this doesn't fix it and we can revisit.
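For reference, a JVM's default locale can be pinned from the command line with the standard `user.language` and `user.country` system properties. The sketch below shows that general approach and is not necessarily the exact change that was pushed; the helper name, jar path, and memory setting are placeholders.

```python
def varscan_java_cmd(jar_path, varscan_args, max_mem="2g"):
    """Build a java command line that forces an English/US default locale.

    Hypothetical helper for illustration; jar_path and max_mem are placeholders.
    """
    return [
        "java",
        "-Xmx%s" % max_mem,
        # JVM options must come before -jar to be applied to the JVM itself.
        "-Duser.language=en",
        "-Duser.country=US",
        "-jar",
        jar_path,
    ] + list(varscan_args)


if __name__ == "__main__":
    # Illustrative arguments only; adjust to the actual VarScan invocation.
    cmd = varscan_java_cmd("VarScan.jar", ["mpileup2snp", "--output-vcf", "1"])
    print(" ".join(cmd))
```

Setting the locale on the Java command keeps the fix local to that one tool, so nothing else in the pipeline or the user's session needs a specific locale.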

lbeltrame commented 11 years ago

On Friday 6 September 2013 07:30:36, Brad Chapman wrote:

Hello Brad,

VarScan commits so think it might be coming from there rather than GATK. Let me know if this doesn't fix it and we can revisit.

I'll have a go next Monday and report back. Feel free to ping me if I don't give any response.

lbeltrame commented 11 years ago

Sorry for not being able to test yet, I'm presenting a poster in a couple of weeks + preparing to teach a course this semester, so I'm insanely busy. I should be able to look at it next week.

chapmanb commented 11 years ago

No worries, just let us know when you get time. Good luck with all the preparation work.

lbeltrame commented 11 years ago

Confirmed fixed. Closing report.