broadinstitute / viral-ngs

Viral genomics analysis pipelines
Other
187 stars 67 forks source link

snpEff requires LC_ALL=C #423

Closed haydenm closed 8 years ago

haydenm commented 8 years ago

snpEff gave me an error related to coordinates. Changing LC_ALL from en_US.utf8 to C fixed this. The issue might relate to a use of GNU sort. Perhaps we can set LC_ALL to C for just the call to snpEff?

yesimon commented 8 years ago

I kind of want to set LC_ALL=C for everything because it's a bug that trip things up relating to gnu sort, and isn't super obvious. On Thu, Jul 28, 2016 at 8:09 PM haydenm notifications@github.com wrote:

snpEff gave me an error related to coordinates. Changing LC_ALL from en_US.utf8 to C fixed this. The issue might relate to a use of GNU sort. Perhaps we can set LC_ALL to C for just the call to snpEff?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/viral-ngs/issues/423, or mute the thread https://github.com/notifications/unsubscribe-auth/AAku78X2RM55bMaGLIHTd4SEK2FS_BhEks5qaUS6gaJpZM4JXwvx .

dpark01 commented 8 years ago

I'd love to have a better understanding across the board of which steps actually depend on LC_ALL for proper operation. I'm surprised that snpEff is one of them because I had no idea that Java programs would change behavior based on that variable.

@tomkinsc do you remember if utf8 was required by any particular step or was it that it needed to be set to something/anything for deterministic behavior?

Any place we uncover a dependency on a specific LC_ALL value for proper behavior should have a defensive unit test added that fails on a different value. For now, let's just start with snpEff and deal with specific cases as we discover them.

haydenm commented 8 years ago

I'm confused now because I just tried to replicate this and could not (I did not try before posting). Here was the error I got:

2016-07-28 19:01:16,416 - snpeff:61:execute - DEBUG - /idi/sabeti-scratch/hmetsky/viral-ngs/viral-ngs-etc/conda-env/bin/snpEff -Xmx4g -Djava.io.tmpdir=/tmp/2428303.1.short/tmp-interhost-snpEff-yn_v_qrx ann -treatAllAsProteinCoding false -t -noLog -ud 0 -noStats -noShiftHgvs d40f92b7319a741b3cebf4ecaa016b7e33e118034ac3f0d15fde971 /idi/sabeti-scratch/hmetsky/viral-work/results/viral-ngs/zika/clinical-samples/zika_brazil_fiocruz-thiago/data/03_align_to_ref/B4-HS-hiseq.for_snpeff.vcf.gz [E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: "Error: Error while processing VCF entry (line 289) :" [E::hts_idx_push] chromosome blocks not continuous

The one change I made after seeing this was export LC_ALL=C before re-running and having it succeed. I notice now that one of the top lines of conda-env/bin/snpEff is export LC_ALL=en_US.UTF-8, so it might be involved somehow. Sorry for the confusion; I'm not sure why I can't replicate this, and I don't have anything to add to help replicate it.

dpark01 commented 8 years ago

This may be a separate issue then? Sounds as if your VCF (which you made yourself) is not sorted -- it's complaining that your variants are not grouped by chromosome

Daniel J. Park, PhD Group Leader, Viral Computational Genomics Broad Institute of MIT and Harvard Tel: +1-617-714-8526 dpark@broadinstitute.org http://www.broadinstitute.org/bios/daniel-park

On Jul 29, 2016, at 9:01 AM, haydenm notifications@github.com wrote:

I'm confused now because I just tried to replicate this and could not (I did not try before posting). Here was the error I got:

2016-07-28 19:01:16,416 - snpeff:61:execute - DEBUG - /idi/sabeti-scratch/hmetsky/viral-ngs/viral-ngs-etc/conda-env/bin/snpEff -Xmx4g -Djava.io.tmpdir=/tmp/2428303.1.short/tmp-interhost-snpEff-yn_v_qrx ann -treatAllAsProteinCoding false -t -noLog -ud 0 -noStats -noShiftHgvs d40f92b7319a741b3cebf4ecaa016b7e33e118034ac3f0d15fde971 /idi/sabeti-scratch/hmetsky/viral-work/results/viral-ngs/zika/clinical-samples/zika_brazil_fiocruz-thiago/data/03_align_to_ref/B4-HS-hiseq.for_snpeff.vcf.gz [E::get_intv] failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: "Error: Error while processing VCF entry (line 289) :" [E::hts_idx_push] chromosome blocks not continuous

The one change I made after seeing this was export LC_ALL=C before re-running and having it succeed. I notice now that one of the top lines of conda-env/bin/snpEff is export LC_ALL=en_US.UTF-8, so it might be involved somehow. Sorry for the confusion; I'm not sure why I can't replicate this, and I don't have anything to add to help replicate it.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

haydenm commented 8 years ago

I think my VCF is sorted.. you can take a look. It was made by the ref_guided_consensus rule (so assembly::refine_assembly), and I pulled out just the accession number (no version) from the chromosome name so snpEff could read it. I never sorted it manually.

haydenm commented 8 years ago

Unless someone else if able to easily replicate, I'd ignore this and close it. (But keep it in mind in case this error comes up again.) I thought it was straightforward, but I guess not as I can't re-create the issue. Regardless, I haven't had any issues running snpEff through the normal snakemake pipeline under the all_intrahost rule.

tomkinsc commented 8 years ago

The conda-env/bin/snpEff wrapper doesn't require LC_ALL=en_US.UTF-8 as far as I know; it was just carried over from the varscan script it originated from. That said, using UTF-8 is good practice since English is the only language that can be correctly represented by the C charset. We do have collaborators in unicode countries (ex. Sénégal), so I don't think it's totally out of line. Tools that depends on a particular locale (i.e. those that depend on English sort order rather than unicode byte order from GNU sort) should really be enforcing locale themselves, but in cases where UTF-8 breaks things we can set the call locale to C and add a defensive unit test as Danny suggests.

dpark01 commented 8 years ago

I agree with that. UTF8 generally, C for specific tools as necessary (and I can imagine that some might need it). Let's close this issue and address it the next time we have a reproducible scenario