chapmanb / bcbio.variation

Toolkit to analyze genomic variation data, built on the GATK with Clojure

format of input files: java.lang.NumberFormatException: For input string: #24

Open zeneofa opened 9 years ago

zeneofa commented 9 years ago

Hi,

I am trying to compare a set of VCF files to a set of confirmed SNPs from a Genome in a Bottle database. I do not have access to the raw FASTQ files, so I am unsure which filters were applied during mapping. I only have a set of BAM files, VCF files, and a BED region file, so I also don't know what post-mapping alterations have been performed.

I have tried to run:

java -jar ~/Downloads/bcbio.variation-0.2.1-standalone.jar variant-compare ref-grading.yaml

where my ref-grading.yaml file contains the following:

dir:
  out: grading
  prep: grading/prep
experiments:

I get the following error (I am not familiar with Java, though):

2015-01-12 16:48:18,299 [INFO ] MLog clients using log4j logging.
2015-01-12 16:48:18,760 [INFO ] State :begin :: {:desc "Starting variation analysis"}
2015-01-12 16:48:18,788 [INFO ] State :clean :: {:desc "Cleaning input VCF: reference"}
2015-01-12 16:48:18,789 [INFO ] State :merge :: {:desc "Merging multiple input files: reference"}
2015-01-12 16:48:18,790 [INFO ] State :prep :: {:desc "Prepare VCF, resorting to genome build: reference"}
"ava.lang.NumberFormatException: For input string: "14596
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:492)
    at java.lang.Integer.parseInt(Integer.java:527)
    at bcbio.align.ref$prep_bedline_sort$fn1333.invoke(ref.clj:85)
    at bcbio.align.ref$sort_bed_file$fn1338$fn1339$fn1344.invoke(ref.clj:98)
    at clojure.core$sort_by$fn4299.invoke(core.clj:2769)
    at clojure.lang.AFunction.compare(AFunction.java:49)
    at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
    at java.util.TimSort.sort(TimSort.java:203)
    at java.util.TimSort.sort(TimSort.java:173)
    at java.util.Arrays.sort(Arrays.java:659)
    at clojure.core$sort.invoke(core.clj:2754)
    at clojure.core$sort_by.invoke(core.clj:2769)
    at clojure.core$sort_by.invoke(core.clj:2767)
    at bcbio.align.ref$sort_bed_file$fn1338$fn1339.invoke(ref.clj:99)
    at bcbio.align.ref$sort_bed_file$fn1338.invoke(ref.clj:97)
    at bcbio.align.ref$sort_bed_file.invoke(ref.clj:96)
    at bcbio.run.broad$gatk_cl_intersect_intervals$fn1816.invoke(broad.clj:56)
    at clojure.core$map$fn4207.invoke(core.clj:2487)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:60)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$map$fn4207.invoke(core.clj:2479)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:60)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$tree_seq$walk4647$fn4648.invoke(core.clj:4475)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:60)
    at clojure.lang.LazySeq.more(LazySeq.java:96)
    at clojure.lang.RT.more(RT.java:607)
    at clojure.core$rest.invoke(core.clj:73)
    at clojure.core$flatten.invoke(core.clj:6478)
    at bcbio.run.broad$gatk_cl_intersect_intervals.doInvoke(broad.clj:56)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at bcbio.variation.filter.intervals$select_by_sample.doInvoke(intervals.clj:56)
    at clojure.lang.RestFn.invoke(RestFn.java:846)
    at bcbio.variation.combine$dirty_prep_work$run_sample_select1157.invoke(combine.clj:140)
    at bcbio.variation.combine$dirty_prep_work.invoke(combine.clj:155)
    at bcbio.variation.combine$gatk_normalize.invoke(combine.clj:187)
    at bcbio.variation.compare$prepare_vcf_calls$fn7526.invoke(compare.clj:120)
    at clojure.core$map$fn4207.invoke(core.clj:2487)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:60)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.lang.LazilyPersistentVector.create(LazilyPersistentVector.java:31)
    at clojure.core$vec.invoke(core.clj:354)
    at bcbio.variation.compare$prepare_vcf_calls.invoke(compare.clj:121)
    at bcbio.variation.compare$variant_comparison_from_config$iter75827586$fn__7587.invoke(compare.clj:255)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:60)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$tree_seq$walk4647$fn4648.invoke(core.clj:4475)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:60)
    at clojure.lang.LazySeq.more(LazySeq.java:96)
    at clojure.lang.RT.more(RT.java:607)
    at clojure.core$rest.invoke(core.clj:73)
    at clojure.core$flatten.invoke(core.clj:6478)
    at bcbio.variation.compare$variant_comparison_from_config.invoke(compare.clj:254)
    at bcbio.variation.compare$_main.invoke(compare.clj:274)
    at clojure.lang.AFn.applyToHelper(AFn.java:161)
    at clojure.lang.AFn.applyTo(AFn.java:151)
    at clojure.core$apply.invoke(core.clj:617)
    at bcbio.variation.core$_main.doInvoke(core.clj:35)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at bcbio.variation.core.main(Unknown Source)

I have no idea how to start debugging this. Is there some input file format that I am not aware of? Must my reference.fa be truncated to the same chromosomes as indicated in the BED file?

My Aim: To get a good estimate of the false positive/negative rate, as well as possible factors influencing these (such as coverage, entropy of neighbouring regions, mapping quality etc).

Additional information: from the header of the VCF file, the reference appears to be hg19 UCSC (which is what I used). It also appears that the additional chromosomes have been removed from the header and the call list in the VCF file (i.e. only chr1-22 + X + Y). The ref.vcf and BED file were downloaded and appear to use the same UCSC naming convention. My reference is indexed and a GATK dictionary file exists. Java version: JDK 1.7.0_45. CentOS cluster with a Lustre file system.

Kind Regards, Piet Jones

chapmanb commented 9 years ago

Piet; Thanks for trying out bcbio.variation and for the very complete report. It looks like something is unexpected with your bed file ref.bed. Specifically, do the start/end coordinates in the file contain quotes around them? It looks like we're complaining about "14596 being present as either the start or end of one of the lines. If you clean that up, hopefully it'll continue without any issues and get you the comparison info. Hope this helps.

zeneofa commented 9 years ago

Hi Brad,

Thanks for the very quick reply. Getting my feet wet with variant calling atm :)

I have grep'ed every possible file for a quote followed by that number, but found nothing. I have also grep'ed without the quote and ensured that what looked like spaces are actually tabs. But still nothing...

P


chapmanb commented 9 years ago

Piet; Would you be able to provide your BED file input as a Gist (https://gist.github.com/) or send to me directly? Maybe we'll be able to figure out the underlying issue by looking at it. Sorry to not have any better ideas right now but hopefully this'll help us get things running for you.

zeneofa commented 9 years ago

Hi Brad,

Unfortunately I can't share the bed file, the data I am using does not belong to me and I don't have permission to share it :(

Is there a specific BED format that is required, BED6 or BED12? My current BED file contains only the first three columns, and grep reveals that the offending line could be line 2 (i.e. it contains the 1456). Is a header also required for the BED format?

Sorry about the inconvenience with the file sharing.

Kind Regards, Piet Jones


chapmanb commented 9 years ago

Piet; bcbio doesn't have any special requirements for headers or columns. Where it is failing it is only trying to split the line by tabs and then take the first 3 columns, then turn the start and end coordinates into integers. I can't do much without being able to see the file but guessing: if there are strange line endings or other non-standard characters in there, maybe that is what is causing the issue. Hope this helps some.
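The parsing step described above (split on tabs, take the first three columns, convert start and end to integers) can be mimicked to locate a problem line. This is a minimal Python sketch, not bcbio.variation's own code; `find_bad_bed_lines` is a hypothetical helper name, and the guess that the culprit is a stray carriage return or space is an assumption the thread does not confirm at this point. A strict regular expression is used because Python's `int()` tolerates surrounding whitespace, whereas Java's `Integer.parseInt` does not.

```python
import re

# Strict check: ASCII digits only, no surrounding whitespace or quotes.
PLAIN_INTEGER = re.compile(r"[0-9]+")


def find_bad_bed_lines(lines):
    """Return (line_number, repr_of_line) for lines whose start/end are not plain integers.

    Hypothetical helper for illustration; it imitates a strict integer parse.
    Header lines such as "track" or "browser" would also be reported.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        # Remove only the "\n" terminator so a stray "\r" or space stays visible.
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not all(
            PLAIN_INTEGER.fullmatch(field) for field in fields[1:3]
        ):
            # repr() makes invisible characters such as "\r" explicit.
            bad.append((number, repr(line)))
    return bad


# Made-up example lines: the second ends in "\r\n", the third uses spaces, not tabs.
example = ["chr1\t100\t200\n", "chr1\t14000\t14596\r\n", "chr2 5 10\n"]
for number, shown in find_bad_bed_lines(example):
    print(number, shown)
```

On a real file, opening it with `open(path, newline="")` keeps Python from translating line endings before the check runs.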

zeneofa commented 9 years ago

Hi Brad,

Solved the problem: I parsed the BED file with a Python script (removing newlines and splitting the respective lines), which removed the offending item. Now, however, I get this:

INFO 13:52:38,856 HelpFormatter - Date/Time: 2015/01/15 13:52:38
INFO 13:52:38,856 HelpFormatter -
INFO 13:52:38,856 HelpFormatter -
INFO 13:52:39,702 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:52:39,768 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
WARN 13:52:39,786 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-case1-nomnp-nosv.vcf.idx because an IOException occurred with message: Function not implemented.
INFO 13:52:39,788 RMDTrackBuilder - Could not acquire a shared lock on index file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-case1-nomnp-nosv.vcf.idx, falling back to using an in-memory index for this GATK run.
WARN 13:52:41,002 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-reference-nomnp-nosv.vcf.idx because an IOException occurred with message: Function not implemented.
INFO 13:52:41,003 RMDTrackBuilder - Could not acquire a shared lock on index file /lustre/SCRATCH5/users/pjones/data_files/bcbio-variation/grading/prep/NA00001-reference-nomnp-nosv.vcf.idx, falling back to using an in-memory index for this GATK run.
INFO 13:52:43,298 IntervalUtils - Processing 64190747 bp from intervals
INFO 13:52:43,370 GenomeAnalysisEngine - Preparing for traversal
INFO 13:52:43,421 GenomeAnalysisEngine - Done preparing for traversal
INFO 13:52:43,421 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:52:43,421 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 13:52:43,421 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
org.broadinstitute.gatk.utils.exceptions.UserException$BadInput: Bad input: Samples entered on command line (through -sf or -sn) that are not present in the VCF.

A list of these samples:

NA00001

To ignore these samples, run with --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES
    at org.broadinstitute.gatk.tools.walkers.variantutils.SelectVariants.initialize(SelectVariants.java:365)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at bcbio.run.broad$run_gatk$fn__1805.invoke(broad.clj:34)
    at bcbio.run.broad$run_gatk.invoke(broad.clj:31)


chapmanb commented 9 years ago

Piet; Thanks much for following up and for the details about the line endings. I pushed a fix which should handle this for future files by stripping off stray whitespace.
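For anyone on a release without that fix, a similar cleanup can be applied to the BED file before running the comparison. The following is a rough Python sketch of the idea (strip stray whitespace from every tab-separated field and normalise line endings). It is neither the script used earlier in this thread nor the change pushed to bcbio.variation, and `clean_bed_lines` and `clean_bed_file` are hypothetical names.

```python
def clean_bed_lines(lines):
    """Yield BED lines with stray whitespace stripped from each tab-separated field."""
    for line in lines:
        stripped = line.strip()  # removes "\r", "\n" and surrounding spaces
        if not stripped:
            continue  # skip blank lines entirely
        fields = [field.strip() for field in stripped.split("\t")]
        yield "\t".join(fields) + "\n"


def clean_bed_file(in_path, out_path):
    """Write a cleaned copy of in_path to out_path with Unix line endings."""
    # newline="" on read disables newline translation so "\r" is handled by strip();
    # newline="\n" on write keeps the output endings as plain "\n".
    with open(in_path, newline="") as src, open(out_path, "w", newline="\n") as dst:
        dst.writelines(clean_bed_lines(src))
```

Writing to a separate output path rather than editing in place keeps the original file available for comparison.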

For your second problem, it looks like you used the example sample name (NA00001) in the input YAML, where you probably want it to match the actual names of the samples in the VCF files. If you want, bcbio.variation can fix that for you by setting fix-sample-header: true:

https://github.com/chapmanb/bcbio.variation#configuration-file
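To see which sample names a VCF actually declares, and so what the sample name in the YAML would need to match, the `#CHROM` header line can be inspected. This is a small Python sketch that relies only on the standard VCF column layout; `vcf_sample_names` is a hypothetical helper, not part of bcbio.variation or GATK.

```python
def vcf_sample_names(lines):
    """Return the sample names from the #CHROM header line of a VCF, or [] if none."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\r\n").split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, samples start at column 10.
            return columns[9:]
        if not line.startswith("#"):
            break  # reached variant records without finding a header line
    return []


# Made-up header for illustration; real files would be read with open(path).
header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB\n",
]
print(vcf_sample_names(header))
print("NA00001" in vcf_sample_names(header))
```

If the name given in the YAML is not in the returned list, GATK's SelectVariants would be expected to raise the "not present in the VCF" error shown earlier in the thread.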

Hope this helps get you going and thanks again for all the help debugging this.