Closed smboyle closed 8 years ago
This also happens when I use the most recent git build with last commit: 9e0365e (HEAD, v1.4.5, origin/master, origin/HEAD, master) Version 1.4.5 prepared
The error message changes slightly to the following, but the run still breaks at the same location in the BAM:
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
    at java.util.concurrent.FutureTask.get(FutureTask.java:111)
    at com.astrazeneca.vardict.VarDict.somaticParallel(VarDict.java:338)
    at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:251)
    at com.astrazeneca.vardict.VarDict.start(VarDict.java:65)
    at com.astrazeneca.vardict.Main.run(Main.java:134)
    at com.astrazeneca.vardict.Main.main(Main.java:25)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
    at java.util.concurrent.FutureTask.get(FutureTask.java:111)
    at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5356)
    at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5339)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(String.java:695)
    at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1895)
    at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2551)
    at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5334)
    at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5308)
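When comparing traces like this across builds, it can help to pull out just the innermost "Caused by" section and the first VarDict frame in it, since the outer ExecutionException layers are only wrappers added by the thread pool. The following is a minimal Python sketch of that idea; the function names `root_cause` and `first_frame_in` are made up for illustration and are not part of VarDict.

```python
import re


def root_cause(trace):
    """Return (exception, message, frames) for the innermost 'Caused by' section.

    Works on a Java stack trace whether it is printed one frame per line
    or flattened onto a single line, as long as frames are separated by 'at'.
    """
    # The last "Caused by:" section is the original failure; earlier ones wrap it.
    innermost = trace.split("Caused by: ")[-1].strip()
    # Frames are introduced by the word "at" surrounded by whitespace.
    parts = re.split(r"\s+at\s+", innermost)
    header, frames = parts[0], parts[1:]
    # The header looks like "<exception class>: <message>".
    exception, _, message = header.partition(": ")
    return exception, message, frames


def first_frame_in(frames, package):
    """Return the first frame that belongs to the given package, or None."""
    for frame in frames:
        if frame.startswith(package):
            return frame
    return None


if __name__ == "__main__":
    # Shortened copy of the trace above, for illustration only.
    trace = (
        "java.util.concurrent.ExecutionException: "
        "java.lang.StringIndexOutOfBoundsException: String index out of range: -1 "
        "at java.util.concurrent.FutureTask.get(FutureTask.java:111) "
        "Caused by: java.lang.StringIndexOutOfBoundsException: "
        "String index out of range: -1 "
        "at java.lang.String.charAt(String.java:695) "
        "at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1895)"
    )
    exception, message, frames = root_cause(trace)
    print(exception, "|", message)
    print(first_frame_in(frames, "com.astrazeneca.vardict"))
```

Applied to the two traces in this thread, this points at `parseSAM` in both cases, with only the line number in VarDict.java differing between the two builds.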
Hi @smboyle, thanks for the bug report and sorry about the issue. Before we start digging into the bug, can I ask if you've tried running the same region in the raw BAM file without any of the GATK tools? We recommend not using any GATK tools like recalibration, realignment and especially not the spliced alignment splitting tool.
Hello @mjafin: Thank you for the quick response. This is very helpful, as we thought it would be best to perform these operations. Does VarDict perform these (or similar) operations itself? I have only tried the raw STAR-aligned BAM for a few of these cases, but in those instances I did not observe the errors. We will go forward in this direction and I will let you know if we see any other errors.
Additionally, what is your stance on removing duplicate reads, for example with Picard's MarkDuplicates tool?
@smboyle if you have fairly recently generated data I don't see any reason to use BQSR. VarDict performs realignment around indels intrinsically, so there is no need for GATK realignment either. Regarding marking duplicates, I'm not sure what people usually do with RNA-seq data. We mark duplicates in DNA hybrid capture and WGS but haven't been marking them in RNA-seq. Using Picard or samblaster would do the job but you might lose a lot of data in the process.
Let us know if you get the StringIndexOutOfBoundsException with the STAR BAMs and we'll try to fix any remaining issues.
You can also use VarDict via bcbio if you don't fancy writing your own wrappers.
@smboyle I'll close this issue but if you encounter problems with raw STAR bam files let us know and we'll investigate further.
Hello,
I have been trying to run the Java version of your VarDict program on a set of tumor/normal samples that have been processed according to GATK best practices for RNA (Recal, Realign, dedup, etc.).
The samples run correctly with the Perl version of your pipeline; with the Java version, however, they break at very specific regions in the recalibrated, realigned BAM file. For example, it breaks in all samples at the KLF6 gene. There does not appear to be anything strange about this region when inspecting the BAM (~2,000x coverage and no easily observed variants in IGV, similar to other regions that pass). However, the Java version always gets hung up here.
Have you observed this in the past? Do you have any suggestions on what I could do to avoid this?
Here is an example of my command:
VarDict -th 4 \
    -G hs37d5.fa \
    -N {Samp}_spiked_UC3_RNA \
    -b "StarAligned_RDSQ_Recal_Final.bam|Core_DNA_Normal_Merged.bam" \
    -C -c 1 -S 2 -E 3 \
    target_regions/10a.bed
hs37d5 is effectively b37 + decoys. The region in question is on chr 10.
And here is the resulting error position:
1 30.9 1 60.0 26.000 0.3171 0 1.5 0 1.000 1 GAACTGCACGCTAGGGAAGG GAATGACCAGAACGCAAAAG 10:3208380-3208592 Deletion SNV HCC1187_spiked_UC3_RNA 10 10 3212392 3212392 T A 169 2 49 118 1 1 T/A 0.0118 2;2 7.0 0 34.0 1 60.0 4.000 0.0127 0 1.0 260 0 127 133 0 0 T/A 0 2;0 48.0 0 16.0 0 60.0 0.000 0 0 2.0 0 2.000 2 CATCGCCACGCTCTGTGGTG GCATGTCTAGATTAAAAGTC 10:3212295-3212472 StrongSomatic SNV HCC1187_spiked_UC3_RNA 10 10 3214369 3214369 A G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 35 43 12 28 7 A/G 0.3889 2;2 27.5 1 37.1 1 60.0 34.000 0.3953 0 1.1 0 2.000 1 TGGGCCGGCAGCGCTGAACT TTACCTTATATTAAACACGG 10:3214342-3214531 Deletion SNV HCC1187_spiked_UC3_RNA 10 10 3214498 3214498 C T 3 0 1 2 0 0 C/C 0 2;0 22.3 1 34.3 1 60.0 6.000 1.0000 0 0.3 168 2 66 100 1 1 C/T 0.0119 2;2 53.0 1 32.5 1 60.0 4.000 0.0120 0 2.0 0 2.000 2 GGAGGCGTCAGTTCTCTGAA ACAGCGAGACTTTAAGTATG 10:3214342-3214531 StrongLOH SNV HCC1187_spiked_UC3_RNA 10 10 3214516 3214516 A G 1 1 0 0 0 1 G/G 1.0000 0;0 15.0 0 36.0 0 60.0 2.000 1.0000 0 1.0 155 68 30 57 23 45 A/G 0.4387 2;2 36.3 1 35.7 1 60.0 21.667 0.4333 0 1.1 0 1.000 1 AACACAGCGAGACTTTAAGT TGGGCTGTGGGCGCCTCGGG 10:3214342-3214531 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3214995 3214996 AC A 5 3 1 1 1 2 C/-1 0.6000 2;2 45.7 1 35.3 1 60.0 6.000 0.6000 0 0.0 268 125 75 68 79 46 C/-1 0.4664 2;2 34.5 1 35.4 1 60.0 24.000 0.4688 0.0112 0.4 4 5.000 1 GAGCACCTGGCTGGCGAGGA CCCCTCTTAACCCGACGCAG 10:3214887-3215100 Germline Deletion java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at com.astrazeneca.vardict.VarDict.somaticParallel(VarDict.java:341) at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:254) at com.astrazeneca.vardict.VarDict.start(VarDict.java:68) at com.astrazeneca.vardict.Main.run(Main.java:134) at 
com.astrazeneca.vardict.Main.main(Main.java:25) Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5342) at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5325) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.charAt(String.java:695) at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1881) at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2537) at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5320) at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5294)
To assist in determining the issue, I have isolated a BAM region which has this problem. I have also run this snippet of the BAM on my end to demonstrate that the issue is there. There is an additional line mentioning "Ignoring SAM validation error". This is due to my extracting the region and was not in the original issue.
` /hpc/env/dev/apps/vardict_java/1.4.3/build/install/VarDict/bin/VarDict -th 4 \
HCC1187_spiked_UC3_RNA 10 10 3190265 3190265 A G 8 4 2 2 1 3 A/G 0.5000 2;2 24.5 1 36.0 1 60.0 8.000 0.5000 0 2.0 166 85 43 38 40 45 A/G 0.5120 2;2 33.2 1 36.4 1 60.0 170.000 0.5152 0 1.9 0 1.000 1 AAAAGCACACCCTGAACCAA GAAAACACAGAAGAAAGGAA 10:3190173-3190483 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3190298 3190298 C T 7 5 1 1 2 3 C/T 0.7143 2;2 35.6 1 37.0 0 60.0 10.000 0.7143 0 1.6 172 91 40 41 45 46 C/T 0.5291 2;2 26.9 1 33.8 1 60.0 44.500 0.5298 0 1.9 2 1.000 1 GAAAGGAATTAGGGCCGAAT GAACAGTGACAGACACTGGA 10:3190173-3190483 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3200249 3200249 C T 142 104 14 24 46 58 C/T 0.7324 2;2 15.0 1 35.9 1 60.0 33.667 0.7372 0 1.6 101 39 38 24 23 16 C/T 0.3861 2;2 32.0 1 33.7 1 60.0 18.500 0.3737 0 1.9 0 1.000 1 CACTCAACTACTTCATCAAT GTTCTGTCTATGAGGCTTCT 10:3200186-3200426 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3200292 3200292 G A 150 108 23 19 54 54 G/A 0.7200 2;2 29.7 1 35.1 1 60.0 20.600 0.7203 0 1.6 116 47 37 32 26 21 G/A 0.4052 2;2 31.6 1 32.9 1 60.0 94.000 0.4052 0 1.7 1 1.000 1 CGGTCTCAATGTCTTTCTCC CAATCCCTTGGAGGCCGACA 10:3200186-3200426 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3201097 3201097 T C 2 1 1 0 1 0 T/C 0.5000 0;0 42.0 0 37.0 0 60.0 2.000 0.5000 0 1.0 169 87 60 22 56 31 T/C 0.5148 2;2 28.3 1 35.5 1 60.0 20.750 0.5123 0 1.4 0 3.000 1 CTACTGATCTAATCTAATCA AAACTCACCCAACATCAGGA 10:3201094-3201280 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3202065 3202065 T C 418 256 93 68 145 111 T/C 0.6124 2;2 22.9 1 35.6 1 60.0 35.571 0.6103 0 1.2 247 111 98 37 81 30 T/C 0.4494 2;2 31.0 1 33.1 1 60.0 17.500 0.4412 0 1.6 0 2.000 1 TAAGAGGAAGCTAACGCTGA GGTTGTTTGTTTAGAGGGAT 10:3202031-3202237 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3202140 3202140 C T 14 6 3 5 2 4 C/T 0.4286 2;2 46.5 1 35.7 1 60.0 12.000 0.4286 0 1.5 341 153 95 93 72 81 C/T 0.4487 2;2 31.8 1 34.0 1 60.0 75.500 0.4507 0 1.5 0 2.000 5 GAATTCCCTCTGTAAAATGA GTGACGTTGTGAGTAGGCAC 10:3202031-3202237 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3206027 3206027 A G 212 121 43 47 61 
60 A/G 0.5708 2;2 26.5 1 35.4 1 60.0 29.250 0.5707 0 1.2 58 33 6 19 9 24 A/G 0.5690 2;2 33.5 1 34.4 1 60.0 66.000 0.5690 0 1.2 0 2.000 1 ACCACTGAGTACGTGTGGTC GGAAGAAGTCTGTTCTGAAG 10:3205874-3206107 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3208557 3208557 A G 245 170 38 37 82 88 A/G 0.6939 2;2 14.2 1 35.5 1 59.7 33.000 0.6904 0 1.1 68 31 13 24 14 17 A/G 0.4559 2;2 23.6 1 33.3 1 60.6 6.750 0.4286 0 1.6 0 1.000 1 CCAGTACTGTCCATGGGAGT GTACGGAACTGCACGCTAGG 10:3208380-3208592 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3208567 3208592 T TGCACGCTAGGGAAGAGAGAGGAATG 274 0 144 129 0 0 T/T 0 2;0 5.8 1 35.9 1 59.6 33.125 1.0000 0 0.7 69 27 16 26 9 18 T/+25 0.3913 2;2 28.1 1 34.1 1 60.4 26.000 0.3939 0.3188 1.3 14 2.000 1 CCATGGGAGTAGTACGGAAC GCACGCTAGGGAAGGAGAAT 10:3208380-3208592 StrongLOH Insertion HCC1187_spiked_UC3_RNA 10 10 3208583 3208583 A C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 43 13 9 20 0 13 A/C 0.3023 2;1 25.5 1 30.9 1 60.0 26.000 0.3171 0 1.5 0 1.000 1 GAACTGCACGCTAGGGAAGG GAATGACCAGAACGCAAAAG 10:3208380-3208592 Deletion SNV HCC1187_spiked_UC3_RNA 10 10 3214369 3214369 A G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 35 43 12 28 7 A/G 0.3889 2;2 27.5 1 37.1 1 60.0 34.000 0.3953 0 1.1 0 2.000 1 TGGGCCGGCAGCGCTGAACT TTACCTTATATTAAACACGG 10:3214342-3214531 Deletion SNV HCC1187_spiked_UC3_RNA 10 10 3214516 3214516 A G 1 1 0 0 0 1 G/G 1.0000 0;0 15.0 0 37.0 0 60.0 2.000 1.0000 0 1.0 155 68 30 57 23 45 A/G 0.4387 2;2 36.3 1 35.7 1 60.0 21.667 0.4333 0 1.1 0 1.000 1 AACACAGCGAGACTTTAAGT TGGGCTGTGGGCGCCTCGGG 10:3214342-3214531 Germline SNV HCC1187_spiked_UC3_RNA 10 10 3214995 3214996 AC A 5 3 1 1 1 2 C/-1 0.6000 2;2 45.7 1 37.0 0 60.0 6.000 0.6000 0 0.0 268 125 75 68 79 46 C/-1 0.4664 2;2 34.5 1 35.4 1 60.0 24.000 0.4688 0.0112 0.4 4 5.000 1 GAGCACCTGGCTGGCGAGGA CCCCTCTTAACCCGACGCAG 10:3214887-3215100 Germline Deletion Ignoring SAM validation error: ERROR: Read name HWI-D00528:212:C81CAANXX:1:1314:13296:35671, No real operator (M|I|D|N) in CIGAR 
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
    at java.util.concurrent.FutureTask.get(FutureTask.java:111)
    at com.astrazeneca.vardict.VarDict.somaticParallel(VarDict.java:341)
    at com.astrazeneca.vardict.VarDict.nonAmpVardict(VarDict.java:254)
    at com.astrazeneca.vardict.VarDict.start(VarDict.java:68)
    at com.astrazeneca.vardict.Main.run(Main.java:134)
    at com.astrazeneca.vardict.Main.main(Main.java:25)
Caused by: java.util.concurrent.ExecutionException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
    at java.util.concurrent.FutureTask.get(FutureTask.java:111)
    at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5342)
    at com.astrazeneca.vardict.VarDict$SomdictWorker.call(VarDict.java:5325)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(String.java:695)
    at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1881)
    at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2537)
    at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5320)
    at com.astrazeneca.vardict.VarDict$ToVarsWorker.call(VarDict.java:5294)`
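The "No real operator (M|I|D|N) in CIGAR" validation message in the output above suggests that at least one read in the extracted region has a CIGAR made up only of clipping or padding operations. A rough way to list such reads from SAM text is sketched below in Python. It assumes standard tab-separated SAM records (read name in column 1, CIGAR in column 6), the helper names are hypothetical, and the set of "real" operators is simply taken from the message. Whether these reads are related to the StringIndexOutOfBoundsException is not established in this thread.

```python
import re

# Operators the validation message treats as "real"; taken from the message text.
REAL_OPS = set("MIDN")

# One CIGAR element: a length followed by a single operator character.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def has_real_operator(cigar):
    """True if the CIGAR string contains at least one M, I, D or N operation."""
    return any(op in REAL_OPS for _, op in CIGAR_TOKEN.findall(cigar))


def reads_without_real_operator(sam_lines):
    """Yield (read name, CIGAR) for records whose CIGAR has no M/I/D/N."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a complete alignment record
        name, cigar = fields[0], fields[5]
        if cigar == "*":
            continue  # CIGAR unavailable, typically an unmapped read
        if not has_real_operator(cigar):
            yield name, cigar


if __name__ == "__main__":
    import sys

    # Example: samtools view -h region.bam | python this_script.py
    for name, cigar in reads_without_real_operator(sys.stdin):
        print(name, cigar, sep="\t")
```

Running the extracted region through a check like this, and comparing against the raw STAR-aligned BAM, would show whether the flagged reads were introduced by the region extraction or by an earlier processing step.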
Vardict_Specific_Issue_Region.zip
Thank you for your assistance.
Cheers, Sean