AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License
127 stars 56 forks source link

Java version with bed fails where Perl or Java with -R option succeeds #36

Closed sgosline closed 8 years ago

sgosline commented 8 years ago

Hello,

I have been running the Perl version of varDict before I realized I can switch to the Java version. Running nearly identical commands:

./build/install/VarDict/bin/VarDict -G ucsc.hg19.fasta -f 0.01 -N sampleName -b "../SL107646_rg_ordered.bam|../SL107661_rg_ordered.bam" -z -c 1 -S 2 -E 3 -g 4 /home/ubuntu/VarDict/test.bed

However, this gives me the following error: Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at htsjdk.samtools.Cigar.getCigarElement(Cigar.java:57) at com.astrazeneca.vardict.VarDict.parseSAM(VarDict.java:1494) at com.astrazeneca.vardict.VarDict.toVars(VarDict.java:2552) at com.astrazeneca.vardict.VarDict.ampVardictNotParallel(VarDict.java:5514) at com.astrazeneca.vardict.VarDict.start(VarDict.java:60) at com.astrazeneca.vardict.Main.run(Main.java:134) at com.astrazeneca.vardict.Main.main(Main.java:25) When I replace the test.bed file with a -R and the region from that same bed file, everything works fine. This leads me to believe that there is no sam/bam processing issue (I already added read groups and ordered and indexed the bams). Could there be another underlying issue?

thanks, sara

mjafin commented 8 years ago

Hi @sgosline, thanks for your interest in VarDict and thanks for reporting the issue. What does the region look like in the bed (literally)?

I think the -R option uses 1-based end-inclusive format whereas bed is 0-based end-exclusive so I wonder if the bed line could be out of whack

sgosline commented 8 years ago

Hello,

You're welcome!

I have toggled the -z function and get the same error. Here is the test.bed file (I get it from any other region cut from the same larger bed file):

chr17 29421944 29704695 uc002hgg.3 0 + 29422327 29701173 0 58 443,144,84,191,107,68,76,158,174,123,75,132,135,114,80,124,156,250,74,84,441,140,123,84,117,182,212,162,104,136,63,159,98,147,147,111,433,341,203,194,141,280,215,62,115,102,141,127,132,136,158,123,131,101,143,47,217,3665, 0,61056,64083,68259,74964,86495,86783,87581,105495,106110,106484,111313,119524,124078,126923,128517,130168,131508,132291,132596,134098,134908,135333,135915,137146,137773,138075,140684,140991,154057,158011,163417,164105,165442,166784,170302,230893,232572,235369,239911,241406,241708,242441,242892,243098,243777,245578,248082,254193,255256,257330,261533,262033,262342,263553,264042,265560,279086,

Like I said, it works fine with the perl script! sara

mjafin commented 8 years ago

Just out of curiosity, are your samples from PCR amplicon based sequencing? The 7th and 8th column have a special purpose for that. If it's not PCR amplicon sequencing could you use cut -f1-3 on the bed file and see if that works?

sgosline commented 8 years ago

No, I don't need the rest, it was just what came from UCSC. Cutting the file seems to work, I should have tried that. I will verify and close the issue (unless you want to alter the code?). Thank you so much for your prompt reply!

mjafin commented 8 years ago

@sgosline sorry about the issue. We didn't foresee someone accidentally using the 7th and 8th column for some other numeric uses in variant calling. They're reserved for "amplicon awareness" in VarDict in PCR amplicon sequencing but it's difficult to tell if the columns are used for something else.