AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License
127 stars 55 forks source link

index out of bound error related to hard clipping STILL an issue #341

Open rebber opened 3 years ago

rebber commented 3 years ago

Hi,

I have an issue that is similar to #283 and #254 but not fixed by the fixes for these (PRs #293 and #284 ). I commented on #283 last year, but maybe no-one saw it since that issue was already closed. So here it comes again.

The error is:

There was Exception while processing record on KH20201014:17728542 on region 17:41264305-41264924. The processing will be continued from the next record.
java.lang.StringIndexOutOfBoundsException: String index out of range: 71
        at java.lang.String.charAt(String.java:658)
        at com.astrazeneca.vardict.modules.CigarParser.parseCigar(CigarParser.java:523)
        at com.astrazeneca.vardict.modules.CigarParser.process(CigarParser.java:88)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
        at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:614)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1983)
        at com.astrazeneca.vardict.modes.AbstractMode.pipeline(AbstractMode.java:48)
        at com.astrazeneca.vardict.modes.SomaticMode.processBothBamsInPipeline(SomaticMode.java:111)
        at com.astrazeneca.vardict.modes.SomaticMode.notParallel(SomaticMode.java:49)
        at com.astrazeneca.vardict.VarDictLauncher.start(VarDictLauncher.java:65)
        at com.astrazeneca.vardict.Main.main(Main.java:15)

and repeated for a number of reads with the same issue.

The program just skips the read with error and then continues to the next read, so we have just ignored these errors. However, we are now sequencing to higher depths, and due to that often get >10 reads with this error in the same region. Then the Configuration.MAX_EXCEPTION_COUNT is hit and the whole program is stopped with error like:

VarDictJava fails (there were 11 continued exceptions during the run).
Critical exception occurs on region: 17:41264305-41264924, program will be stopped.
java.util.concurrent.CompletionException: java.lang.RuntimeException: java.lang.StringIndexOutOfBoundsException: String index out of range: 79

Since the program won't complete to the end of the bed file for these cases this has become an increasing problem for us.

The command I run is:

vardict-java -G human_g1k_v37_decoy.fasta -f 0.01 -N tumor -b "indel_HC_error.tumor.bam|indel_HC_error.normal.bam" -mfreq 0.01 -nmfreq 0.01  -c 1 -S 2 -E 3 -g 4 -Q 10 problem_region.bed 

I have VarDictJava v 1.8.2 (installed with conda, thus the direct "vardict-java" call). The error happens only when local realignment is used. With option "-k 0" it doesn't happen, but we really want to use that functionality. I attach a zip with minimal bams and bed which trigger the error (although <10 times, so the program is not stopped by it). minimal_bam_github.zip

I have paired reads (mapped to GRCh37) where the overlapping parts in the pairs have been trimmed with fgbio ClipBam, leaving hard clipped bases in the CIGAR. The issue happens at a very specific pattern in the CIGAR: xM3D8M3IyH (with x as num mapped and y as num hard clipped). I have tried to follow in the source code what happens to these reads, and has found the following flow for parsing the CIGAR:

I see multiple possible solutions:

Is it possible that you could add a fix for this? I am not very experienced in writing java (just reading it), and also doesn't know how my suggested solutions would go with other parts of the logic, therefore I don't want to make a pull request with any of my solutions. But hopefully you could do this in a good way.

Many thanks! Rebecka

rebber commented 2 years ago

@asmlgkj I'm not sure it's the exact same error you are having. But anyway, we solved it temporarily by making a local clone of vardict and in there increasing the value of Configuration.MAX_EXCEPTION_COUNT to a value high enough so that the number of exceptions per site wouldn't reach it, with our typical read depth. So maybe you could try something like that.

ghost commented 2 years ago

As of March 2022, there are still problems with this part of the code (CigarParser.java), from what I can see mostly related to not checking whether the index created is within the bounds of the string, eg.

        // $q1 quality of last base before deletion
        char qualityOfLastSegmentBeforeDel = queryQuality.charAt(readPositionIncludingSoftClipped - 1);

there is no check that readPositionIncludingSoftClipped is > 0 (and indeed sometimes is not). There are at least two or three more places like that.