PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
257 stars 71 forks source link

error after variant calling using gridss-0.11.4-jar-with-dependencies #30

Closed alkodsi closed 8 years ago

alkodsi commented 8 years ago

Exception in thread "foo.bam-Coverage" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.InterruptedException at au.edu.wehi.idsv.util.AsyncBufferedIterator$ReaderRunnable.run(AsyncBufferedIterator.java:136) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: java.lang.InterruptedException at htsjdk.samtools.util.BlockCompressedInputStream.checkAndRethrowDecompressionException(BlockCompressedInputStream.java:422) at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:409) at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:147) at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:286) at java.io.DataInputStream.read(DataInputStream.java:149) at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:404) at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380) at htsjdk.samtools.util.BinaryCodec.readByteBuffer(BinaryCodec.java:490) at htsjdk.samtools.util.BinaryCodec.readInt(BinaryCodec.java:501) at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:194) at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:661) at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:635) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:629) at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:599) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519) at au.edu.wehi.idsv.util.AutoClosingIterator.next(AutoClosingIterator.java:55) at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:68) at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:54) at htsjdk.samtools.filter.FilteringIterator.getNextRecord(FilteringIterator.java:130) at htsjdk.samtools.filter.FilteringIterator.next(FilteringIterator.java:105) at htsjdk.samtools.filter.FilteringIterator.next(FilteringIterator.java:45) at au.edu.wehi.idsv.util.AutoClosingIterator.next(AutoClosingIterator.java:55) at au.edu.wehi.idsv.util.AsyncBufferedIterator$ReaderRunnable.run(AsyncBufferedIterator.java:124) ... 1 more Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:400) at htsjdk.samtools.util.BlockCompressedInputStream$AysncUtil.getNextBlock(BlockCompressedInputStream.java:651) at htsjdk.samtools.util.BlockCompressedInputStream$AysncUtil.getNextBlock(BlockCompressedInputStream.java:662) at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:404) ... 23 more

d-cameron commented 8 years ago

Thank you for the bug report. Is the output file still written? There is a known bug in which an exception is thrown is sometimes thrown when when closing a stream but by this point the output file has been written. Does this hold true for this case? A full fix is waiting on the superior implementation https://github.com/samtools/htsjdk/pull/576 to be accepted into htsjdk.

alkodsi commented 8 years ago

Thank you for fast response and wonderful tool. Yes the vcf file is successfully written. I wonder what analyses are downstream that I am missing because of this error. I have some questions about gridss. Is there a recommended tool for annotating breakpoints produced by gridss? How can I know the SV type of the breakpoint, all my breakpoints have type "BND", what does it mean?

d-cameron commented 8 years ago

The VCF output is the final step, if it exists,the error can be safely ignored. GRIDSS is designed to be able to be rerun from any point after being killed so the VCF (and all intermediate files) are written first to a tmp file, then if successful, the tmp file is renamed to the correct file name. This prevents partially written files being treated as the full output when an error occurs.

d-cameron commented 8 years ago

GRIDSS writes SV breakpoints using the VCFv4.1 (https://samtools.github.io/hts-specs/VCFv4.1.pdf) breakpoint BND notation. Fundamentally, GRIDSS is a breakpoint detection tool. Breakpoint classification is left to downstream tools (such as https://github.com/PapenfussLab/clove) as correct classification requires read depth information (DEL should have a CN change) and many events of interest do not fit easily into DEL/INS/DUP/INV (retrocopied genes, chromothripsis).

d-cameron commented 8 years ago

We currently use https://github.com/d-cameron/StructuralVariantAnnotation (still under active development) for downstream analysis and processing as it enables unified processing of output from GRIDSS, BreakDancer, CREST, DELLY, HYDRA, LUMPY, Manta, Pindel, Socrates, and TIGRA.

Alternatively, you could use au.edu.wehi.idsv.VcfBreakendToBedpe to just convert the GRIDSS VCF directly to BEDPE if that is a more useful file format for you analysis.

alkodsi commented 8 years ago

Thanks for all the answers. Clove seems to accept only BEDPE files from gridss, is that correct? Which bam files should I use with clove? By the way, when I convert the result vcf to BEDPE, I lose the numbers per category. I have multi-region samples so I wonder which is the best measure to decide if a breakpoint belongs to one category but not the others. Thank you again for your help.

d-cameron commented 8 years ago

VCF is fairly annoying to parse and deal with. If you're familiar with R, then my utility package helps quite a bit:

install_github("d-cameron/StructuralVariantAnnotation")
library(StructuralVariantAnnotation)
library(VariantAnnotation)
library(GenomicRanges)
library(stringr)

# Load VCF using bioconductor VariantAnnotation package
vcf <- readVcf("multisample.gridss.vcf", "hg19")

# Optional: filter out low confidence calls
# NB: if your VCF is very large, you might want to remove these
# rows using the linux grep utility as R is not particularly memory efficient
vcf <- vcf[rowRanges(vcf)$FILTER %in% c(".", "PASS"),]

# unpack the VCF fields columns into a data frame for ease of manipulation
df <- unpack(info(vcf))

# Filtering example:
# remove all call with any support from category 1
# If you have IC=1 as your normals, and IC=2 as your somatic category,
# you can subset based on the presence/lack of support
somaticvcf <- vcf[df$SR.1 + df$RSR.1 + df$RP.1 + df$ASSR.1 + df$ASRP.1 == 0,]
somaticloh <- vcf[df$SR.2 + df$RSR.2 + df$RP.2 + df$ASSR.2 + df$ASRP.2 == 0,]

These subsets can be output using writeVcf, or to BEDPE directly:

# Convert to BEDPE
gr <- breakpointRanges(vcfwithcallsofinterest)
bedpe <- data.frame(
    chrom1=seqnames(gr),
    start1=start(gr),
    end1=end(gr),
    chrom1=seqnames(partner(gr)),
    start1=start(partner(gr)),
    end1=end(partner(gr)),
    name=names(gr),
    score=gr$QUAL,
    strand1=strand(gr),
    strand2=strand(partner(gr))
    )
bedpe <- bedpe[str_detect(bedpe$name, "gridss[0-9]+o"),] # Just the lower of the two breakends
write.table(bedpe, "callsofinterest.gridss.bedpe", quote=FALSE, sep='\t', row.names=FALSE, col.names=FALSE)

If you have low purity/cross-contamination, you might not want an absolute zero threshold and might want to look at the contribution from each category

# you can generate per-category quality contributions
# Total contribution is the sum of split read (from both sides),
# discordant read pairs, and (pro-rata) assembly contribution
# from that category
df$Q.1 <- df$SRQ.1 + df$RSRQ.1 + df$RPQ.1 + (df$ASQ + df$RASQ) * (df$ASSR.1 + df$ASRP.1) / (df$ASSR + df$ASRP)
df$Q.2 <- df$SRQ.2 + df$RSRQ.2 + df$RPQ.2 + (df$ASQ + df$RASQ) * (df$ASSR.2 + df$ASRP.2) / (df$ASSR + df$ASRP)

# then you can filter along the lines of df$Q.1 / df$QUAL < 0.01