broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Determine the cause of very slow / failing intervals #2134

Open cwhelan opened 8 years ago

cwhelan commented 8 years ago

The worst offender right now is interval 13913 in the current data set:

13913 Y:16691826-16692366 hdfs://svdev-1-m:8020/user/cwhelan/outs_tws_kill_promiscuous_kmers/NA12878_PCR-_30X/fastq/assembly13913.fastq

This is the one that takes a long time in SGA correction and filtering, only to have SGA filter out all of the reads and therefore blow up.

It's suspicious in that it's on chromosome Y for NA12878, a female sample (there are quite a few Y intervals, actually).

@tedsharpe, if you want to take a look at where these reads are coming from, or whether there's any way we could filter intervals of this type out of FindBreakpointEvidence, go ahead and assign yourself.

tedsharpe commented 8 years ago

The region consists of an Alu, followed by a poly-A, followed by a piece of a LINE (L1P3). It seems quite hopeless, so I'll try to figure out a way to exclude it.
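One possible shape for such an exclusion, as a rough sketch only: drop any candidate interval that overlaps a list of masked regions before it is sent to assembly. Everything here is hypothetical and not taken from the GATK code base: the `IntervalExcluder` class, the `Interval` helper, and the `keepUnmasked` method are made-up names, and the mask entry and the second candidate interval in `main` are illustrative values, not real annotations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in GATK.
public class IntervalExcluder {

    /** A closed genomic interval written as contig:start-end. */
    static final class Interval {
        final String contig;
        final int start;
        final int end;

        Interval(String contig, int start, int end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }

        /** Parses text such as "Y:16691826-16692366". */
        static Interval parse(String text) {
            int colon = text.lastIndexOf(':');
            int dash = text.indexOf('-', colon);
            return new Interval(
                    text.substring(0, colon),
                    Integer.parseInt(text.substring(colon + 1, dash)),
                    Integer.parseInt(text.substring(dash + 1)));
        }

        /** True when both intervals are on the same contig and share at least one base. */
        boolean overlaps(Interval other) {
            return contig.equals(other.contig) && start <= other.end && other.start <= end;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + end;
        }
    }

    /** Returns the candidates that overlap none of the excluded regions, in input order. */
    static List<Interval> keepUnmasked(List<Interval> candidates, List<Interval> excluded) {
        List<Interval> kept = new ArrayList<>();
        for (Interval candidate : candidates) {
            boolean masked = false;
            for (Interval region : excluded) {
                if (candidate.overlaps(region)) {
                    masked = true;
                    break;
                }
            }
            if (!masked) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Interval> candidates = new ArrayList<>();
        candidates.add(Interval.parse("Y:16691826-16692366")); // interval 13913 from this issue
        candidates.add(Interval.parse("1:1000-2000"));         // made-up interval for illustration

        List<Interval> excluded = new ArrayList<>();
        excluded.add(Interval.parse("Y:16691000-16693000"));   // hypothetical mask entry

        for (Interval kept : keepUnmasked(candidates, excluded)) {
            System.out.println(kept);
        }
    }
}
```

The nested loop is quadratic in the worst case; with many intervals and mask regions, sorting both lists or using an interval tree would be the more usual choice. Where the mask itself would come from (a repeat annotation, contigs not expected for the sample, or something else) is left open here.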

SHuang-Broad commented 7 years ago

Still valid as of 2017/07/06.