dcjones / quip

Compressing next-generation sequencing data with extreme prejudice.
http://www.cs.washington.edu/homes/dcjones/quip/
BSD 3-Clause "New" or "Revised" License

Quip core dump #24

jkbonfield closed this issue 9 years ago

jkbonfield commented 9 years ago

Running quip 1.1.8 on ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12877_S1.bam causes a core dump.

I have a tiny 30-line SAM version of this BAM (just the header and 2 reads) and quip core dumps on that too. If I remove the very first sequence then it works, even on a 10 million read subset (everything bar the 1st read). I cannot see anything immediately obvious as to why.

I produced a minimal subset (you'll have to retabify it, of course). Changing the mate reference from "=" to "*" on the 1st line prevents the crash, but similar constructs appear elsewhere in the file and quip copes with them.

@RG     ID:NA12877      SM:NA12877
@SQ     SN:chrM LN:16571
foo     117     chrM    1       0       *       =       1       0       TGGTTAATAG      :C@@C>C<?A
foo     153     chrM    1       37      10M     =       1       0       GATCACAGGT      :++:CA>A>3
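
For anyone wanting to reproduce this, the retabifying step can be sketched in a few lines of Python. This is only a rough helper, not part of quip: the name `retabify_sam` is made up for illustration, and it assumes that no field in the pasted text contains an embedded space (true for the four lines above; `@CO` comment lines are split only once so their free text survives).

```python
# Hypothetical helper: turn whitespace-separated SAM text back into
# tab-separated SAM. Assumes fields themselves contain no spaces.

SAMPLE = """\
@RG     ID:NA12877      SM:NA12877
@SQ     SN:chrM LN:16571
foo     117     chrM    1       0       *       =       1       0       TGGTTAATAG      :C@@C>C<?A
foo     153     chrM    1       37      10M     =       1       0       GATCACAGGT      :++:CA>A>3
"""


def retabify_sam(text: str) -> str:
    """Return `text` with runs of whitespace between fields replaced by tabs."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines introduced by copy and paste
        if line.startswith("@CO"):
            # Comment header lines carry free text: split off the tag only.
            parts = line.split(None, 1)
        else:
            parts = line.split()
        out.append("\t".join(parts))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    import sys

    # Writes tab-separated SAM for the minimal subset to stdout.
    sys.stdout.write(retabify_sam(SAMPLE))
```

Redirecting the output to a file gives a SAM file whose two read lines each have the 11 mandatory tab-separated fields.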

dcjones commented 9 years ago

Thanks James!

jkbonfield commented 9 years ago

Thanks Daniel. However, I bear bad news: the decoder looks to have a similar bug, as it now crashes there too. No doubt it's similarly trivial.

The full file I am analysing is ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12877_S1.bam - a 122Gb platinum genomes file. I'm just trying to get some comparisons of CRAM vs Deez vs Quip (you're ahead, but without random access). To make sure I'm testing it to its fullest potential, does the -a option have any effect when using -r? Presumably only if there are unaligned data in the file.

dcjones commented 9 years ago

Serves me right for not waiting for your full dataset to download and compress. It should really be fixed now.

The -a option may have some effect, but it only works on unaligned reads. I could imagine viral/bacterial contaminants getting assembled and leading to some improvement, but it's otherwise unlikely to help very much on aligned human DNA sequencing.

jkbonfield commented 9 years ago

On Mon, Mar 09, 2015 at 05:42:09PM -0700, Daniel C. Jones wrote:

> Serves me right for not waiting for your full dataset to download and compress. It should really be fixed now.

Thanks Daniel. It works for me on a small test. I'm now running it on the full 122Gb of BAMiness.
