GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

non-human genomes with many contigs #21

Closed carsonhh closed 8 years ago

carsonhh commented 8 years ago

I use samblaster as part of a pipeline for human and non-human genome processing (it was suggested to me by by Aaron Quinlan who used it for his speed seek pipeline).

I have observed that non-human genomes with many contigs do not work with samblaster because of a hard limit in the code (as well as datatype length limits). Also when I have a job that is just below the contig limit, runs use a very large amount of RAM.

This occurs because the hash matrix you build uses contig-contig matches for binning (basically grows N^2). However, if you assign each contig an offset early on, you can bin them using those offsets instead and generate bin-bin matches for the hash rather than contig-contig matches. This both reduces RAM usage, speeds up runtimes on files with many contigs, and allows even genomes with hundreds of thousands of contigs to run without issue.

I have attached a samblaster.cpp file with the changes I mentioned. For human genomes, there is no penalty for doing this extra processing step (runtimes stay the same with the bin-bin matrix vs the contig-contig matrix).

I also changes the position datatype to a signed 32bit int rather than an unsigned one because it can apparently be negative when there are softclips near the stat of the alignments (not an issue in human because of NNN padding in the assembly, but it is an issue in other species).

I hope these changes can be integrated into samblaster as it will open it's usage up to non-human species and greatly reduce RAM usage for those genomes samblaster.cpp.txt

.

GregoryFaust commented 8 years ago

Perhaps I am not understanding your proposal, or you are not understanding how samblaster works. As the chrom/contig information is not stored in the signature in the hash table, one needs a separate hash table for each contig/contig pair (and also for strand). If one assigns multiple contigs to a given bin, then samblaster will incorrectly identify read pairs as dups solely because they happen to be mapped to the same bin instead of actually being on the same chroms/contigs. While this will indeed decrease memory usage, it will create false positive duplicate assignments. Keep in mind that BOTH SIDES of the read pair must be on the corresponding chroms/contigs as another pair, with the same corresponding offsets, to be considered as a duplicate pair.

carsonhh commented 8 years ago

By using the position within the bin rather than the position within the contig, you still generate uniq signatures. Think of it like concatenating every contig in the genome based on their order in the header. You recalculate the position relative to the merged super-contig, and that new relative position stays uniq.

That is good enough for most genomes, but some plant genomes are 32 gigabases or larger (won’t fit in 32 bit int signature). So by further assigning a relative position within the bin (rather than genome wide) you can get a uniq signature just for that bin that fits in 32 bits. Think of it like merging the genome into a long super contig, then splitting that super contig into smaller equal sized pseudo-contigs. Each one represents it’s own bin, and the new position you assign relative to the pseudo-contig will be uniq for that bin.

You can do this by using an offset and thebn modulus operations, or if you assign bins to be 2-based (i.e. a 27 bit window for 134 megabase pseudo-contigs) then you can just do bit shift operations to get bins and bin relative positions.

Example using pseudo-code (the seqOffs element in the state struct has already been filled in with sequence offsets)

define BIN_SHIFT 27 //bin window is 27 bits wide

define BIN_MASK ((1 << 27)-1) //bin window is 27 bits wide

UINT64 seqOff = state->seqOffs[first->seqNum]; //genome relative position first->binNum = (seqOff + second->pos) >> BIN_SHIFT; //bin number first->binPos = (seqOff + second->pos) & BIN_MASK; //bin relative position

//signature is now calculated using bin position rather than contig position inline sgn_t calcSig(splitLine_t * first, splitLine_t * second) { // Total nonsense to get the compiler to actually work. UINT64 t1 = first->binPos; //sign becomes part of unsigned signature UINT64 t2 = t1 << 32; UINT64 final = t2 | second->binPos; return (sgn_t)final; }

—Carson

On May 16, 2016, at 12:40 PM, Gregory Faust notifications@github.com wrote:

Perhaps I am not understanding your proposal, or you are not understanding how samblaster works. As the chrom/contig information is not stored in the signature in the hash table, one needs a separate hash table for each contig/contig pair (and also for strand). If one assigns multiple contigs to a given bin, then samblaster will incorrectly identify read pairs as dups solely because they happen to be mapped to the same bin instead of actually being on the same chroms/contigs. While this will indeed decrease memory usage, it will create false positive duplicate assignments. Keep in mind that BOTH SIDES of the read pair must be on the corresponding chroms/contigs as another pair, with the same corresponding offsets, to be considered as a duplicate pair.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/GregoryFaust/samblaster/issues/21#issuecomment-219508938

carsonhh commented 8 years ago

One more simple example for how conversion works

offset chrA = 0; chrA position 200 then becomes genome wide position 200

offset chrB = 25000; chrB position 200 then becomes genome wide position 25200

It is therefore impossible for a position on one contig to map to the same genome relative position as a position on another contig.

—Carson

On May 16, 2016, at 1:00 PM, Carson Holt carsonhh@gmail.com wrote:

By using the position within the bin rather than the position within the contig, you still generate uniq signatures. Think of it like concatenating every contig in the genome based on their order in the header. You recalculate the position relative to the merged super-contig, and that new relative position stays uniq.

That is good enough for most genomes, but some plant genomes are 32 gigabases or larger (won’t fit in 32 bit int signature). So by further assigning a relative position within the bin (rather than genome wide) you can get a uniq signature just for that bin that fits in 32 bits. Think of it like merging the genome into a long super contig, then splitting that super contig into smaller equal sized pseudo-contigs. Each one represents it’s own bin, and the new position you assign relative to the pseudo-contig will be uniq for that bin.

You can do this by using an offset and thebn modulus operations, or if you assign bins to be 2-based (i.e. a 27 bit window for 134 megabase pseudo-contigs) then you can just do bit shift operations to get bins and bin relative positions.

Example using pseudo-code (the seqOffs element in the state struct has already been filled in with sequence offsets)

define BIN_SHIFT 27 //bin window is 27 bits wide

define BIN_MASK ((1 << 27)-1) //bin window is 27 bits wide

UINT64 seqOff = state->seqOffs[first->seqNum]; //genome relative position first->binNum = (seqOff + second->pos) >> BIN_SHIFT; //bin number first->binPos = (seqOff + second->pos) & BIN_MASK; //bin relative position

//signature is now calculated using bin position rather than contig position inline sgn_t calcSig(splitLine_t * first, splitLine_t * second) { // Total nonsense to get the compiler to actually work. UINT64 t1 = first->binPos; //sign becomes part of unsigned signature UINT64 t2 = t1 << 32; UINT64 final = t2 | second->binPos; return (sgn_t)final; }

—Carson

On May 16, 2016, at 12:40 PM, Gregory Faust <notifications@github.com mailto:notifications@github.com> wrote:

Perhaps I am not understanding your proposal, or you are not understanding how samblaster works. As the chrom/contig information is not stored in the signature in the hash table, one needs a separate hash table for each contig/contig pair (and also for strand). If one assigns multiple contigs to a given bin, then samblaster will incorrectly identify read pairs as dups solely because they happen to be mapped to the same bin instead of actually being on the same chroms/contigs. While this will indeed decrease memory usage, it will create false positive duplicate assignments. Keep in mind that BOTH SIDES of the read pair must be on the corresponding chroms/contigs as another pair, with the same corresponding offsets, to be considered as a duplicate pair.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/GregoryFaust/samblaster/issues/21#issuecomment-219508938

carsonhh commented 8 years ago

Try compiling samblaster with the cpp file I attached to the first message. You will see it produces identical results as the current samblaster, but is much faster and uses far less RAM on genomes with 30,000+ contigs. You can look through the code to follow the logic of the binning and offset modifications.

GregoryFaust commented 8 years ago

Thanks for the suggestion and the code. It really is MUCH faster when there are a lot of contigs. Closed in release 0.1.23.