AfshinLab / BLR

MIT License

Add sorting to tagfastq for direct ema formatted output #37

Closed pontushojer closed 4 years ago

pontushojer commented 4 years ago

Fixes https://github.com/FrickTobias/BLR/issues/219 by no longer sorting barcodes in lexicographical order. Instead, the order is now guaranteed not to place barcodes sharing the same 16 bp prefix next to each other, which was the cause of the issue. This is enabled through the use of heap sorting with a custom index assigned to each barcode.

The script now writes temporary files when --mapper ema is selected. These are then sorted and written to the final FASTQs. Memory load is reasonably low, but runtime is about double that of a run without sorting, which is expected.
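For readers who want a concrete picture of the approach, here is a minimal sketch of sorting through temporary files with a heap merge keyed on a per-barcode index. It is not the code in this PR: the function names `assign_indices` and `sort_by_index`, the record layout, and the index scheme (rank within each prefix group first, then prefix) are all assumptions made for illustration. This particular scheme also does not guarantee separation in every case, for example when a single prefix group holds most of the barcodes.

```python
import heapq
import os
import tempfile
from collections import defaultdict

PREFIX_LEN = 16  # prefix length mentioned in the PR description


def assign_indices(barcodes, prefix_len=PREFIX_LEN):
    """Hypothetical index scheme: order barcodes by (rank in prefix group, prefix)."""
    groups = defaultdict(list)
    for barcode in sorted(set(barcodes)):
        groups[barcode[:prefix_len]].append(barcode)
    keyed = [
        (rank, prefix, barcode)
        for prefix, members in groups.items()
        for rank, barcode in enumerate(members)
    ]
    keyed.sort()
    return {barcode: i for i, (_, _, barcode) in enumerate(keyed)}


def sort_by_index(records, index, chunk_size=1000):
    """Sort (barcode, read_name) records by barcode index using temporary files."""
    paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it to its own temporary file.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".tsv")
        with os.fdopen(fd, "w") as handle:
            for idx, barcode, name in chunk:
                handle.write(f"{idx}\t{barcode}\t{name}\n")
        paths.append(path)
        chunk.clear()

    for barcode, name in records:
        chunk.append((index[barcode], barcode, name))
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    handles = [open(path) for path in paths]
    try:
        # heapq.merge reads the sorted files lazily, so memory use stays low.
        merged = heapq.merge(
            *handles, key=lambda line: int(line.split("\t", 1)[0])
        )
        return [tuple(line.rstrip("\n").split("\t")[1:]) for line in merged]
    finally:
        for handle in handles:
            handle.close()
        for path in paths:
            os.remove(path)
```

In this sketch every record is written once and read once more during the merge, which is consistent with a runtime of roughly double, while only one chunk plus one line per temporary file is held in memory at a time.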

pontushojer commented 4 years ago

I have now run some tests on this addition. A full run with EMA takes about 36 h, which is quite a bit longer than with bowtie2, our current default. The majority of the time is spent on mapping, about 26 h in total, much longer than the usual ~16 h for bowtie2.

All in all, the EMA mapper seems to be functioning with the new sorting introduced here.