jkbonfield opened 6 years ago
I should add that I haven't tested this on anything large, so perhaps the boom & bust nature of the old BAM multi-threading isn't serious when given much larger IO buffers, but I still have my doubts that it would work well. We more or less doubled the speed of BAM writing in multi-threaded environments.
Hi James,
thanks for a thorough investigation and the excellent suggestions. Switching to the newer version of htslib is long overdue; there were also some compatibility issues reported. As always, I have to balance fixing something that still works against adding new features, which I am working on now. But I am hoping to get to work on the BAM overhaul and incorporate your suggestions next year.
Cheers Alex
Thanks for the reply.
Some of the new warnings produced are complaints about the lack of checking of return values from functions. Fixing this is probably a good thing anyway, as we've seen before how unchecked writes lead to tools failing silently when, e.g., running out of disk space (a sketch of the sort of check is below).
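A minimal sketch of the idea, assuming output goes through bgzf_write; the wrapper name and message are made up:

```c
#include <stdio.h>
#include <stdlib.h>
#include "htslib/bgzf.h"

/* bgzf_write returns a negative value on error, so an unchecked call
 * can silently drop data when, for example, the disk fills up. */
static void checked_bgzf_write(BGZF *fp, const void *buf, size_t len) {
    if (bgzf_write(fp, buf, len) < 0) {
        fprintf(stderr, "Fatal: BAM write failed (disk full?)\n");
        exit(EXIT_FAILURE);
    }
}
```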
I don't know if anyone has asked for it, but maybe there is room for direct CRAM output too. Note this doesn't have to be reference based (and indeed it's best not to be if you're writing an unsorted file). Options include no-reference (best if unsorted), embedded reference (best if the reference isn't publicly accessible) or external reference (the default). Feel free to ask questions if you have any desire to go down the CRAM road but are unsure of the specifics.
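For a flavour of what the htslib side looks like, here is a sketch (not STAR code; `open_cram` and the filenames are illustrative):

```c
#include "htslib/sam.h"

/* Open a CRAM output and choose a reference strategy.
 * Only one of the three option calls below is needed. */
static samFile *open_cram(const char *fn) {
    samFile *out = sam_open(fn, "wc");        /* "wc" = write CRAM */
    if (!out) return NULL;
    hts_set_opt(out, CRAM_OPT_NO_REF, 1);     /* no-reference: best for unsorted output */
    /* hts_set_opt(out, CRAM_OPT_EMBED_REF, 1);  embedded reference */
    /* hts_set_fai_filename(out, "ref.fa");      external reference (the default) */
    return out;
}
```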
Perhaps your Makefile could support allowing a user to disable building the internal htslib in favour of using an installed copy (specified by the usual environment variables)? That way we could build (at our own risk) against any version we would like to try.
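Something of this shape, say (the SYSTEM_HTSLIB toggle and variable names are hypothetical, just to illustrate):

```make
# Hypothetical toggle: 'make SYSTEM_HTSLIB=1' links an installed htslib,
# found via the usual environment variables (CPATH, LIBRARY_PATH, etc.),
# instead of compiling the bundled copy.
ifeq ($(SYSTEM_HTSLIB),1)
    HTSLIB_DEP =
    HTSLIB_LIB = -lhts
else
    HTSLIB_DEP = htslib/libhts.a
    HTSLIB_LIB = htslib/libhts.a
endif
```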
Hi Keith,
this is a good suggestion; however, I think there could be a problem with it as well. I had to slightly modify one of the files in the htslib library (bam_cat.c), which probably makes it incompatible with newer releases of htslib. It's on my TODO list, but I am not sure when I will get to it.
Cheers Alex
Notwithstanding the various bugs that have been fixed (including some security ones), there are tangible speed benefits when coupled with a multi-threading tweak.
Right now, if I run STAR on a very noddy and simple test, with deliberately small parameters to force multiple buffers' worth of data, I see that STAR multi-threads well when writing SAM but not when writing BAM:
SAM:
BAM:
This bottlenecks on the bgzf_write calls because STAR doesn't multi-thread the bgzf encoding. Looking in `top` I see the CPU pegged at a steady 450%. Uncompressed BAM would be fine, but basically we bottleneck on zlib as it's single-threaded code. It's possible to tweak this, with a hand-wavy guess at the number of threads, in Parameters.cpp (a sketch of the sort of change follows).
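Something along these lines, where `bgzfOut` and `runThreadN` are illustrative names rather than STAR's actual variables:

```cpp
#include "htslib/bgzf.h"

// Illustrative helper: enable multi-threaded BGZF compression on an
// already-open BAM output. 256 is the sub-block count taken by the
// old bgzf_mt() API (ignored by newer htslib).
static void enableBgzfThreads(BGZF *bgzfOut, int runThreadN) {
    int bgzfThreads = runThreadN / 4 > 0 ? runThreadN / 4 : 1;
    bgzf_mt(bgzfOut, bgzfThreads, 256);
}
```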
This helps, and we go from 4.5x CPU utilised to 7.6x with 12 threads requested. Looking in `top` I see it yo-yo between 1150% and 450%. This is because the old htslib threading was rather "boom and bust": it buffered up work until there was enough and then did one big multi-threaded job to compress it, which isn't very efficient.

Now see the impact of linking against htslib 1.6 with the same 3 threads (12/4) specified for bgzf_mt usage:
This has slightly over-egged the pudding and gone beyond the 12x we asked for.
`top` showed a steady 1300-1350% CPU utilisation with no yo-yoing back and forth. Clearly it needs some tuning of how to pick the correct number of threads, but perhaps that can just be a second command-line parameter.

The key thing here, though, is that the same STAR source went from 7.6x to 13.1x CPU utilisation, purely from upgrading htslib.
There are some caveats to this approach. Linking a newer htslib pulls in new library requirements for the CRAMv3 support (-lbz2 -llzma), which in turn means more dependencies. This could be worked around by a minor tweak to the STAR Makefile to run htslib's configure script and disable the more recent dependencies. Eg:
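Presumably something like this (from memory; the --disable flags exist in htslib's configure from around 1.5 onwards):

```sh
# Build the bundled htslib without the newer optional codecs,
# removing the -lbz2 and -llzma link-time dependencies.
cd htslib && ./configure --disable-bz2 --disable-lzma && make
```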
This appears to be sufficient.
Is this something you would be interested in?