alexdobin / STAR

RNA-seq aligner
MIT License

Please consider upgrading htslib, also add BAM threading. #351

Open jkbonfield opened 6 years ago

jkbonfield commented 6 years ago

Notwithstanding the various bugs that have been fixed (including some security ones), there are tangible speed benefits when the upgrade is coupled with a multi-threading tweak.

Right now, if I run STAR on a very simple toy test, with deliberately small parameters to force multiple buffers' worth of data, I see that STAR multi-threads well when writing SAM but not when writing BAM:

SAM:

 Performance counter stats for '../STAR.orig2 --runThreadN 12 --genomeDir /var/tmp/STAR/source/__/ref --readFilesIn _.fastq --limitIObufferSize 2000000 --limitSjdbInsertNsj 10000 --limitOutSJcollapsed 10000':

     478664.565756 task-clock                #   11.465 CPUs utilized          
             55348 context-switches          #    0.000 M/sec                  
               245 CPU-migrations            #    0.000 M/sec                  
            145003 page-faults               #    0.000 M/sec                  
     1225324870037 cycles                    #    2.560 GHz                     [83.32%]
      742596423069 stalled-cycles-frontend   #   60.60% frontend cycles idle    [83.34%]
      546856340510 stalled-cycles-backend    #   44.63% backend  cycles idle    [66.67%]
      939336738531 instructions              #    0.77  insns per cycle        
                                             #    0.79  stalled cycles per insn [83.34%]
      186662742151 branches                  #  389.966 M/sec                   [83.33%]
       10518556410 branch-misses             #    5.64% of all branches         [83.33%]

      41.748551439 seconds time elapsed

BAM:

 Performance counter stats for '../STAR.orig2 --runThreadN 12 --genomeDir /var/tmp/STAR/source/__/ref --outSAMtype BAM Unsorted --readFilesIn _.fastq --limitIObufferSize 2000000 --limitSjdbInsertNsj 10000 --limitOutSJcollapsed 10000':

     486629.066273 task-clock                #    4.510 CPUs utilized          
             64418 context-switches          #    0.000 M/sec                  
              2899 CPU-migrations            #    0.000 M/sec                  
           3159420 page-faults               #    0.006 M/sec                  
     1167404008204 cycles                    #    2.399 GHz                     [83.34%]
      548654901224 stalled-cycles-frontend   #   47.00% frontend cycles idle    [83.33%]
      424656231332 stalled-cycles-backend    #   36.38% backend  cycles idle    [66.67%]
     1178512197622 instructions              #    1.01  insns per cycle        
                                             #    0.47  stalled cycles per insn [83.34%]
      205388024902 branches                  #  422.063 M/sec                   [83.33%]
       12562668339 branch-misses             #    6.12% of all branches         [83.32%]

     107.909751558 seconds time elapsed

This bottlenecks on the bgzf_write calls because STAR doesn't multi-thread the bgzf encoding. Watching top, I see the CPU pegged at a steady 450%. Uncompressed BAM would be fine, but we bottleneck on zlib because it is single-threaded code. It's possible to tweak this, with a hand-wavy guess at the number of threads, in Parameters.cpp:

diff --git a/source/Parameters.cpp b/source/Parameters.cpp
index 9187548..5f816ed 100644
--- a/source/Parameters.cpp
+++ b/source/Parameters.cpp
@@ -576,6 +576,7 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
                     outBAMfileUnsortedName=outFileNamePrefix + "Aligned.out.bam";
                 };
                 inOut->outBAMfileUnsorted = bgzf_open(outBAMfileUnsortedName.c_str(),("w"+to_string((long long) outBAMcompression)).c_str());
+                if (runThreadN) bgzf_mt(inOut->outBAMfileUnsorted, runThreadN/4, 256);
             };
             if (outBAMcoord) {
                 if (outStd=="BAM_SortedByCoordinate") {

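For reference, the htslib entry point used above is declared in htslib/bgzf.h. A minimal annotated sketch of the call (the prototype is htslib's own; the runThreadN/4 split is the same hand-wavy guess as in the diff):

    #include <htslib/bgzf.h>

    // Prototype from htslib/bgzf.h:
    //   int bgzf_mt(BGZF *fp, int n_threads, int n_sub_blks);
    // In the old threading code, n_sub_blks is the number of BGZF blocks
    // compressed per thread batch (64-256 recommended); the rewritten
    // threading in htslib >= 1.4 accepts but ignores it.
    if (runThreadN > 1)
        bgzf_mt(inOut->outBAMfileUnsorted, runThreadN / 4, 256);
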
This helps: we go from 4.5x CPU utilised to 7.6x with 12 threads requested. Watching top, I see the utilisation yo-yo between 1150% and 450%. This is because the old htslib threading was rather "boom and bust" - it buffered up work until there was enough and then ran one big multi-threaded compression job. This isn't very efficient.

 Performance counter stats for '../STAR.orig --runThreadN 12 --genomeDir /var/tmp/STAR/source/__/ref --outSAMtype BAM Unsorted --readFilesIn _.fastq --limitIObufferSize 2000000 --limitSjdbInsertNsj 10000 --limitOutSJcollapsed 10000':

     428828.858468 task-clock                #    7.579 CPUs utilized          
             64574 context-switches          #    0.000 M/sec                  
              1099 CPU-migrations            #    0.000 M/sec                  
           3172555 page-faults               #    0.007 M/sec                  
     1181394579760 cycles                    #    2.755 GHz                     [83.33%]
      561132074303 stalled-cycles-frontend   #   47.50% frontend cycles idle    [83.34%]
      430009173538 stalled-cycles-backend    #   36.40% backend  cycles idle    [66.70%]
     1190894424120 instructions              #    1.01  insns per cycle        
                                             #    0.47  stalled cycles per insn [83.35%]
      209240036497 branches                  #  487.934 M/sec                   [83.31%]
       12811913926 branch-misses             #    6.12% of all branches         [83.32%]

      56.581617746 seconds time elapsed

Now compare the impact of linking against htslib 1.6, with the same 3 threads (12/4) specified for bgzf_mt:

 Performance counter stats for '../STAR --runThreadN 12 --genomeDir /var/tmp/STAR/source/__/ref --outSAMtype BAM Unsorted --readFilesIn _.fastq --limitIObufferSize 2000000 --limitSjdbInsertNsj 10000 --limitOutSJcollapsed 10000':

     509082.457568 task-clock                #   13.163 CPUs utilized          
            247912 context-switches          #    0.000 M/sec                  
              1409 CPU-migrations            #    0.000 M/sec                  
           3156016 page-faults               #    0.006 M/sec                  
     1227771853825 cycles                    #    2.412 GHz                     [83.34%]
      614566254133 stalled-cycles-frontend   #   50.06% frontend cycles idle    [83.32%]
      472721517830 stalled-cycles-backend    #   38.50% backend  cycles idle    [66.70%]
     1182907221955 instructions              #    0.96  insns per cycle        
                                             #    0.52  stalled cycles per insn [83.38%]
      205915091555 branches                  #  404.483 M/sec                   [83.33%]
       12577370748 branch-misses             #    6.11% of all branches         [83.31%]

      38.674906843 seconds time elapsed

This has slightly over-egged the pudding and gone beyond the 12x we asked for: top showed a steady 1300-1350% CPU utilisation with no yo-yoing back and forth. Clearly the method for choosing the number of compression threads needs some refinement, but perhaps this can just be a second command-line parameter, as sketched below.
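A minimal sketch of what that second parameter might look like (outBGZFthreadN is a hypothetical option name, not an existing STAR parameter):

    // Hypothetical sketch: outBGZFthreadN stands in for a new user-facing
    // parameter giving the BGZF compression thread count; 0 means "derive
    // it from runThreadN", as the diff above does.
    int bgzfThreadN = outBGZFthreadN > 0 ? outBGZFthreadN
                                         : std::max(1, runThreadN / 4);
    if (bgzfThreadN > 1)
        bgzf_mt(inOut->outBAMfileUnsorted, bgzfThreadN, 256);

That would keep the current behaviour as the default while letting users tune compression threads independently of alignment threads.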

The key thing here, though, is that the same STAR source went from 7.6x to 13.1x CPU utilisation (and from 56.6 to 38.7 seconds elapsed), purely from upgrading htslib.

There are some caveats to this approach. Linking a newer htslib pulls in extra libraries for the CRAMv3 support (-lbz2 -llzma), which in turn means more build dependencies. This can be worked around by a minor tweak to the STAR Makefile to run htslib's configure script and disable the more recent dependencies. E.g.:

diff --git a/source/Makefile b/source/Makefile
index 2b5a34c..992f370 100644
--- a/source/Makefile
+++ b/source/Makefile
@@ -98,6 +98,7 @@ endif
 htslib : htslib/libhts.a

 htslib/libhts.a :
+       cd htslib && ./configure --disable-libcurl --disable-bz2 --disable-lzma
        $(MAKE) -C htslib lib-static

 parametersDefault.xxd: parametersDefault

This appears to be sufficient.

Is this something you would be interested in?

jkbonfield commented 6 years ago

I should add that I haven't tested this on anything large, so perhaps the boom-and-bust nature of the old BAM multi-threading isn't serious when given much larger IO buffers, but I still have my doubts that it would keep up. Either way, we more or less doubled the speed of BAM writing in multi-threaded environments.

alexdobin commented 6 years ago

Hi James,

thanks for the thorough investigation and the excellent suggestions. Switching to the newer version of htslib is long overdue; some compatibility issues have also been reported. As always, I have to balance fixing something that still works against adding new features, which is what I am working on now. But I am hoping to get to the BAM overhaul and incorporate your suggestions next year.

Cheers Alex

jkbonfield commented 6 years ago

Thanks for the reply.

Some of the new warnings produced are complaints about unchecked return values from functions. Fixing this is probably a good thing anyway, as we have seen before how tools can fail silently when, e.g., running out of disk space. A sketch of the kind of checking involved follows.
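For illustration only, a minimal sketch using htslib's actual return conventions: bgzf_write returns the byte count written or a negative value on failure, and bgzf_close returns non-zero on error - which matters, since disk-full errors often only surface at flush or close time.

    #include <htslib/bgzf.h>
    #include <cstdio>
    #include <cstdlib>

    // Sketch: fail loudly instead of silently truncating the BAM output
    // when a write fails (e.g. out of disk space).
    static void checked_bgzf_write(BGZF *fp, const void *data, size_t len) {
        if (bgzf_write(fp, data, len) < 0) {
            fprintf(stderr, "FATAL: bgzf_write failed\n");
            exit(EXIT_FAILURE);
        }
    }

    // Errors buffered during earlier writes can surface here too.
    static void checked_bgzf_close(BGZF *fp) {
        if (bgzf_close(fp) < 0) {
            fprintf(stderr, "FATAL: bgzf_close failed\n");
            exit(EXIT_FAILURE);
        }
    }
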

I don't know if anyone has asked for it, but maybe there is room for direct CRAM output too. Note this doesn't have to be reference based (and indeed it's best not to be if you're trying to write an unsorted file). Options include no-reference (best if unsorted), embedded reference (best if the reference isn't publicly accessible) or external reference (the default). Feel free to ask questions if you have any desire to go down the CRAM road but are unsure of the specifics.
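By way of illustration, and with the caveat that STAR currently writes BAM through the BGZF layer directly rather than the sam.h API, the reference modes above are selected with hts_set_opt (a minimal sketch, error checking omitted; the filename is made up):

    #include <htslib/sam.h>

    // "wc" selects CRAM output.
    samFile *out = sam_open("Aligned.out.cram", "wc");

    // Pick ONE of the reference modes:
    hts_set_opt(out, CRAM_OPT_NO_REF, 1);              // no-reference: best for unsorted
    // hts_set_opt(out, CRAM_OPT_EMBED_REF, 1);        // embed the reference in the file
    // hts_set_opt(out, CRAM_OPT_REFERENCE, "ref.fa"); // external reference (the default)

    // ... sam_hdr_write(out, hdr); then sam_write1(out, hdr, rec) per record ...
    sam_close(out);
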

keithj commented 6 years ago

Perhaps your Makefile could allow a user to disable building the internal htslib in favour of an installed copy (specified by the usual environment variables)? That way we could build (at our own risk) against any version we would like to try.

alexdobin commented 6 years ago

Hi Keith,

this is a good suggestion; however, I think there could be a problem with it as well. I had to slightly modify one of the files in the htslib library (bam_cat.c), which probably makes STAR incompatible with newer releases of htslib. It's on my TODO list, but I am not sure when I will get to it.

Cheers Alex