BenLangmead / bowtie2

A fast and sensitive gapped read aligner
GNU General Public License v3.0

Increasing --threads increases execution time #437

Closed bede closed 8 months ago

bede commented 1 year ago

Firstly, thank you for developing and maintaining Bowtie2! I noticed that 2.5.1 behaves strangely as the --threads parameter is varied. Beyond a certain number of threads, execution time increases considerably and CPU utilisation decreases. I've observed this with multiple read datasets on both an x86_64 Ubuntu VM and my arm64 macOS machine, both using the appropriate GitHub release binaries. The behaviour is reproducible on any one machine, although a given read dataset will not necessarily trigger the problem on both my laptop and the VM.

Here I used ~20m paired mixed bacterial and human reads with an index built from the human T2T reference + HLA sequences.

time bowtie2 -x human-index --threads ${threads} -1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz > /dev/null
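For anyone reproducing this, timings like these can be collected per thread count with a loop along the following lines (a sketch, not the reporter's exact script; it only prints each command so the sweep can be reviewed before running, and the index/FASTQ names are taken from the command above):

```shell
# Sketch of a timing sweep over thread counts (prints each command
# instead of running it; drop the leading "echo" to execute for real).
for threads in 1 2 4 8 16 24 32; do
  echo time bowtie2 -x human-index --threads "${threads}" \
    -1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz ">" /dev/null
done
```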

[Chart: wall-clock execution time versus --threads]

VM info

Ubuntu 22.04 LTS

$ bowtie2 --version
/data/bowtie2/bin/bowtie2-align-s version 2.5.1
64-bit
Built on 0ba86a911637
Wed Jan 18 03:20:56 UTC 2023
Compiler: gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) 
Options: -O3 -msse2 -funroll-loops -g3 -g -O2 -fvisibility=hidden -I/hbb_exe_gc_hardened/include -ffunction-sections -fdata-sections -fstack-protector -D_FORTIFY_SOURCE=2 -fPIE -std=c++11 -DPOPCNT_CAPABILITY -DNO_SPINLOCK -DWITH_QUEUELOCK=1 -DWITH_ZSTD
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
$ uname -r
5.15.0-1030-oracle
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7J13 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4890.80
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good
                          nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c r
                         drand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmca
                         ll fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd a
                         rat npt nrip_save umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
Virtualization features: 
  Virtualization:        AMD-V
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   1 MiB (16 instances)
  L1i:                   1 MiB (16 instances)
  L2:                    8 MiB (16 instances)
  L3:                    64 MiB (4 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
snayfach commented 11 months ago

Rolling back to bowtie2-2.4.4-linux-x86_64 solved the issue for me. More recent versions failed to use the specified number of threads: on a machine with 224 CPUs, only 2 were being used with the latest version.

bede commented 11 months ago

Thank you @snayfach, rolling back to 2.4.4 also worked for me. Great news.

ch4rr0 commented 11 months ago

Hello all,

Thank you for your patience. My initial hunch is that this issue was introduced in 2.5.0 with the asynchronous changes that we made to input reading and writing. @bede, since you are able to reproduce this issue, would you be willing to test v2.4.5 and v2.5.0?

bede commented 11 months ago

Thanks @ch4rr0, I tested v2.4.5 and v2.5.0 as you suggested on a 32-core machine, specifying 32 threads. 2.4.5 consistently used ~3200% CPU (reported by top), as expected.

However, 2.5.0 was erratic, jumping between 400% and 3200%. I haven't compared 2.5.0 and 2.5.1, but it is clear to me that this issue began with 2.5.0.
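Those utilisation figures can also be sampled non-interactively along these lines (a sketch; `bowtie2-align-s` is the process name from the version output earlier in this thread, and the %CPU column position assumes procps top's default batch-mode layout):

```shell
# Sample the aligner's %CPU ten times at 2 s intervals with top in batch
# mode; the bracketed character class stops pgrep matching this script.
pid=$(pgrep -n -f 'bowtie2-align[-]s' || true)
if [ -n "$pid" ]; then
  top -b -d 2 -n 10 -p "$pid" | awk -v p="$pid" '$1 == p {print $9}'
else
  echo "no bowtie2 process found"
fi
```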

ch4rr0 commented 11 months ago

@bede, I pushed a change-set to the bug_fixes branch. Would you be willing to test whether it has any effect on thread behavior?

ch4rr0 commented 11 months ago

@snayfach -- would you be willing to test since I have not heard back from bede yet?

bede commented 11 months ago

Sorry for the delay @ch4rr0, I just got time to test.

I'm afraid the same behaviour remains with the bug_fixes branch in my testing: 800-1600% CPU usage with 32 physical cores, whereas 2.4.5 gives >3100%.

$ git status
On branch bug_fixes
Your branch is up to date with 'origin/bug_fixes'.

nothing to commit, working tree clean
$ bowtie2 --version
/data16/bowtie2-bug_fixes_2023-08-02/bowtie2-align-s version 2.5.1
64-bit
Built on pikachu
Wed Aug  2 13:27:40 UTC 2023
Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
Options: -O3 -msse2 -funroll-loops -g3 -std=c++11 -DPOPCNT_CAPABILITY -DNO_SPINLOCK -DWITH_QUEUELOCK=1
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
ch4rr0 commented 11 months ago

I have been able to recreate this issue, and indeed my latest push does not resolve it. I have found the cause and am currently working on a solution. In the meantime, increasing --reads-per-batch seems to keep the threads busy for longer and decreases contention on the producer. The current default is 16; increasing that number to 1024 or 2048 seemed like a sweet spot on my hardware.
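Applied to the command line from the original report, that workaround would look like this (a sketch that only prints the command; 32 threads and the 2048 batch size are the values discussed in this thread):

```shell
# Workaround sketch: a larger --reads-per-batch means fewer, larger
# handoffs from the single reader to the worker threads, so less lock
# contention on the producer. Drop the "echo" to run it for real.
echo bowtie2 -x human-index --threads 32 --reads-per-batch 2048 \
  -1 all.bwa.read1.fastq.gz -2 all.bwa.read2.fastq.gz ">" /dev/null
```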

Thank you all for your patience.

ch4rr0 commented 10 months ago

I committed a few changes to bug_fixes that, in my testing, seem to resolve the issue. Would you be willing to test these changes?

bede commented 10 months ago

Thanks @ch4rr0, your changes in 6f6458c5dae6ef8931c6024b1a20b5c2e275607d have resolved the issue for my test case. Looks good to me!

bede commented 8 months ago

I'm reasonably satisfied that this issue has been fixed in 2.5.2, thanks! 🙏