wangyugui opened this issue 7 years ago
bowtie2 command: perl /usr/hpc-bio/bowtie/bin/bowtie2 --no-mixed --no-discordant --gbar 1000 --end-to-end -k 200 -q -X 800 -x /ssd//biowrk/juglans.hybrid.mRNA/trinity/Trinity.fa.bowtie2 -1 /biowrk/juglans.hybrid.mRNA/fastq.clean/A3/A3-2/_1.fq -2 /biowrk/juglans.hybrid.mRNA/fastq.clean/A3/A3-2/_2.fq -p 36
Hello @wangyugui,
Can you please provide the sample files, so that I can attempt to recreate this issue?
In my environment, it happens in multiple cases. It looks like a thread-switch / thread-synchronization problem, with no relation to the fastq samples.
case 1: E7, 18 cores (36 threads) x 4; bowtie2 align with 36 threads x 4 processes -> high %sys CPU (thread sync? NOT thread overload)
case 2: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 2 processes -> high %sys CPU (thread switching? thread overload)
case 3: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 1 process -> LOW %sys CPU
bowtie2 2.3.0 shows even higher %sys usage.
top - 16:22:11 up 2:15, 1 user, load average: 143.27, 140.18, 115.32
Tasks: 1106 total, 3 running, 1103 sleeping, 0 stopped, 0 zombie
%Cpu(s): 18.8 us, 80.4 sy, 0.0 ni, 0.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15852327+total, 13947386+free, 10714515+used, 83348944 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14644515+avail Mem

  PID USER PR NI    VIRT    RES  SHR S  %CPU %MEM    TIME+ COMMAND
10419 root 20  0 35.664g 0.031t 2488 R 14254  2.1 2446:58 /usr/hpc-bio/bowtie/bin/bowtie2-align-s --wrapper basic-0 --local -a +
10416 root 20  0  169052 150360  800 R  58.0  0.0 10:02.81 samtools view -F4 -Su -
10417 root 20  0 60.880g 0.059t  740 S   7.5  4.0  1:26.52 samtools sort -m 5368709120 -@ 144 -no - -
10752 root 20  0   57272   3292 1568 R   1.0  0.0  0:00.11 /bin/top -i
 1200 root 20  0       0      0    0 S   0.3  0.0  0:00.09 [kworker/79:1]
I ran bowtie2's aligner on the HG19 genome and collected statistics on the system calls using strace. bowtie2's most common system call is read, which is expected. When bowtie2 2.3 is run with the maximum number of threads available, it creates contention with other OS and process threads. This results in increased calls to sched_yield.
Why do we need sched_yield? sched_yield triggers a task switch (thread switch), and task switches use a lot of CPU.
sched_yield is designed for real-time scheduling only. We should not use it in bowtie2.
sched_yield is one of the reasons why %sys CPU is high in some cases only.
If there are no other runnable processes, sched_yield may NOT cause a thread switch, but under heavy load sched_yield will always cause a thread switch, and then CPU is wasted.
case 2: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 2 processes -> high %sys CPU (thread switching? thread overload)
case 3: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 1 process -> LOW %sys CPU
This looks like a BUG in bowtie2's threading. The CPU is fully used, but much of it is wasted. If we fix this problem, we could improve performance by about 3x in some cases.
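For illustration only (not bowtie2 code; all names and numbers here are made up): a minimal C++ program that launches more threads than hardware threads and makes them contend on one lock. With the yield-based worker, every failed lock attempt calls sched_yield and turns into a context switch, which shows up as %sys in top; with the std::mutex worker, waiters block in the kernel and %sys stays low.

// Minimal sketch (not bowtie2 code): a yield-based spin lock versus a
// blocking std::mutex under oversubscription. Illustrative only.
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>
#include <sched.h>   // sched_yield()

std::atomic_flag spin = ATOMIC_FLAG_INIT;
std::mutex       blocking;
long             counter = 0;

// Spin-then-yield: if the lock is held, give up the time slice and retry.
// Under heavy load every failed attempt becomes a context switch (%sys).
void yield_worker(int iters) {
    for (int i = 0; i < iters; ++i) {
        while (spin.test_and_set(std::memory_order_acquire))
            sched_yield();                     // the pattern in question
        ++counter;                             // trivial critical section
        spin.clear(std::memory_order_release);
    }
}

// Blocking mutex: a waiting thread sleeps in the kernel instead of
// yielding repeatedly, so contention costs far less CPU.
void mutex_worker(int iters) {
    for (int i = 0; i < iters; ++i) {
        std::lock_guard<std::mutex> g(blocking);
        ++counter;
    }
}

int main() {
    // Launch 2x the hardware threads to mimic "48 threads on 24 hw threads".
    unsigned n = 2 * std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(yield_worker, 100000);   // swap in mutex_worker to compare
    for (std::thread& t : pool) t.join();
    return 0;
}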
[root@R930 ~]# opreport -l
Using /usr/hpc-bio/oprofile_data/samples/ for samples directory.
WARNING: Lost samples detected! See /usr/hpc-bio/oprofile_data/samples/operf.log for details.
warning: /no-vmlinux could not be found.
CPU: Intel Broadwell microarchitecture, speed 3.3e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples   %        image name       symbol name
1869492   59.3552  no-vmlinux       /no-vmlinux
 306897    9.7438  bowtie2-align-s  SwAligner::alignNucleotidesEnd2EndSseU8(int&, bool)
 185116    5.8773  libc-2.17.so     sched_yield
  92550    2.9384  bowtie2-align-s  RedundantAlns::add(AlnRes const&)
I disabled _FAST_MUTEX_ASM_ and TBB, and the %sys time became low.
%CPU is only 2355 with -p 36, but no CPU is wasted in system calls.
So we need to improve it in another way, such as changing the locking from once per fastq record to once per set of fastq records (see the sketch after the top snapshot below).
%Cpu(s): 16.9 us, 0.2 sy, 0.0 ni, 82.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15852327+total, 15108259+free, 11641760 used, 62765064 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 15599888+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20146 root 20 0 6462724 1.417g 2060 S 2355 0.1 100:40.65 bowtie2-align-s
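To make the "one set of records per lock" idea concrete, here is a rough sketch (hypothetical names only, not bowtie2's actual reader): each worker takes the input lock once per batch of FASTQ records instead of once per record, so the lock is contended far less often.

// Sketch only (hypothetical names, not bowtie2's API): take the input lock
// once per batch of FASTQ records instead of once per record.
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex    in_lock;
std::ifstream fastq("reads.fq");              // hypothetical input file

// Read up to `batch` 4-line FASTQ records while holding the lock once.
std::vector<std::string> next_batch(size_t batch) {
    std::vector<std::string> records;
    std::lock_guard<std::mutex> g(in_lock);
    for (size_t r = 0; r < batch; ++r) {
        std::string rec, line;
        for (int i = 0; i < 4 && std::getline(fastq, line); ++i)
            rec += line + '\n';
        if (rec.empty()) break;               // end of input
        records.push_back(std::move(rec));
    }
    return records;                           // align these outside the lock
}

void worker() {
    for (;;) {
        std::vector<std::string> batch = next_batch(64);  // 64 reads per lock acquisition
        if (batch.empty()) break;
        // ... align each record here, with no lock held ...
    }
}

int main() {
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();
    return 0;
}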
Thanks for these reports. Improving thread scaling is something we're actively working on.
Note that sched_yield is not called by Bowtie 2 directly -- it's probably being called because some global lock (e.g. memory allocation) is contended.
I found sched_yield in the bowtie2 source and in the TBB library.
# grep sched_yield -nr .
./fast_mutex.h:127: sched_yield();
./tinythread.h:686: sched_yield();
# readelf -s /usr/lib64/libtbb.so.2 |grep yield
25: 0000000000000000 0 FUNC GLOBAL DEFAULT UND sched_yield@GLIBC_2.2.5 (3)
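For context, those hits are inside a user-space spin lock that calls sched_yield whenever it fails to take the lock. A simplified sketch of the pattern (not a verbatim copy of fast_mutex.h or tinythread.h):

// Simplified sketch of the spin-then-yield locking pattern; not a verbatim
// copy of fast_mutex.h or tinythread.h.
#include <atomic>
#include <sched.h>

class yielding_spinlock {
    std::atomic<int> word{0};                 // 0 = free, 1 = held
public:
    bool try_lock() { return word.exchange(1, std::memory_order_acquire) == 0; }
    void lock() {
        // Every failed attempt gives up the time slice; under load each
        // sched_yield() call is a context switch charged to %sys.
        while (!try_lock())
            sched_yield();
    }
    void unlock() { word.store(0, std::memory_order_release); }
};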
Yep, those are locks
Bowtie2 2.3.1 has become even worse.
This is the 'perf report' log. 90% of the CPU is wasted in _raw_spin_lock. Why do we need a spin lock in a user-space application?
Overhead Command Shared Object Symbol
90.22% bowtie2-align-s [kernel.kallsyms] [k] _raw_spin_lock
1.82% swapper [kernel.kallsyms] [k] intel_idle
0.75% bowtie2-align-s bowtie2-align-s [.] PatternSourcePerThread::nextReadPair
The patch
diff --git a/fast_mutex.h b/fast_mutex.h
index 4d4b7cc..988a764 100644
--- a/fast_mutex.h
+++ b/fast_mutex.h
@@ -36,11 +36,7 @@ freely, subject to the following restrictions:
#define _TTHREAD_PLATFORM_DEFINED_
#endif
-// Check if we can support the assembly language level implementation (otherwise
-// revert to the system API)
-#if (defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))) || \
- (defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))) || \
- (defined(__GNUC__) && (defined(__ppc__)))
+#if 0
#define _FAST_MUTEX_ASM_
#else
#define _FAST_MUTEX_SYS_
and 'make NO_TBB=1' can fix this problem in 2.3.0.
but there seem to be other problems in bowtie_main.cpp/pat.cpp/pat.h/read_qseq.cpp in 2.3.1.
The patch works for 2.3.2.
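For reference, the effect of the patch is to force the _FAST_MUTEX_SYS_ path, where the lock falls back to the system API (on Linux, roughly a thin wrapper around a pthread mutex), so waiters sleep in the kernel instead of spinning and yielding. A simplified sketch under that assumption, not the header verbatim:

// Rough sketch of the fallback path the patch selects (_FAST_MUTEX_SYS_):
// a plain pthread mutex, so a waiting thread blocks instead of busy-yielding.
#include <pthread.h>

class sys_fast_mutex {
    pthread_mutex_t m;
public:
    sys_fast_mutex()  { pthread_mutex_init(&m, nullptr); }
    ~sys_fast_mutex() { pthread_mutex_destroy(&m); }
    void lock()       { pthread_mutex_lock(&m); }      // blocks in the kernel
    bool try_lock()   { return pthread_mutex_trylock(&m) == 0; }
    void unlock()     { pthread_mutex_unlock(&m); }
};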
Hi.
bowtie2 uses a lot of sys %CPU. Does bowtie2 use a system call that consumes a lot of CPU?