wangyugui opened this issue 7 years ago
bowtie2 command: perl /usr/hpc-bio/bowtie/bin/bowtie2 --no-mixed --no-discordant --gbar 1000 --end-to-end -k 200 -q -X 800 -x /ssd//biowrk/juglans.hybrid.mRNA/trinity/Trinity.fa.bowtie2 -1 /biowrk/juglans.hybrid.mRNA/fastq.clean/A3/A3-2/_1.fq -2 /biowrk/juglans.hybrid.mRNA/fastq.clean/A3/A3-2/_2.fq -p 36
Hello @wangyugui,
Can you please provide the sample files, so that I can attempt to recreate this issue?
In my environment, it happens in multiple cases. It looks like a thread-switch / thread-synchronization problem, with no relation to the fastq samples.
case 1: E7, 18 cores (36 threads) x 4; bowtie2 align with 36 threads x 4 processes -> high %sys CPU (thread sync? NOT thread overload)
case 2: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 2 processes -> high %sys CPU (thread switching? thread overload)
case 3: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 1 process -> LOW %sys CPU
bowtie2 2.3.0 shows even higher %sys usage.
top - 16:22:11 up 2:15, 1 user, load average: 143.27, 140.18, 115.32
Tasks: 1106 total, 3 running, 1103 sleeping, 0 stopped, 0 zombie
%Cpu(s): 18.8 us, 80.4 sy, 0.0 ni, 0.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15852327+total, 13947386+free, 10714515+used, 83348944 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14644515+avail Mem

  PID USER PR NI    VIRT    RES  SHR S  %CPU %MEM    TIME+ COMMAND
10419 root 20  0 35.664g 0.031t 2488 R 14254  2.1 2446:58 /usr/hpc-bio/bowtie/bin/bowtie2-align-s --wrapper basic-0 --local -a +
10416 root 20  0  169052 150360  800 R  58.0  0.0 10:02.81 samtools view -F4 -Su -
10417 root 20  0 60.880g 0.059t  740 S   7.5  4.0  1:26.52 samtools sort -m 5368709120 -@ 144 -no - -
10752 root 20  0   57272   3292 1568 R   1.0  0.0  0:00.11 /bin/top -i
 1200 root 20  0       0      0    0 S   0.3  0.0  0:00.09 [kworker/79:1]
I ran bowtie2's aligner on the HG19 genome and collected statistics on the system calls using strace. bowtie2's most common system call is read, which is expected. When bowtie2 2.3 is run with the maximum number of threads available, it creates contention with other OS and process threads. This results in increased calls to sched_yield.
Why do we need sched_yield? sched_yield triggers a task switch (thread switch), and task switches use a lot of CPU.
sched_yield is designed for real-time scheduling only. We should not use it in bowtie2.
sched_yield is one of the reasons why %sys CPU is high in some cases only.
If there are no other runnable processes, sched_yield may NOT cause a thread switch, but under heavy load sched_yield will always cause a thread switch, and then CPU is wasted.
case 2: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 2 processes -> high %sys CPU (thread switching? thread overload)
case 3: E5, 12 cores (24 threads) x 2; bowtie2 align with 48 threads x 1 process -> LOW %sys CPU
This looks like a BUG in bowtie2's threading. The CPU is fully used, but much of it is wasted. If we fix this problem, we could improve performance by about 3x in some cases.
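For illustration only (not bowtie2 code; all names and numbers here are made up): a minimal C++ program that launches more threads than hardware threads and makes them contend on one lock. With the yield-based worker, every failed lock attempt calls sched_yield and turns into a context switch, which shows up as %sys in top; with the std::mutex worker, waiters block in the kernel and %sys stays low.

// Minimal sketch (not bowtie2 code): a yield-based spin lock versus a
// blocking std::mutex under oversubscription. Illustrative only.
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>
#include <sched.h>   // sched_yield()

std::atomic_flag spin = ATOMIC_FLAG_INIT;
std::mutex       blocking;
long             counter = 0;

// Spin-then-yield: if the lock is held, give up the time slice and retry.
// Under heavy load every failed attempt becomes a context switch (%sys).
void yield_worker(int iters) {
    for (int i = 0; i < iters; ++i) {
        while (spin.test_and_set(std::memory_order_acquire))
            sched_yield();                     // the pattern in question
        ++counter;                             // trivial critical section
        spin.clear(std::memory_order_release);
    }
}

// Blocking mutex: a waiting thread sleeps in the kernel instead of
// yielding repeatedly, so contention costs far less CPU.
void mutex_worker(int iters) {
    for (int i = 0; i < iters; ++i) {
        std::lock_guard<std::mutex> g(blocking);
        ++counter;
    }
}

int main() {
    // Launch 2x the hardware threads to mimic "48 threads on 24 hw threads".
    unsigned n = 2 * std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(yield_worker, 100000);   // swap in mutex_worker to compare
    for (std::thread& t : pool) t.join();
    return 0;
}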
[root@R930 ~]# opreport -l
Using /usr/hpc-bio/oprofile_data/samples/ for samples directory.
WARNING: Lost samples detected! See /usr/hpc-bio/oprofile_data/samples/operf.log for details.
warning: /no-vmlinux could not be found.
CPU: Intel Broadwell microarchitecture, speed 3.3e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples   %        image name       symbol name
1869492   59.3552  no-vmlinux       /no-vmlinux
 306897    9.7438  bowtie2-align-s  SwAligner::alignNucleotidesEnd2EndSseU8(int&, bool)
 185116    5.8773  libc-2.17.so     sched_yield
  92550    2.9384  bowtie2-align-s  RedundantAlns::add(AlnRes const&)
I disabled _FAST_MUTEX_ASM_ and TBB, and the %sys time became low.
%CPU is only 2355 with -p 36, but no CPU is wasted in system calls.
So we need to improve it in another way, such as changing the locking from once per fastq record to once per set of fastq records (see the sketch after the top snapshot below).
%Cpu(s): 16.9 us, 0.2 sy, 0.0 ni, 82.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15852327+total, 15108259+free, 11641760 used, 62765064 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 15599888+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20146 root 20 0 6462724 1.417g 2060 S 2355 0.1 100:40.65 bowtie2-align-s
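To make the "one set of records per lock" idea concrete, here is a rough sketch (hypothetical names only, not bowtie2's actual reader): each worker takes the input lock once per batch of FASTQ records instead of once per record, so the lock is contended far less often.

// Sketch only (hypothetical names, not bowtie2's API): take the input lock
// once per batch of FASTQ records instead of once per record.
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex    in_lock;
std::ifstream fastq("reads.fq");              // hypothetical input file

// Read up to `batch` 4-line FASTQ records while holding the lock once.
std::vector<std::string> next_batch(size_t batch) {
    std::vector<std::string> records;
    std::lock_guard<std::mutex> g(in_lock);
    for (size_t r = 0; r < batch; ++r) {
        std::string rec, line;
        for (int i = 0; i < 4 && std::getline(fastq, line); ++i)
            rec += line + '\n';
        if (rec.empty()) break;               // end of input
        records.push_back(std::move(rec));
    }
    return records;                           // align these outside the lock
}

void worker() {
    for (;;) {
        std::vector<std::string> batch = next_batch(64);  // 64 reads per lock acquisition
        if (batch.empty()) break;
        // ... align each record here, with no lock held ...
    }
}

int main() {
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();
    return 0;
}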
Thanks for these reports. Improving thread scaling is something we're actively working on.
Note that sched_yield is not called by Bowtie 2 directly -- it's probably being called because some global lock (e.g. memory allocation) is contended.
I found sched_yield in the bowtie2 source and in the TBB library.
# grep sched_yield -nr .
./fast_mutex.h:127: sched_yield();
./tinythread.h:686: sched_yield();
# readelf -s /usr/lib64/libtbb.so.2 |grep yield
25: 0000000000000000 0 FUNC GLOBAL DEFAULT UND sched_yield@GLIBC_2.2.5 (3)
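For context, those hits are inside a user-space spin lock that calls sched_yield whenever it fails to take the lock. A simplified sketch of the pattern (not a verbatim copy of fast_mutex.h or tinythread.h):

// Simplified sketch of the spin-then-yield locking pattern; not a verbatim
// copy of fast_mutex.h or tinythread.h.
#include <atomic>
#include <sched.h>

class yielding_spinlock {
    std::atomic<int> word{0};                 // 0 = free, 1 = held
public:
    bool try_lock() { return word.exchange(1, std::memory_order_acquire) == 0; }
    void lock() {
        // Every failed attempt gives up the time slice; under load each
        // sched_yield() call is a context switch charged to %sys.
        while (!try_lock())
            sched_yield();
    }
    void unlock() { word.store(0, std::memory_order_release); }
};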
Yep, those are locks
Bowtie2 2.3.1 has become even worse.
This is the 'perf report' log. 90% of the CPU is wasted in _raw_spin_lock. Why do we need a spin lock in a user-space application?
Overhead Command Shared Object Symbol
90.22% bowtie2-align-s [kernel.kallsyms] [k] _raw_spin_lock
1.82% swapper [kernel.kallsyms] [k] intel_idle
0.75% bowtie2-align-s bowtie2-align-s [.] PatternSourcePerThread::nextReadPair
The patch
diff --git a/fast_mutex.h b/fast_mutex.h
index 4d4b7cc..988a764 100644
--- a/fast_mutex.h
+++ b/fast_mutex.h
@@ -36,11 +36,7 @@ freely, subject to the following restrictions:
#define _TTHREAD_PLATFORM_DEFINED_
#endif
-// Check if we can support the assembly language level implementation (otherwise
-// revert to the system API)
-#if (defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))) || \
- (defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))) || \
- (defined(__GNUC__) && (defined(__ppc__)))
+#if 0
#define _FAST_MUTEX_ASM_
#else
#define _FAST_MUTEX_SYS_
and 'make NO_TBB=1' can fix this problem in 2.3.0.
but there seem to be other problems in bowtie_main.cpp/pat.cpp/pat.h/read_qseq.cpp in 2.3.1.
The patch works for 2.3.2.
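For reference, the effect of the patch is to force the _FAST_MUTEX_SYS_ path, where the lock falls back to the system API (on Linux, roughly a thin wrapper around a pthread mutex), so waiters sleep in the kernel instead of spinning and yielding. A simplified sketch under that assumption, not the header verbatim:

// Rough sketch of the fallback path the patch selects (_FAST_MUTEX_SYS_):
// a plain pthread mutex, so a waiting thread blocks instead of busy-yielding.
#include <pthread.h>

class sys_fast_mutex {
    pthread_mutex_t m;
public:
    sys_fast_mutex()  { pthread_mutex_init(&m, nullptr); }
    ~sys_fast_mutex() { pthread_mutex_destroy(&m); }
    void lock()       { pthread_mutex_lock(&m); }      // blocks in the kernel
    bool try_lock()   { return pthread_mutex_trylock(&m) == 0; }
    void unlock()     { pthread_mutex_unlock(&m); }
};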
Hi.
bowtie2 uses a lot of sys %CPU. Does bowtie2 use a system call that consumes a lot of CPU?