Closed xhebox closed 1 year ago
It seems the difference is that with BUSY_WAIT, yield is called less frequently, and on x86, thread::yield is replaced with __builtin_ia32_pause().
Yes. It also merges many atomic ops into a single op, uses a more relaxed memory order, reduces branches, and keeps the loop as compact as possible.
Also, (de)activate will only be called when waiting for user input.
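To make the points about a compact loop, a relaxed memory order, and the x86 pause builtin concrete, here is a minimal sketch. It is not the code from this repository: `SPIN_PAUSE`, `spin_until_changed`, and `run_spin_demo` are hypothetical names, and the fallback to `std::this_thread::yield()` on other targets is an assumption. The loop polls one atomic with a relaxed load and pays for acquire ordering only once, after the value changes.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Assumption: GCC/Clang on x86 provide __builtin_ia32_pause(); elsewhere fall back to an OS yield.
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#  define SPIN_PAUSE() __builtin_ia32_pause()
#else
#  define SPIN_PAUSE() std::this_thread::yield()
#endif

// Hypothetical helper: spin until `flag` no longer equals `seen`, then return the new value.
// The loop body is one relaxed load, one compare, and one pause.
inline int spin_until_changed(const std::atomic<int>& flag, int seen) {
    assert(flag.is_lock_free());
    int now;
    while ((now = flag.load(std::memory_order_relaxed)) == seen) {
        SPIN_PAUSE();
    }
    // One acquire fence after the loop pairs with the writer's release store.
    std::atomic_thread_fence(std::memory_order_acquire);
    return now;
}

// Hypothetical demo: the worker waits for generation 1, then reads the payload.
inline int run_spin_demo() {
    std::atomic<int> generation{0};
    int payload = 0;
    int observed = -1;
    std::thread worker([&] {
        spin_until_changed(generation, 0);
        observed = payload;  // safe: ordered after the release store below
    });
    payload = 42;
    generation.store(1, std::memory_order_release);  // single atomic op publishes the work
    worker.join();
    return observed;
}
```

Calling `run_spin_demo()` returns the payload the worker saw. Whether relaxed polling plus one fence is measurably faster than an acquire load per iteration depends on the target, so it is worth benchmarking rather than assuming.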
Is there any other well-implemented thread pool that we can learn from? I benchmarked the int4 matmul: with 2 threads the result is not stable, but with only one thread the result is stable. I also think the thread pool may have some bugs.
https://github.com/bshoshany/thread-pool ? I am not an active cpp programmer. But this library seems promising.
# if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
# if !defined(__SSE2__)
static inline void cv_non_sse_mm_pause() { __asm__ __volatile__ ("rep; nop"); }
# define _mm_pause cv_non_sse_mm_pause
# else
# include <immintrin.h>
# endif
# define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { _mm_pause(); } } while (0)
# elif defined __GNUC__ && defined __aarch64__
# define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("yield" ::: "memory"); } } while (0)
# elif defined __GNUC__ && defined __arm__
# define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("" ::: "memory"); } } while (0)
# elif defined __GNUC__ && defined __riscv
// PAUSE HINT is not part of RISC-V ISA yet, but is under discussion now. For details see:
// https://github.com/riscv/riscv-isa-manual/pull/398
// https://github.com/riscv/riscv-isa-manual/issues/43
// # define CV_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("pause"); } } while (0)
# define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("nop"); } } while (0)
# else
# warning "Can't detect 'pause' (CPU-yield) instruction on the target platform. Specify MTDA_PAUSE() definition via compiler flags."
# define MTDA_PAUSE(...) do { /* no-op: works, but not effective */ } while (0)
# endif
These are the CPU-level yield instructions; they are lighter than an OS-level yield (thread::yield()).
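One way a macro like `MTDA_PAUSE` could be combined with the OS-level yield is a bounded backoff: pause on the CPU for a while and only then hand the core back to the scheduler. The sketch below is an assumption about usage, not code from any library mentioned here. `wait_with_backoff` and `max_pause` are hypothetical, and the `#ifndef` fallback exists only so the snippet compiles without the definitions above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#ifndef MTDA_PAUSE
// Fallback for standalone compilation only: a compiler barrier, not a real pause instruction.
#  define MTDA_PAUSE(v) do { for (int d_ = (v); d_ > 0; --d_) { \
       std::atomic_signal_fence(std::memory_order_seq_cst); } } while (0)
#endif

// Hypothetical policy: double the pause count up to `max_pause`, then fall back to OS yields.
// Returns how many OS-level yields were needed.
inline int wait_with_backoff(const std::atomic<bool>& ready, int max_pause = 64) {
    assert(max_pause > 0);
    int os_yields = 0;
    int pause = 1;
    while (!ready.load(std::memory_order_acquire)) {
        if (pause <= max_pause) {
            MTDA_PAUSE(pause);          // cheap: stays on the core
            pause *= 2;
        } else {
            std::this_thread::yield();  // expensive: goes through the scheduler
            ++os_yields;
        }
    }
    return os_yields;
}
```

If the flag is already set, the function returns 0 without pausing or yielding. The cap of 64 is an arbitrary placeholder and would need tuning per workload.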
https://github.com/bshoshany/thread-pool/blob/master/include/BS_thread_pool.hpp#L628 There are locks there, so I worry about its performance.
Oops. Maybe something like a coroutine pool would be better...?
Anyway, executing all tasks dynamically seems like a bad idea for performance. I am thinking we could find a library that lets us submit all tasks first and execute them later.
EDIT:
https://github.com/dougbinks/enkiTS/blob/master/example/Dependencies.cpp this lib seems interesting to me.
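The "submit all tasks first, execute later" idea can be sketched without any library. The example below is a hypothetical illustration, not the API of enkiTS or BS::thread_pool. `TaskBatch` and `run_batch_demo` are made-up names, and static striding is just one possible scheduling choice. Because the task list is fixed before execution starts, the workers need no lock or queue while running.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical class: collect tasks up front, then run them with a fixed assignment.
class TaskBatch {
public:
    void submit(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    // Thread t runs tasks t, t + n, t + 2n, ...; the calling thread takes slice 0.
    void run(std::size_t n_threads) {
        assert(n_threads > 0);
        std::vector<std::thread> workers;
        for (std::size_t t = 1; t < n_threads; ++t) {
            workers.emplace_back([this, t, n_threads] { run_slice(t, n_threads); });
        }
        run_slice(0, n_threads);
        for (auto& w : workers) w.join();
        tasks_.clear();
    }

private:
    void run_slice(std::size_t first, std::size_t stride) {
        for (std::size_t i = first; i < tasks_.size(); i += stride) tasks_[i]();
    }
    std::vector<std::function<void()>> tasks_;
};

// Hypothetical demo: each task writes its own slot, so tasks never touch shared data.
inline long run_batch_demo() {
    std::vector<long> out(8, 0);
    TaskBatch batch;
    for (std::size_t i = 0; i < out.size(); ++i) {
        batch.submit([&out, i] { out[i] = static_cast<long>(i * i); });
    }
    batch.run(2);
    long sum = 0;
    for (long v : out) sum += v;
    return sum;  // 0 + 1 + 4 + 9 + 16 + 25 + 36 + 49 = 140
}
```

This sketch spawns threads on every `run`, which a real pool would avoid by keeping workers alive between batches. It also ignores task dependencies, which is the part the enkiTS example linked above addresses.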