MegEngine / InferLLM

a lightweight LLM model inference framework

new thread worker test #62

Closed. xhebox closed this issue 1 year ago.

chenqy4933 commented 1 year ago

It seems that the difference is: with BUSY_WAIT the frequency of calling yield is not that high, and on x86, thread::yield is replaced with __builtin_ia32_pause().

xhebox commented 1 year ago

It seems that the difference is: with BUSY_WAIT the frequency of calling yield is not that high, and on x86, thread::yield is replaced with __builtin_ia32_pause().

Yes. And replacing many atomic ops with a single op, using a more relaxed memory order, reducing branches, and making the loop as compact as possible.

Also, (de)activate will only be called when waiting for user input.
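
For illustration, here is a minimal sketch of the kind of busy-wait loop described above: a CPU-level pause on x86, a single relaxed atomic load per spin iteration, and an acquire fence only once new work is observed. The names (Worker, task_id, SPIN_PAUSE) are illustrative, not InferLLM's actual code.

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

#if defined(__i386__) || defined(__x86_64__)
#  define SPIN_PAUSE() __builtin_ia32_pause()
#else
#  define SPIN_PAUSE() std::this_thread::yield()
#endif

// The worker watches a single task counter instead of several flags,
// so one relaxed load per spin iteration is enough.
struct Worker {
    std::atomic<uint32_t> task_id{0};
    std::atomic<uint32_t> done{0};
    std::atomic<bool> stop{false};

    void run() {
        uint32_t seen = 0;
        while (!stop.load(std::memory_order_relaxed)) {
            uint32_t cur = task_id.load(std::memory_order_relaxed);
            if (cur == seen) {
                SPIN_PAUSE();  // CPU-level pause, not an OS-level yield
                continue;
            }
            // Synchronize with the submitter only when new work is seen.
            std::atomic_thread_fence(std::memory_order_acquire);
            // ... run task `cur` here ...
            seen = cur;
            done.store(cur, std::memory_order_release);
        }
    }
};

int main() {
    Worker w;
    std::thread t([&] { w.run(); });
    w.task_id.store(1, std::memory_order_release);  // submit task 1
    while (w.done.load(std::memory_order_acquire) != 1) SPIN_PAUSE();
    w.stop.store(true, std::memory_order_relaxed);
    t.join();
    std::puts("task done");
}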

chenqy4933 commented 1 year ago

Is there any other well-implemented thread pool that we can learn from? Tomorrow I will benchmark the int4 matmul. With 2 threads the result is not stable, but with only one thread it is stable, so I also think the thread pool may have some bugs.

xhebox commented 1 year ago

Is there any other well-implemented thread pool that we can learn from? Tomorrow I will benchmark the int4 matmul. With 2 threads the result is not stable, but with only one thread it is stable, so I also think the thread pool may have some bugs.

https://github.com/bshoshany/thread-pool ? I am not an active C++ programmer, but this library seems promising.

chenqy4933 commented 1 year ago
# if defined __GNUC__ && (defined __i386__ || defined __x86_64__)
#   if !defined(__SSE2__)
      static inline void cv_non_sse_mm_pause() { __asm__ __volatile__ ("rep; nop"); }
#     define _mm_pause cv_non_sse_mm_pause
#   else
#       include <immintrin.h>
#   endif
#   define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { _mm_pause(); } } while (0)
# elif defined __GNUC__ && defined __aarch64__
#   define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("yield" ::: "memory"); } } while (0)
# elif defined __GNUC__ && defined __arm__
#   define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("" ::: "memory"); } } while (0)
# elif defined __GNUC__ && defined __riscv
// PAUSE HINT is not part of RISC-V ISA yet, but is under discussion now. For details see:
// https://github.com/riscv/riscv-isa-manual/pull/398
// https://github.com/riscv/riscv-isa-manual/issues/43
// #   define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("pause"); } } while (0)
#   define MTDA_PAUSE(v) do { for (int __delay = (v); __delay > 0; --__delay) { asm volatile("nop"); } } while (0)
# else
#   warning "Can't detect 'pause' (CPU-yield) instruction on the target platform. Specify MTDA_PAUSE() definition via compiler flags."
#   define MTDA_PAUSE(...) do { /* no-op: works, but not effective */ } while (0)
# endif

These are CPU-level yield instructions; they are much lighter-weight than an OS-level yield (std::this_thread::yield()).
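
As a hypothetical usage sketch (it assumes the MTDA_PAUSE macro above and an illustrative g_ready flag), a spin-wait can grow the pause count so the waiter backs off gradually instead of hammering the shared cache line:

#include <atomic>

extern std::atomic<bool> g_ready;  // illustrative flag set by another thread

inline void spin_wait_until_ready() {
    int delay = 1;
    while (!g_ready.load(std::memory_order_acquire)) {
        MTDA_PAUSE(delay);   // issue `delay` pause/yield instructions
        if (delay < 1024)
            delay <<= 1;     // exponential backoff, capped
    }
}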

chenqy4933 commented 1 year ago

Is there any other well-implemented thread pool that we can learn from? Tomorrow I will benchmark the int4 matmul. With 2 threads the result is not stable, but with only one thread it is stable, so I also think the thread pool may have some bugs.

https://github.com/bshoshany/thread-pool ? I am not an active C++ programmer, but this library seems promising.

https://github.com/bshoshany/thread-pool/blob/master/include/BS_thread_pool.hpp#L628 there are locks here, and I worry about its performance.

xhebox commented 1 year ago

https://github.com/bshoshany/thread-pool/blob/master/include/BS_thread_pool.hpp#L628 there are locks here, and I worry about its performance.

Oops. Maybe something like a coroutine pool would be better...?

Anyway, executing all tasks dynamically seems like a bad idea for performance. I am thinking that we could just find a library where we can submit all tasks first and execute them later.

EDIT:

https://github.com/dougbinks/enkiTS/blob/master/example/Dependencies.cpp this lib seems interesting to me.
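
For reference, the submit-first, execute-later pattern with enkiTS looks roughly like the task-set example in its README; the work inside the lambda here is only a stand-in for a block of the matmul:

#include <atomic>
#include <cstdio>
#include "TaskScheduler.h"

std::atomic<uint64_t> g_sum{0};

int main() {
    enki::TaskScheduler ts;
    ts.Initialize();

    // All 1024 work items are described up front; worker threads then pull
    // partitions of the range and execute them in parallel.
    enki::TaskSet task(1024, [](enki::TaskSetPartition range, uint32_t threadnum) {
        uint64_t local = 0;
        for (uint32_t i = range.start; i < range.end; ++i)
            local += i;  // stand-in for one block of work
        g_sum.fetch_add(local, std::memory_order_relaxed);
    });

    ts.AddTaskSetToPipe(&task);  // submit everything first
    ts.WaitforTask(&task);       // then execute/wait
    std::printf("sum = %llu\n", (unsigned long long)g_sum.load());
    return 0;
}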