TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License
5.73k stars 584 forks source link

Adapt to PPC64LE CPU architecure #652

Open pengbohua opened 11 months ago

pengbohua commented 11 months ago

Hi, Tim. Thank you for bring such a cool quantization tool to the community. However, I met a issue when trying to compile bnb from source code on PPC64LE CPUs of my school.

The errors regarding x86_intrinsics which can be disabled via the flag (-DNO_WARN_X86_INTRINSICS) used, are there for a reason unfortunately, as the intrinsics which result in errors later on are not available in the GCC/G++ build on ppc64le CPUs.

The warning refers to a comment in the header file, which reads as follows:

#ifndef NO_WARN_X86_INTRINSICS 
/* This header is distributed to simplify porting x86_64 code that 
   makes explicit use of Intel intrinsics to powerpc64le. 
   It is the user's responsibility to determine if the results are 
   acceptable and make additional changes as necessary. 
   Note that much code that uses Intel intrinsics can be rewritten in 
   standard C or GNU C extensions, which are more portable and better 
   optimized across multiple targets. 

   In the specific case of X86 MMX (__m64) intrinsics, the PowerPC 
   target does not support a native __vector_size__ (8) type.  Instead 
   we typedef __m64 to a 64-bit unsigned long long, which is natively 
   supported in 64-bit mode.  This works well for the _si64 and some 
   _pi32 operations, but starts to generate long sequences for _pi16 
   and _pi8 operations.  For those cases it better (faster and 
   smaller code) to transfer __m64 data to the PowerPC vector 128-bit 
   unit, perform the operation, and then transfer the result back to 
   the __m64 type. This implies that the direct register move 
   instructions, introduced with power8, are available for efficient 
   implementation of these transfers. 

   Most MMX intrinsic operations can be performed efficiently as 
   C language 64-bit scalar operation or optimized to use the newer 
   128-bit SSE/Altivec operations.  We recomend this for new 
   applications.  */ 
#error "Please read comment above.  Use -DNO_WARN_X86_INTRINSICS to disable this error." 
#endif

The bitsandbytes code, inside include/SIMD.h and include/Algo-Direct2.h use these intrinsics, which are not portable across CPU architectures. These are included by BinSearch.h, which is included by common.h, cpu_ops.cpp and ops.cu. without any CPU architecture guarding. The PPC64LE version(s) of GCC etc do attempt to automatically convert these intrinsics to the PPC64LE equivalents when DNO_WARN_X86_INTRINSICS is set, but unfortunately this does not succeed for this code, and it will only work on x86_64 CPUs.

For this to build on PPC64LE, this will require changes to the source code, to either replace these intrinsics with portable versions which are PPC64LE compatible, or guard them out with macros so they can be disabled, and a PPC64LE compatible alternative put in place.

STAH commented 6 months ago

Hello,

Any chance to have PPC64LE support for this library?

Thanks you

ma-chengyuan commented 6 months ago

I ran into the same issue. It seems that there are some redundant includes in ops.cu. I was able to build after commenting out these lines (and after setting NO_WARN_X86_INTRINSICS):

diff --git a/csrc/ops.cu b/csrc/ops.cu
index 9776121..1137380 100644
--- a/csrc/ops.cu
+++ b/csrc/ops.cu
@@ -7,12 +7,12 @@
 #include <kernels.cuh>
 #include <cub/device/device_scan.cuh>
 #include <limits>
-#include <BinSearch.h>
+// #include <BinSearch.h>
 #include <cassert>
-#include <common.h>
+// #include <common.h>

-using namespace BinSearch;
+// using namespace BinSearch;
 using std::cout;
 using std::endl;

While it builds, I'm still testing if the built binary works as intended.

mgiessing commented 6 months ago

Not 100% sure, but I guess we have a similar issue to what is describe here https://github.com/dealii/dealii/issues/7328#issuecomment-429466699 and the reason here: https://github.com/dealii/dealii/issues/7328#issuecomment-444507314

Using -DNO_WARN_X86_INTRINSICS should translate Intel intrinsic code to VSX code (which it does), but throwing a lot of identifier "__builtin_*" is undefined errors:

[...]
/opt/at15.0/lib/gcc/powerpc64le-linux-gnu/11.4.1/include/mmintrin.h(178): error: too many initializer values
          { __m1, __m2 };
                  ^

/opt/at15.0/lib/gcc/powerpc64le-linux-gnu/11.4.1/include/mmintrin.h(182): error: identifier "__builtin_vec_packs" is undefined
    __vresult = __builtin_vec_packs (__vm1, __vm1);
                ^

/opt/at15.0/lib/gcc/powerpc64le-linux-gnu/11.4.1/include/mmintrin.h(183): error: expression must have pointer-to-object type but it has type "int"
    return (__m64) ((__attribute__((altivec(vector__))) long long) __vresult)[0];
                                                                              ^

/opt/at15.0/lib/gcc/powerpc64le-linux-gnu/11.4.1/include/mmintrin.h(203): error: too many initializer values
          { __m1, __m2 };
                  ^

/opt/at15.0/lib/gcc/powerpc64le-linux-gnu/11.4.1/include/mmintrin.h(207): error: identifier "__builtin_vec_packs" is undefined
    __vresult = __builtin_vec_packs (__vm1, __vm1);
                ^

[...]

If I understand this correct nvcc does not know about alitvec and should therefore not see any SIMD code.

I'm wondering more about the fact that include/SIMD.h and include/Algo-Direct2.h are using x86 intrinsics, but there is no scalar code for this as a fallback solution for unsupported platforms such as ppc64le or aarch64

Does it make sense to create a PR using scalar code or am I missing something?

WiedPakusa commented 6 months ago

I would also appreciate a solution in this regard.