HanGuo97 / flute

Fast Matrix Multiplications for Lookup Table-Quantized LLMs
https://arxiv.org/abs/2407.10960
Apache License 2.0
188 stars 6 forks source link

How to build flute from scratch? #1

Closed LeiWang1999 closed 4 months ago

LeiWang1999 commented 4 months ago

Hi, all. I've noticed that the csrc of flute depends on cute, but it's not present in the submodules. Could you advise on which version of cutlass I should use, and would it be possible to provide the scripts to build Flute from scratch?

LeiWang1999 commented 4 months ago

by exporting necessary env:

export CPLUS_INCLUDE_PATH=/root/cutlass/include
export TORCH_CUDA_ARCH_LIST="8.0"

I'm still encountering some bugs:

/root/flute/flute/csrc/packbits_utils.hpp(312): error: static assertion failed
          detected during:
            instantiation of "void packbits_utils::DequantizationTraits<SourceEngine, SourceLayout, SourceEngine2, SourceLayout2, TargetEngine, TargetLayout, ScaleEngine, ScaleLayout, QuantMapEngine, QuantMapLayout, QuantMapEngine2, QuantMapLayout2, QuantMapEngine3, QuantMapLayout3, cute::Int<3>, QuantMapMode>::apply(const cute::Tensor<SourceEngine, SourceLayout> &, const cute::Tensor<SourceEngine2, SourceLayout2> &, cute::Tensor<TargetEngine, TargetLayout> &, const cute::Tensor<ScaleEngine, ScaleLayout> &, const cute::Tensor<QuantMapEngine, QuantMapLayout> &, const cute::Tensor<QuantMapEngine2, QuantMapLayout2> &, const cute::Tensor<QuantMapEngine3, QuantMapLayout3> &) [with SourceEngine=cute::ViewEngine<uint16_t *>, SourceLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<1>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::C<0>>>, SourceEngine2=cute::ViewEngine<uint16_t *>, SourceLayout2=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<2>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, TargetEngine=cute::ViewEngine<cutlass::half_t *>, TargetLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<16>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, ScaleEngine=cute::ViewEngine<cutlass::half_t *>, ScaleLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::tuple<cute::C<2>, cute::C<2>, cute::_2, cute::C<2>>>, cute::tuple<cute::tuple<cute::_0, cute::_0>, cute::tuple<cute::_8, cute::C<16>, cute::C<32>, cute::C<4>>>>, QuantMapEngine=cute::ViewEngine<cute::smem_ptr<cutlass::half_t *>>, QuantMapLayout=cute::Layout<cute::tuple<cute::_8>, cute::tuple<cute::_1>>, QuantMapEngine2=cute::ViewEngine<cute::smem_ptr<__half2 *>>, QuantMapLayout2=cute::Layout<cute::tuple<cute::_64>, cute::tuple<cute::_1>>, QuantMapEngine3=cute::ArrayEngine<cutlass::half_t, 1>, QuantMapLayout3=cute::Layout<cute::tuple<cute::_1>, cute::tuple<cute::C<0>>>, QuantMapMode=config::QuantMapModeEnum::Vectorized]" 
(410): here
            instantiation of "void packbits_utils::dequantize<QuantMapMode,SourceEngine,SourceLayout,SourceEngine2,SourceLayout2,TargetEngine,TargetLayout,ScaleEngine,ScaleLayout,QuantMapEngine,QuantMapLayout,QuantMapEngine2,QuantMapLayout2,QuantMapEngine3,QuantMapLayout3,NumBits>(const cute::Tensor<SourceEngine, SourceLayout> &, const cute::Tensor<SourceEngine2, SourceLayout2> &, cute::Tensor<TargetEngine, TargetLayout> &&, const cute::Tensor<ScaleEngine, ScaleLayout> &, const cute::Tensor<QuantMapEngine, QuantMapLayout> &, const cute::Tensor<QuantMapEngine2, QuantMapLayout2> &, const cute::Tensor<QuantMapEngine3, QuantMapLayout3> &, NumBits) [with QuantMapMode=config::QuantMapModeEnum::Vectorized, SourceEngine=cute::ViewEngine<uint16_t *>, SourceLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<1>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::C<0>>>, SourceEngine2=cute::ViewEngine<uint16_t *>, SourceLayout2=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<2>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, TargetEngine=cute::ViewEngine<cutlass::half_t *>, TargetLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<16>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, ScaleEngine=cute::ViewEngine<cutlass::half_t *>, ScaleLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::tuple<cute::C<2>, cute::C<2>, cute::_2, cute::C<2>>>, cute::tuple<cute::tuple<cute::_0, cute::_0>, cute::tuple<cute::_8, cute::C<16>, cute::C<32>, cute::C<4>>>>, QuantMapEngine=cute::ViewEngine<cute::smem_ptr<cutlass::half_t *>>, QuantMapLayout=cute::Layout<cute::tuple<cute::_8>, cute::tuple<cute::_1>>, QuantMapEngine2=cute::ViewEngine<cute::smem_ptr<__half2 *>>, QuantMapLayout2=cute::Layout<cute::tuple<cute::_64>, cute::tuple<cute::_1>>, QuantMapEngine3=cute::ArrayEngine<cutlass::half_t, 1>, QuantMapLayout3=cute::Layout<cute::tuple<cute::_1>, cute::tuple<cute::C<0>>>, NumBits=cute::C<3>]" 
/root/cutlass/include/cute/algorithm/tuple_algorithms.hpp(185): here
            instantiation of "void cute::for_each(T &&, F &&) [with T=std::integer_sequence<int, 0, 1, 2, 3>, F=lambda [](auto)->auto]" 
/root/flute/flute/csrc/qgemm_kernel.hpp(714): here
            instantiation of "void qgemm_device<Config,TileScheduler,T,TQ,T2>(const T *, const TQ *, T *, const T *, const T *, const T2 *, void *, TileScheduler) [with Config=config::GemmConfig<cutlass::half_t, uint16_t, cute::C<0>, cute::C<84>, cute::C<128>, cute::C<16>, cute::C<64>, cute::C<32>, cute::C<3>, cute::C<3>, cute::C<32>, config::QuantMapModeEnum::Vectorized, config::AccumulationModeEnum::Mixed, config::DecompositionModeEnum::StreamK, cute::C<2>, cute::C<1>>, TileScheduler=config::TileScheduler<config::GemmConfig<cutlass::half_t, uint16_t, cute::C<0>, cute::C<84>, cute::C<128>, cute::C<16>, cute::C<64>, cute::C<32>, cute::C<3>, cute::C<3>, cute::C<32>, config::QuantMapModeEnum::Vectorized, config::AccumulationModeEnum::Mixed, config::DecompositionModeEnum::StreamK, cute::C<2>, cute::C<1>>>, T=cutlass::half_t, TQ=uint16_t, T2=__half2]" 
/root/flute/flute/csrc/qgemm_kernel.hpp(881): here
            instantiation of "void qgemm_host<T,TQ,T2,Slices,Blocks,Threads,TileM,TileK,TileP,Stages,NumBits,GroupSize,QuantMapMode,AccumulationMode,DecompositionMode,G2STiledCopySizeS,MmaPrmK>(int, int, int, int, const T *, const TQ *, T *, const T *, const T *, const T2 *, void *, cudaStream_t) [with T=cutlass::half_t, TQ=uint16_t, T2=__half2, Slices=cute::C<0>, Blocks=cute::C<84>, Threads=cute::C<128>, TileM=cute::C<16>, TileK=cute::C<64>, TileP=cute::C<32>, Stages=cute::C<3>, NumBits=cute::C<3>, GroupSize=cute::C<32>, QuantMapMode=config::QuantMapModeEnum::Vectorized, AccumulationMode=config::AccumulationModeEnum::Mixed, DecompositionMode=config::DecompositionModeEnum::StreamK, G2STiledCopySizeS=cute::C<2>, MmaPrmK=cute::C<1>]" 
/root/flute/flute/csrc/qgemm_kernel_generated.cu(106): here
            instantiation of "void _qgemm<SMs,T,TQ,T2,NumBits,GroupSize>(int, int, int, int, const T *, const TQ *, T *, const T *, const T *, const T2 *, void *, cudaStream_t) [with SMs=cute::C<84>, T=cutlass::half_t, TQ=uint16_t, T2=__half2, NumBits=cute::C<3>, GroupSize=cute::C<32>]" 
/root/flute/flute/csrc/qgemm_kernel_generated.cu(4081): here
HanGuo97 commented 4 months ago

Hi, the README was updated to include the CUTLASS version (v3.4.1), sorry for the confusion!

The Github Actions section might provide (slightly) more details. But after getting the right version of CUTLASS, I hope the rest is very straightforward.

LeiWang1999 commented 4 months ago

Thanks @HanGuo97 , amazing project and a great example of cute!

HanGuo97 commented 4 months ago

Thanks for the kind words! Obviously, we benefited a lot from your project + helps as well --- highly appreciated!