Closed LeiWang1999 closed 4 months ago
by exporting necessary env:
export CPLUS_INCLUDE_PATH=/root/cutlass/include
export TORCH_CUDA_ARCH_LIST="8.0"
I'm still encountering some bugs:
/root/flute/flute/csrc/packbits_utils.hpp(312): error: static assertion failed
detected during:
instantiation of "void packbits_utils::DequantizationTraits<SourceEngine, SourceLayout, SourceEngine2, SourceLayout2, TargetEngine, TargetLayout, ScaleEngine, ScaleLayout, QuantMapEngine, QuantMapLayout, QuantMapEngine2, QuantMapLayout2, QuantMapEngine3, QuantMapLayout3, cute::Int<3>, QuantMapMode>::apply(const cute::Tensor<SourceEngine, SourceLayout> &, const cute::Tensor<SourceEngine2, SourceLayout2> &, cute::Tensor<TargetEngine, TargetLayout> &, const cute::Tensor<ScaleEngine, ScaleLayout> &, const cute::Tensor<QuantMapEngine, QuantMapLayout> &, const cute::Tensor<QuantMapEngine2, QuantMapLayout2> &, const cute::Tensor<QuantMapEngine3, QuantMapLayout3> &) [with SourceEngine=cute::ViewEngine<uint16_t *>, SourceLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<1>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::C<0>>>, SourceEngine2=cute::ViewEngine<uint16_t *>, SourceLayout2=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<2>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, TargetEngine=cute::ViewEngine<cutlass::half_t *>, TargetLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<16>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, ScaleEngine=cute::ViewEngine<cutlass::half_t *>, ScaleLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::tuple<cute::C<2>, cute::C<2>, cute::_2, cute::C<2>>>, cute::tuple<cute::tuple<cute::_0, cute::_0>, cute::tuple<cute::_8, cute::C<16>, cute::C<32>, cute::C<4>>>>, QuantMapEngine=cute::ViewEngine<cute::smem_ptr<cutlass::half_t *>>, QuantMapLayout=cute::Layout<cute::tuple<cute::_8>, cute::tuple<cute::_1>>, QuantMapEngine2=cute::ViewEngine<cute::smem_ptr<__half2 *>>, QuantMapLayout2=cute::Layout<cute::tuple<cute::_64>, cute::tuple<cute::_1>>, QuantMapEngine3=cute::ArrayEngine<cutlass::half_t, 1>, QuantMapLayout3=cute::Layout<cute::tuple<cute::_1>, cute::tuple<cute::C<0>>>, QuantMapMode=config::QuantMapModeEnum::Vectorized]"
(410): here
instantiation of "void packbits_utils::dequantize<QuantMapMode,SourceEngine,SourceLayout,SourceEngine2,SourceLayout2,TargetEngine,TargetLayout,ScaleEngine,ScaleLayout,QuantMapEngine,QuantMapLayout,QuantMapEngine2,QuantMapLayout2,QuantMapEngine3,QuantMapLayout3,NumBits>(const cute::Tensor<SourceEngine, SourceLayout> &, const cute::Tensor<SourceEngine2, SourceLayout2> &, cute::Tensor<TargetEngine, TargetLayout> &&, const cute::Tensor<ScaleEngine, ScaleLayout> &, const cute::Tensor<QuantMapEngine, QuantMapLayout> &, const cute::Tensor<QuantMapEngine2, QuantMapLayout2> &, const cute::Tensor<QuantMapEngine3, QuantMapLayout3> &, NumBits) [with QuantMapMode=config::QuantMapModeEnum::Vectorized, SourceEngine=cute::ViewEngine<uint16_t *>, SourceLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<1>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::C<0>>>, SourceEngine2=cute::ViewEngine<uint16_t *>, SourceLayout2=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<2>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, TargetEngine=cute::ViewEngine<cutlass::half_t *>, TargetLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::C<16>>, cute::tuple<cute::tuple<cute::_1, cute::_2>, cute::_4>>, ScaleEngine=cute::ViewEngine<cutlass::half_t *>, ScaleLayout=cute::Layout<cute::tuple<cute::tuple<cute::_2, cute::_2>, cute::tuple<cute::C<2>, cute::C<2>, cute::_2, cute::C<2>>>, cute::tuple<cute::tuple<cute::_0, cute::_0>, cute::tuple<cute::_8, cute::C<16>, cute::C<32>, cute::C<4>>>>, QuantMapEngine=cute::ViewEngine<cute::smem_ptr<cutlass::half_t *>>, QuantMapLayout=cute::Layout<cute::tuple<cute::_8>, cute::tuple<cute::_1>>, QuantMapEngine2=cute::ViewEngine<cute::smem_ptr<__half2 *>>, QuantMapLayout2=cute::Layout<cute::tuple<cute::_64>, cute::tuple<cute::_1>>, QuantMapEngine3=cute::ArrayEngine<cutlass::half_t, 1>, QuantMapLayout3=cute::Layout<cute::tuple<cute::_1>, cute::tuple<cute::C<0>>>, NumBits=cute::C<3>]"
/root/cutlass/include/cute/algorithm/tuple_algorithms.hpp(185): here
instantiation of "void cute::for_each(T &&, F &&) [with T=std::integer_sequence<int, 0, 1, 2, 3>, F=lambda [](auto)->auto]"
/root/flute/flute/csrc/qgemm_kernel.hpp(714): here
instantiation of "void qgemm_device<Config,TileScheduler,T,TQ,T2>(const T *, const TQ *, T *, const T *, const T *, const T2 *, void *, TileScheduler) [with Config=config::GemmConfig<cutlass::half_t, uint16_t, cute::C<0>, cute::C<84>, cute::C<128>, cute::C<16>, cute::C<64>, cute::C<32>, cute::C<3>, cute::C<3>, cute::C<32>, config::QuantMapModeEnum::Vectorized, config::AccumulationModeEnum::Mixed, config::DecompositionModeEnum::StreamK, cute::C<2>, cute::C<1>>, TileScheduler=config::TileScheduler<config::GemmConfig<cutlass::half_t, uint16_t, cute::C<0>, cute::C<84>, cute::C<128>, cute::C<16>, cute::C<64>, cute::C<32>, cute::C<3>, cute::C<3>, cute::C<32>, config::QuantMapModeEnum::Vectorized, config::AccumulationModeEnum::Mixed, config::DecompositionModeEnum::StreamK, cute::C<2>, cute::C<1>>>, T=cutlass::half_t, TQ=uint16_t, T2=__half2]"
/root/flute/flute/csrc/qgemm_kernel.hpp(881): here
instantiation of "void qgemm_host<T,TQ,T2,Slices,Blocks,Threads,TileM,TileK,TileP,Stages,NumBits,GroupSize,QuantMapMode,AccumulationMode,DecompositionMode,G2STiledCopySizeS,MmaPrmK>(int, int, int, int, const T *, const TQ *, T *, const T *, const T *, const T2 *, void *, cudaStream_t) [with T=cutlass::half_t, TQ=uint16_t, T2=__half2, Slices=cute::C<0>, Blocks=cute::C<84>, Threads=cute::C<128>, TileM=cute::C<16>, TileK=cute::C<64>, TileP=cute::C<32>, Stages=cute::C<3>, NumBits=cute::C<3>, GroupSize=cute::C<32>, QuantMapMode=config::QuantMapModeEnum::Vectorized, AccumulationMode=config::AccumulationModeEnum::Mixed, DecompositionMode=config::DecompositionModeEnum::StreamK, G2STiledCopySizeS=cute::C<2>, MmaPrmK=cute::C<1>]"
/root/flute/flute/csrc/qgemm_kernel_generated.cu(106): here
instantiation of "void _qgemm<SMs,T,TQ,T2,NumBits,GroupSize>(int, int, int, int, const T *, const TQ *, T *, const T *, const T *, const T2 *, void *, cudaStream_t) [with SMs=cute::C<84>, T=cutlass::half_t, TQ=uint16_t, T2=__half2, NumBits=cute::C<3>, GroupSize=cute::C<32>]"
/root/flute/flute/csrc/qgemm_kernel_generated.cu(4081): here
Hi, the README was updated to include the CUTLASS version (v3.4.1
), sorry for the confusion!
The Github Actions section might provide (slightly) more details. But after getting the right version of CUTLASS, I hope the rest is very straightforward.
Thanks @HanGuo97 , amazing project and a great example of cute!
Thanks for the kind words! Obviously, we benefited a lot from your project + helps as well --- highly appreciated!
Hi, all. I've noticed that the csrc of flute depends on cute, but it's not present in the submodules. Could you advise on which version of cutlass I should use, and would it be possible to provide the scripts to build Flute from scratch?