NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[BUG] Autovectorized copy fails with shape (_2, _3) #1499

Closed: YichengDWu closed this issue 2 months ago

YichengDWu commented 5 months ago

Describe the bug

Automatic vectorization of copy doesn't account for shape divisibility. For a tensor with layout (_2,_3):(_1,_2), the greatest common vector length is 6 elements. The copy is nevertheless vectorized at 128 bits, i.e. four int elements at a time, and a four-element vector cannot evenly cover a tensor of size 6.
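
For reference, here is a minimal host-side sketch that exhibits the mismatch; it assumes cute::max_common_vector reports the shared contiguous run length, as its name suggests:

#include <cute/tensor.hpp>

using namespace cute;

int main() {
  auto layout = make_layout(make_shape(Int<2>{}, Int<3>{}));  // (_2,_3):(_1,_2)
  // Longest contiguous vector common to both layouts: 6 elements.
  print(max_common_vector(layout, layout));  // expected to print _6
  // A 128-bit vector moves 128/32 = 4 ints at once; 4 does not divide 6,
  // so recasting this int layout to uint128_t cannot succeed.
  return 0;
}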

Steps/Code to reproduce bug

#include <vector>

#include <cute/tensor.hpp>
#include <cute/layout.hpp>

using namespace cute;

int main() {
    // Compile-time column-major layout (_2,_3):(_1,_2)
    auto mem_layout = make_layout(make_shape(Int<2>{}, Int<3>{}));
    print_layout(mem_layout);

    std::vector<int> src_buffer(size(mem_layout));
    std::vector<int> dst_buffer(size(mem_layout));

    auto src = make_tensor(src_buffer.data(), mem_layout);

    for (int t = 0; t < size(mem_layout); t++) {
        src[t] = t;
    }
    print_tensor(src);

    auto dst = make_tensor(dst_buffer.data(), mem_layout);
    copy(src, dst);  // triggers the static assertion below (test.cpp:24 in the trace)
    print_tensor(dst);

    return 0;
}

I got the following error:

error: static assertion failed due to requirement 'C<3>::value % C<2>::value == 0 || C<2>::value % C<3>::value == 0': Static shape_div failure
  405 |     static_assert(IntTupleA::value % IntTupleB::value == 0 || IntTupleB::value % IntTupleA::value == 0, "Static shape_div failure");
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/ethan/cutlass/include/cute/layout.hpp:1680:24: note: in instantiation of function template specialization 'cute::shape_div<cute::C<3>, cute::C<2>>' requested here
 1680 |     return make_layout(shape_div(shape,  shape_div(Int<N>{}, abs(stride))),
      |                        ^
/home/ethan/cutlass/include/cute/layout.hpp:1676:86: note: in instantiation of function template specialization 'cute::upcast<4, cute::C<3>, cute::C<2>>' requested here
 1676 |     return transform_layout(shape, stride, [](auto const& s, auto const& d) { return upcast<N>(s,d); });
      |                                                                                      ^
/home/ethan/cutlass/include/cute/layout.hpp:704:22: note: in instantiation of function template specialization 'cute::upcast(const cute::tuple<cute::C<2>, cute::C<3>> &, const cute::tuple<cute::C<1>, cute::C<2>> &)::(anonymous class)::operator()<cute::C<3>, cute::C<2>>' requested here
  704 |   return make_layout(f(get<I>(t0),get<I>(t1))..., get<I0>(t0)..., get<I1>(t1)...);
      |                      ^
/home/ethan/cutlass/include/cute/layout.hpp:725:18: note: in instantiation of function template specialization 'cute::detail::transform_layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>, (lambda at /home/ethan/cutlass/include/cute/layout.hpp:1676:44) &, 0, 1>' requested here
  725 |   return detail::transform_layout(t0, t1, f, make_seq<R>{}, make_range<R,R0>{}, make_range<R,R1>{});
      |                  ^
/home/ethan/cutlass/include/cute/layout.hpp:1676:12: note: in instantiation of function template specialization 'cute::transform_layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>, (lambda at /home/ethan/cutlass/include/cute/layout.hpp:1676:44)>' requested here
 1676 |     return transform_layout(shape, stride, [](auto const& s, auto const& d) { return upcast<N>(s,d); });
      |            ^
/home/ethan/cutlass/include/cute/layout.hpp:1696:10: note: (skipping 2 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
 1696 |   return upcast<N>(layout.shape(), layout.stride());
      |          ^
/home/ethan/cutlass/include/cute/tensor.hpp:658:21: note: in instantiation of function template specialization 'cute::recast_layout<int, const cutlass::uint128_t, cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>' requested here
  658 |   auto new_layout = recast_layout<OldType,NewType>(old_layout);
      |                     ^
/home/ethan/cutlass/include/cute/algorithm/copy.hpp:210:20: note: in instantiation of function template specialization 'cute::recast<const cutlass::uint128_t, const cute::Tensor<cute::ViewEngine<int *>, cute::Layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>> &>' requested here
  210 |     Tensor src_v = recast<SrcVecType>(src);
      |                    ^
/home/ethan/cutlass/include/cute/algorithm/copy.hpp:283:12: note: in instantiation of function template specialization 'cute::copy_vec<cutlass::uint128_t, cute::ViewEngine<int *>, cute::Layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>, cute::ViewEngine<int *>, cute::Layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>>' requested here
  283 |     return copy_vec<uint_bit_t<vec_bits>>(src, dst);
      |            ^
/home/ethan/cutlass/include/cute/algorithm/copy.hpp:297:10: note: in instantiation of function template specialization 'cute::copy<8, cute::ViewEngine<int *>, cute::Layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>, cute::ViewEngine<int *>, cute::Layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>>' requested here
  297 |   return copy(AutoVectorizingCopy{}, src, dst);
      |          ^
test.cpp:24:5: note: in instantiation of function template specialization 'cute::copy<cute::ViewEngine<int *>, cute::Layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>, cute::ViewEngine<int *>, cute::Layout<cute::tuple<cute::C<2>, cute::C<3>>, cute::tuple<cute::C<1>, cute::C<2>>>>' requested here
   24 |     copy(src, dst);
      |     ^
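
The trace shows the failure path: the default copy dispatches through the auto-vectorizing policy, recasts both tensors to cutlass::uint128_t, and upcasting the (_2,_3):(_1,_2) layout by a factor of 4 reaches shape_div(C<3>, C<2>), which cannot divide evenly. Until this is fixed, replacing the copy call in the repro with the line below may sidestep the assert; this is a hedged sketch, assuming cute::AutoVectorizingCopyWithAssumedAlignment<MaxVecBits> caps the vector width as its name suggests (untested here):

// Hedged workaround sketch, not from the original report: capping the assumed
// alignment at the 32-bit element width keeps the recast at one int per
// transfer, so no division of the C<3> shape mode is attempted.
copy(AutoVectorizingCopyWithAssumedAlignment<32>{}, src, dst);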

Expected behavior

copy(src, dst) should just work. Internally, copy should compute a correct predicate (or fall back to a narrower vector width) and use that to perform the copy.
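
For illustration, here is a minimal sketch of one shape-aware selection rule (my own guess at plausible fallback logic, not CUTLASS's actual implementation): clamp the alignment-derived vector width so it also divides the common vector length, instead of failing to compile.

#include <numeric>  // std::gcd (C++17)

// Hypothetical width selection for the (_2,_3):(_1,_2) int case above.
constexpr int common_elems = 6;                        // max_common_vector length
constexpr int align_elems  = 128 / (8 * sizeof(int));  // 4 elements at 128-bit alignment
constexpr int vec_elems    = std::gcd(common_elems, align_elems);  // 2 elements = 64 bits
static_assert(common_elems % vec_elems == 0, "chosen vector width divides the copy extent");

int main() { return 0; }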


thakkarV commented 5 months ago

We have found similar issues internally with "domain alignment" of copies. This is being worked on right now.

github-actions[bot] commented 4 months ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

YichengDWu commented 3 months ago

Still needs to be addressed.

thakkarV commented 3 months ago

Will be fixed in 3.5.1. The ETA is hard to predict, but hopefully within the next couple of weeks.

thakkarV commented 2 months ago

@YichengDWu please verify and close.

YichengDWu commented 2 months ago

I can't verify it now since I no longer have an NVIDIA card. Thank you for fixing this!