NVIDIA / cccl

CUDA Core Compute Libraries
Other
1.13k stars 133 forks source link

variadic template implementation of thrust::tuple #695

Closed andrewcorrigan closed 9 months ago

andrewcorrigan commented 10 years ago

thrust::tuple does not use variadic templates. A variadic template implementation of thrust::tuple would not just make the internal implementation more elegant, there are serious practical issues with the current thrust::tuple implementation:

  1. Only tuples of size 10 may be used. Nested tuples are often an inconvenient workaround and are not always feasible.
  2. Compilation speed is poor, which could otherwise be significantly improved by using variadic templates. [1]
  3. Compiler error diagnostics are often unreadable due to thrust::null_type everywhere.

In consideration of the fact that C++11 support in nvcc is still in the release candidate, undocumented, and requires host compiler support, I don't expect C++03 support to be dropped anytime soon. Instead, can a user-supplied compile-time macro (e.g. THRUST_USE_CXX11) or auto-detection of __cplusplus >= 201103L please be considered to switch to an alternate implementation?

[1] http://www.jot.fm/issues/issue_2008_02/article2.pdf

andrewcorrigan commented 9 years ago

See also the discussion on the mailing list [1].

[1] https://groups.google.com/d/msg/thrust-users/FKf5n11OuAQ/1tmoQABfqNwJ

andrewcorrigan commented 9 years ago

I've begun a variadic branch in my repository. There are numerous tasks that must be accomplished to have a variadic thrust::tuple. I've already begun making progress in my first commit. However, so far that was just low-hanging fruit -- I am neither a Thrust nor C++11 expert, I am just a user in desperate need of a variadic thrust::tuple. I've only just started learning how to implement variadic templates and functions over the past few days, so I welcome guidance or hands-on collaboration on this.

First, the rest of Thrust, needs to be made compatible with the public interface of tuple, and no longer rely on the implementation details of thrust::tuple.

Then, an equivalent to <tuple> needs to be provided. I am aware currently of two options:

I have a question: should tuple_cat be allowed to make use of the internal implementation details of tuple?

Finally, I want to mention, that in my branch I am simply replacing the non-variadic code. I don't expect that approach to be taken with mainline Thrust. If and when this code is ready to be merged, I would only expect it to be experimental functionality guarded by __cplusplus >= 201103L or some other Thrust-specific macro that a user must opt-in to.

andrewcorrigan commented 9 years ago

I have fixed all the relevant code in the rest of Thrust to support variadic tuples.

I then tried the two approaches to replace non-variadic thrust::tuple. Neither is currently working, although the first is further along, and that is what I have committed to the variadic Thrust branch.

/thrust/thrust/tuple.h:57:35: Implicit instantiation of undefined template 'std::__1::tuple_size<thrust::detail::tuple_of_iterator_references<int, thrust::device_reference<int> > >'

I tried various things to define tuple_size for tuple_of_iterator_references, but nothing seems to work.

andrewcorrigan commented 9 years ago

Thrust now has variadic tuple support. The tuple implementation from agency did the trick. You need to define THRUST_VARIADIC_TUPLE to use the variadic implementation.

Example:

#include <sstream>
#include <iostream>
#define THRUST_VARIADIC_TUPLE
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>

int main(int argc, char** argv)
{
    auto zip11_start =
    thrust::make_zip_iterator(thrust::make_tuple(thrust::make_counting_iterator(1),
                                                 thrust::make_counting_iterator(2),
                                                 thrust::make_counting_iterator(3),
                                                 thrust::make_counting_iterator(4),
                                                 thrust::make_counting_iterator(5),
                                                 thrust::make_counting_iterator(6),
                                                 thrust::make_counting_iterator(7),
                                                 thrust::make_counting_iterator(8),
                                                 thrust::make_counting_iterator(9),
                                                 thrust::make_counting_iterator(10),
                                                 thrust::make_counting_iterator(11) ));

    auto a = *zip11_start;

    std::cout << "C++" << thrust::tuple_size<decltype(a)>::value << " enables tuples of length " << thrust::tuple_size<decltype(a)>::value << std::endl;
    std::cout << "For example: " << a << std::endl;

    auto b = thrust::tuple_cat(a,a);
    std::cout << "Thankfully, we don't have to wait for C++" << thrust::tuple_size<decltype(b)>::value << " for tuples of length " << thrust::tuple_size<decltype(b)>::value << std::endl;
    std::cout << "We already have them right now: " << b << std::endl; 

    return 0;
}

Gives output:

% clang++ -std=c++11 test.cpp -I/Users/acorriga/code/thrust
% ./a.out
C++11 enables tuples of length 11
For example: (1 2 3 4 5 6 7 8 9 10 11)
Thankfully we don't have to wait for C++22 for tuples of length 22
We already have them right now: (1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11)

Known issues:

             class = typename std::enable_if<
               (sizeof...(Types) == 1) &&
               std::is_constructible<Type1,const UType1&>::value // XXX fill in the rest of these
             >::type>
andrewcorrigan commented 9 years ago

I have updated my variadic branch [1] using the latest non-recursive, variadic tuple code from [2].

Using the new variadic tuple, my code sped up by 50% when using the cpp backend with clang. I encourage other Thrust users to give this a try. You might be pleasantly surprised by a similar performance boost in your code if you use tuple a lot like my code does!

Known issues: The known issues mentioned previously are resolved. I compiled and ran with g++-4.8, AppleClang 6.0 on the CPU backend, and nvcc 7.0 rc on Mac, and the code works. A new known issue is that compiling on Linux, with GCC triggers an internal compiler error. I just submitted this to NVIDIA with bug-id 1608271:

thrust/thrust/detail/tuple/variadic_tuple.h(395): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r346/r346_00/drivers/compiler/edg/EDG_4.9/src/expr.c", line 3108

[1] https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/tree/variadic [2] https://github.com/jaredhoberock/tuple

schiller-manuel commented 9 years ago

When will this variadic template be merged into the thrust master branch?

andrewcorrigan commented 9 years ago

I've been wondering the same thing for a long time. @jaredhoberock any word on a roadmap for C++11, including variadic tuple, support in Thrust?

schiller-manuel commented 9 years ago

If thrust::tuple was variadic, metaprogramming would be a lot easier. @andrewcorrigan I am currently using your variadic branch, it is really working great.

andrewcorrigan commented 9 years ago

I should mention, I haven't updated the variadic branch since Feb. 5. I am pretty sure that there have been changes and bug fixes since then in my local copy of Thrust based on the latest https://github.com/jaredhoberock/tuple. I should get you the latest version. If you don't see an update from me in the next few days, please ping me.

andrewcorrigan commented 9 years ago

@jaredhoberock Will you entertain a pull request for variadic tuple as well?

jaredhoberock commented 9 years ago

Variadic tuple is a huge, risky change. Is there any evidence that it is ready or that the compiler is ready for it?

Here are some questions I'd like to see answered before contemplating how to merge variadic tuple:

andrewcorrigan commented 9 years ago

Quick response off the top of my head.

Variadic tuple is a huge, risky change.

What's risky about it? You already implemented variadic tuple, I already got it integrated. We can just leave it optional if you don't want it be the default behavior right away.

Is there any evidence that it is ready or that the compiler is ready for it?

Yes... See above. I demonstrated an example above. We are both aware of a compiler workaround to get this compiling with CUDA 7, and that the compiler bug is fixed in CUDA 7.5.

What bug fixes exist which have not been merged from jaredhoberock/tuple?

I already have them merged in my local copy. If you are interested in moving forward I will push the latest very soon (otherwise, I have limited time to work on stuff that you may or may not be interested in pursuing).

How does compilation performance change after enabling variadic tuple?

Compilation performance is almost exclusively a function of compiler frontend (clang is great, gcc is good, icpc and nvcc are terrible). Qualitatively: It's atrocious either way with nvcc. It's fine either way with clang or gcc on the host side.

How does performance of algorithms change?

My Thrust-based code sped up 50%. That is huge... I am guessing you want more fine-grained comparisons. As per my post here https://github.com/thrust/thrust/pull/488#issuecomment-71380706, myself and others I work with have tried to run the performance tests built-in to Thrust, but with no luck. Would you be willing to help with this? I can get you precise error messages and description of what we tried.

Do any unit tests fail?

As per my post here https://github.com/thrust/thrust/pull/488#issuecomment-71380706, I have not had success getting the performance / unit tests to actually run. Would you be willing to help with this? I can get you precise error messages and description of what we tried.

What is the delta in total lines of code? If the old version of tuple was replaced with the new, would there be a net decrease in code size?

Net decrease.

There is a lot of repetitive code in master due to the lack of variadic templates. See for example this commit: [https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/d000fcf7f1302cdaa7b615dc3c903fadb0cbc18f. Variadic tuple [https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/blob/variadic/thrust/detail/tuple/variadic_tuple.h] itself is ~700 lines. This replaces, for example [https://github.com/thrust/thrust/blob/master/thrust/detail/tuple.inl] which is larger.

jaredhoberock commented 9 years ago

If you would care to file a new issue for getting scons to work on Mac, we can discuss the problems you encountered with unit tests there.

Besides the functional correctness of the variadic tuple implementation, I am concerned about the compile time and runtime performance of algorithms which use tuples internally in their implementation. Tuple is ubiquitous within Thrust's implementation. That means that casual users of Thrust who are uninterested in using variadic tuple will be indirectly affected by them. For example, if introducing variadic tuple were to increase the compile time of, say, reduce_by_key for no additional runtime performance benefit, then it is premature to do it.

In order to be convinced that it is safe to do this merge, we need to convince ourselves that this change does no harm.

andrewcorrigan commented 9 years ago

I want to say that it's wonderful to finally get some form of feedback on all this. I was confused when I went through the trouble of getting this new significant functionality working, observed a significant performance boost on top of that, and yet there was complete silence in response.

  1. Can you please confirm: Is the THRUST_VARIADIC_TUPLE option not viable? In other words, would you prefer that #if __cplusplus >= 201103L, then variadic tuple is used automatically? If we go with the optional approach, then there will be no risk of the performance issues you are concerned about.
  2. I also wonder, how would you want to handle commits like (https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/d000fcf7f1302cdaa7b615dc3c903fadb0cbc18f https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/47cb40380b87a66d1a7fe230a71380337e30b3c9 https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/ca5873209812296100b7b4a1ee8a5d9d465ed1d6)? That code is not the tuple implementation itself, but other code in Thrust that assumes tuple of size 10. Do you want two versions of everything? If it were up to me, I would not, and would want a separate C++11 branch entirely (which is what my personal copy of Thrust is already at this point), and let the C++03 version go into maintenance mode.
  3. What do you want to do in terms of namespace. In https://github.com/jaredhoberock/tuple you put a lot of stuff in std, even when __TUPLE_NAMESPACE is thrust. When I integrated your variadic tuple with Thrust, I changed this so everything goes in thrust, since the rest of Thrust assumes this. What approach do you want to take, put everything in Thrust? Or do you want to clean up the rest of Thrust, so that it uses, for example, std::get instead of thrust::get?
jaredhoberock commented 9 years ago
  1. Yes, as discussed in NVIDIA/thrust#624, I do not think a new preprocessor switch is viable.
  2. Yes, all code in Thrust which assumes there is a maximum tuple size needs to be generalized as a corollary to this change. Unless you can think of some other way to achieve the same result, it seems like there needs to be two versions of everything. I do not think a separate branch for C++11 is viable. I think I would duplicate things on a file-by-file basis. I would take the existing file names and suffix them with _cpp03.h and suffix the new stuff with _cpp11.h.
  3. tuple needs to go in namespace thrust. We should just keep it simple for now and simply generalize Thrust's current implementation to use variadics without introducing any new functionality.
andrewcorrigan commented 9 years ago

Thanks. I will start using file suffixes to handle the differences. If there is anything else you already don't like that we haven't already discussed, or anything you discover if/when you look at the code, can you please let me know?

Aside from any possible performance issues putting things on hold, I want this code to be as ready-to-merge as possible.

jaredhoberock commented 9 years ago

I don't think we've discussed unit tests -- there will need to be new tests which test the functionality of tuples larger than 10 elements. These should go in the tuple-specific unit tests and also the tests for zip_iterator.

andrewcorrigan commented 9 years ago

What about tuple_io and tuple_cat? We either need to get the open pull request merged (less work on my end), or that also needs to be folded into one massive PR.

jaredhoberock commented 9 years ago

Let's keep the scope of this effort focused on variadic tuple for now in order to maximize the chance of its success.

andrewcorrigan commented 9 years ago

So then are you going to merge the older PR first? If you don't merge that first, I have to redo a lot of variadic tuple work since variadic tuple was done according to the new directory structure you asked me to follow in the tuple_io PR.

In terms of duplicating things on a file-by-file basis, what about thrust/detail/type_traits.h for example. That is a large file, and only the implementation of and_ and or_ needs to be done using variadic templates. Can that be done with an #ifdef like was done to implement result_of?

andrewcorrigan commented 9 years ago

@dachziegel I just pushed to https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/tree/variadic, which includes the latest from https://github.com/jaredhoberock/tuple, as well as a workaround for a bug in CUDA 7.0. See https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commits/variadic for details. Please let me know if it works for you or not.

schiller-manuel commented 9 years ago

Thanks Andrew, I will try it.

schiller-manuel commented 9 years ago

Should this also work with THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP?

I tried this code: thrust_variadic_tuple_test.cpp

It fails to compile for me using

clang++-3.7 -I/tmp/thrust-variadic -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP -std=c++11 thrust_variadic_tuple_test.cpp

However, using std::tuple (hacky #define thrust std), it compiles.

jaredhoberock commented 9 years ago

Yes, I agree that integrating variadic tuple will require quite a bit of work. Nonetheless, I do not wish to condition this pull request on a separate tangentially related request.

The and_ & or_ traits are pretty simple -- why not create variadic versions of them which are local to the implementation of variadic tuple? Once something else requires them, they can be integrated into the type_traits.h file at that time.

andrewcorrigan commented 9 years ago

Can you please clarify what you want to do about the request you made in https://github.com/thrust/thrust/pull/488#issuecomment-42870381? Which directory structure do you want now?

If you still want the new directory structure, I am going to do a PR with just that reorganization. If not, then I am going to move the contents of detail/tuple back into detail. Either way, the impending variadic tuple PR needs to have a clean diff against master for it to be manageable.

jaredhoberock commented 9 years ago

That directory structure I suggested looks good to me. Please keep everything confined to this pull request. Thanks.

andrewcorrigan commented 9 years ago

Thanks for testing it. It should absolutely work with THRUST_DEVICE_SYSTEM_CPP. I was not able to reproduce the error you posted.

Here's what I did:

  1. I removed the line std::cout << call_functor(t) << std::endl; since call_functor is not defined.
  2. When I compiled like you did (i.e., without defining THRUST_VARIADIC_TUPLE), except with Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn), your test code compiled.
clang++ -I/Users/acorriga/code/thrust -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP -std=c++11 test.cpp
  1. I then added -DTHRUST_VARIADIC_TUPLE to the compilation, and that compiled as well.

It's pretty hard for me to diagnose without reproducing the error. The only thing I can say is that you are getting errors in code that would be ignored if you had defined THRUST_VARIADIC_TUPLE. What happens when you actually use variadic tuple?

andrewcorrigan commented 9 years ago

Are you sure your test code is correct? See the error:

thrust_variadic_tuple_test.cpp:43:15: note: while substituting deduced template arguments into function template 'get' [with N = 3, HT = int, TT = thrust::detail::cons<int, thrust::detail::cons<int, thrust::null_type> >]
  return test(thrust::get<Indices>(t)...);

It seems like your code is calling get<3> on a cons list (aka tuple) of length 3, which is obviously wrong.

schiller-manuel commented 9 years ago

Sorry for the false alarm, I thought that setting -std=c+11 was already sufficient and that THRUST_VARIADIC_TUPLE was not needed anymore.

When setting this define, my code compiles fine for me.

andrewcorrigan commented 9 years ago

It seems to me that if it works with variadic tuple, then it's by accident. get<3> of a tuple of size 3 should be an error.

schiller-manuel commented 9 years ago

The get<3> only occured because sizeof... reported more than 3 tuple elements (due to the remaining elements of the non-variadic tuple being filled with null_type) and compilation failed after that.

trontrytel commented 8 years ago

Sorry for asking, as I know all of this is a lot of work. But do you know when/if this variadic template will be merged into the thrust master branch?

Cazadorro commented 6 years ago

so.... I'm guessing at this point this will never actually be merged into master, How would I get access to this functionality when using master? Is there another person's include that replaces thrust::tuple that I can use instead, or do I have to switch to someone elses branch entirely?

henryiii commented 6 years ago

Thrust is currently under heavy reorganization, I believe, and so this is probably waiting until then.

This is a key feature that the Hydra library is built on, and that has it's own maintained Thrust fork (but heavily customized and renamed to avoid interfering with a system Thrust).

Cazadorro commented 6 years ago

@henryiii

but if you're interested, come to GTC 2018!

And gtc 2018 has come and gone with no mention of any of this.

I saw your link on hydra, I'm only really concerned about variadic templates and not the additional functionality, I believe that is located here. I do have CUDA 9.1 however and it looks like from the discussion you linked it uses a different version of thrust from what hosted on github currently? so maybe it is already there aswell?

henryiii commented 6 years ago

He didn't actually say Thrust would be at GTC, just their plans for CUDA...

The version built into CUDA 9.1 is mostly stripped down and split into parts (CUB, etc), and of course patched for CUDA 9. I don't think it has any new functionality.

I'm not sure how easy that is to extract from Hydra, but that is the best current version that I know of. You might be able to ask @AAAlvesJr for a recommendation for the fork he started from along with key changes necessary for CUDA 9.

brycelelbach commented 6 years ago

Hey @cazadorro - I didn't give a talk at GTC, although I was around to chat with folks about Thrust.

This PR will likely be merged for the next release of CUDA that we are working on now (not CUDA 9.2, but the one after that).

Cazadorro commented 6 years ago

@brycelelbach so in like 3 - 4 months ish?

brycelelbach commented 6 years ago

@Cazadorro I can't commit firmly to a time frame for CUDA releases. For just Thrust, I think we will have an updated version on Github and this integrated in 2 - 4 months.

andrewcorrigan commented 6 years ago

finally..

Char-Aznable commented 5 years ago

@andrewcorrigan I'm interested in this as well. I'm using cuda 10 but I can't find the implementation of variadic tuple. Did this ever make it to either CUDA release thrust or the github thrust?

andrewcorrigan commented 5 years ago

Not that I'm aware of. I've been migrating my code away from relying on Thrust for this sort of thing anyway, so I haven't really kept up with more recent Thrust versions.

Cazadorro commented 5 years ago

@andrewcorrigan How have you been able to do this? I've found thrust is convenient until it isn't (like in these scenarios). What do you do to migrate away?

brycelelbach commented 5 years ago

It's still on our TODO list - probably for this winter.


Bryce Adelstein Lelbach aka wash ISO C++ LEWGI Chair CppCon and C++Now Program Chair Thrust Maintainer, HPX Developer CUDA Convert and Reformed AVX Junkie

Ask "Dumb" Questions


From: Andrew Corrigan notifications@github.com Sent: Wednesday, December 19, 2018 10:19 PM To: thrust/thrust Cc: Bryce Lelbach; Mention Subject: Re: [thrust/thrust] variadic template implementation of thrust::tuple (#524)

Not that I'm aware of. I've been migrating my code away from relying on Thrust for this sort of thing anyway.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/thrust/thrust/issues/524#issuecomment-448845128, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAYTco9Ssul8F_yGCVI82dO3TTPUysF2ks5u6wG2gaJpZM4CL8QU.


This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

andrewcorrigan commented 5 years ago

@Cazadorro When thrust was initially developed it had to duplicate functionality from the (pre-standard) C++11 standard library and Boost to be __device__ aware. One of the big changes is that now constexpr code is implicitly __device__, so for example, thrust::tuple can be replaced by std::tuple. std::complex can replace thrust::complex. std::array can be used too; AFAIK thrust::array was never implemented anyway.

Another change is that much of Thrust has been standardized as part of the C++17 parallel algorithms library, so now there are alternatives like parallelstl that cover many of the parallel algorithms provided by Thrust.

Cazadorro commented 5 years ago

@andrewcorrigan Oh I didn't realize that! so it would be better now for thrust to just not duplicate that functionality because it all works with standard. I knew it was odd when I was able to just use std::get on thrust::tuples. Is cuda an actual backend to parrallel algorithms? So far it seems GCC is not up to date with that part of the standard and has only experimentally merged intels parallel implementation.

andrewcorrigan commented 5 years ago

There are indications [1] that there will be CUDA support in the parallel algorithms library implementation, but I don't know anything other than what I read on public mailing lists.

[1] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126998.html

leofang commented 2 years ago

Hi @andrewcorrigan @brycelelbach @allisonvacanti Just checking in: Any progress for improving thrust::tuple after NVIDIA/thrust#1310? I just realized that thrust::tuple cannot contain more than 10 elements and there is no WAR.

(We can work around on our side so it's ok if this is not on the near-term roadmap, but it's still nice to have.)

#include <tuple>
#include <thrust/tuple.h>
#include <thrust/type_traits/integer_sequence.h>

template <typename F, size_t... Is>
auto gen_tuple_impl(F func, std::index_sequence<Is...> ) {
    return std::make_tuple(func(Is)...);
}

template <size_t N, typename F>
auto gen_tuple(F func) {
    return gen_tuple_impl(func, std::make_index_sequence<N>{} );
}

template <typename F, size_t... Is>
auto gen_thrust_tuple_impl(F func, thrust::index_sequence<Is...> ) {
    return thrust::make_tuple(func(Is)...);
}

template <size_t N, typename F>
auto gen_thrust_tuple(F func) {
    return gen_thrust_tuple_impl(func, thrust::make_index_sequence<N>{} );
}

int main() {

    constexpr int NDIM = 11;
    int data[NDIM];
    auto tuple_std = gen_tuple<NDIM>([&](int i){ return data[i]; });  // this works
    auto tuple_thrust = gen_thrust_tuple<NDIM>([&](int i){ return data[i]; });  // this does not

    auto tuple_std2 = std::make_tuple(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11);  // this works
    auto tuple_thrust2 = thrust::make_tuple(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11);  // this does not

    return 0;
}

cc: @kmaehashi @asi1024

andrewcorrigan commented 2 years ago

The only other info on this that I'm aware of is here: https://github.com/NVIDIA/cccl/issues/742. Of course, I don't speak for NVIDIA / Thrust.

leofang commented 2 years ago

Ah yes, thanks for pointer Andrew. I've seen it but forgot about it...