Closed andrewcorrigan closed 9 months ago
See also the discussion on the mailing list [1].
[1] https://groups.google.com/d/msg/thrust-users/FKf5n11OuAQ/1tmoQABfqNwJ
I've begun a variadic branch in my repository. There are numerous tasks that must be accomplished to have a variadic thrust::tuple
. I've already begun making progress in my first commit. However, so far that was just low-hanging fruit -- I am neither a Thrust nor C++11 expert, I am just a user in desperate need of a variadic thrust::tuple
. I've only just started learning how to implement variadic templates and functions over the past few days, so I welcome guidance or hands-on collaboration on this.
First, the rest of Thrust, needs to be made compatible with the public interface of tuple, and no longer rely on the implementation details of thrust::tuple
.
tuple
of size up to or exactly ten.thrust::null_type
. In one instance it was simply replaced with thrust::tuple<>
, but otherwise, completely eliminating thrust::null_type
completely may have to wait until a true variadic tuple is used since null_type is what makes a lot of code/meta-programming with thrust::tuple
work.tuple::head_type
and tuple::tail_type
.detail::cons
.Then, an equivalent to <tuple>
needs to be provided. I am aware currently of two options:
#include <tuple>
and using std::tuple
. This would only work for host code, but could be used for development for the time being, since nvcc
does not provide a __device__
aware std::tuple
, but perhaps a future release will provide this and it would just be a matter of waiting?std::tuple
implementation.I have a question: should tuple_cat
be allowed to make use of the internal implementation details of tuple
?
#include <tuple>
should be allowed to. So for example tuple_cat
could use internal implementation detail, but tuple_io
would not be allowed to. If ultimately std::tuple
is used, then tuple_cat
will be provided automatically, so the current implementation-detail-specific tuple_cat
would be overridden.std::tuple
implementation does not provide tuple_cat.Finally, I want to mention, that in my branch I am simply replacing the non-variadic code. I don't expect that approach to be taken with mainline Thrust. If and when this code is ready to be merged, I would only expect it to be experimental functionality guarded by __cplusplus >= 201103L
or some other Thrust-specific macro that a user must opt-in to.
I have fixed all the relevant code in the rest of Thrust to support variadic tuples.
I then tried the two approaches to replace non-variadic thrust::tuple
. Neither is currently working, although the first is further along, and that is what I have committed to the variadic Thrust branch.
#include <tuple>
and aliasing into the Thrust namespace. This can be enabled with the current repository by #define THRUST_VARIADIC_TUPLE
before #include <thrust/tuple.h>
. I am currently stuck on the error:/thrust/thrust/tuple.h:57:35: Implicit instantiation of undefined template 'std::__1::tuple_size<thrust::detail::tuple_of_iterator_references<int, thrust::device_reference<int> > >'
I tried various things to define tuple_size
for tuple_of_iterator_references
, but nothing seems to work.
__device__
decorators and have everything working with CUDA.Thrust now has variadic tuple support. The tuple implementation from agency did the trick. You need to define THRUST_VARIADIC_TUPLE to use the variadic implementation.
Example:
#include <sstream>
#include <iostream>
#define THRUST_VARIADIC_TUPLE
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
int main(int argc, char** argv)
{
auto zip11_start =
thrust::make_zip_iterator(thrust::make_tuple(thrust::make_counting_iterator(1),
thrust::make_counting_iterator(2),
thrust::make_counting_iterator(3),
thrust::make_counting_iterator(4),
thrust::make_counting_iterator(5),
thrust::make_counting_iterator(6),
thrust::make_counting_iterator(7),
thrust::make_counting_iterator(8),
thrust::make_counting_iterator(9),
thrust::make_counting_iterator(10),
thrust::make_counting_iterator(11) ));
auto a = *zip11_start;
std::cout << "C++" << thrust::tuple_size<decltype(a)>::value << " enables tuples of length " << thrust::tuple_size<decltype(a)>::value << std::endl;
std::cout << "For example: " << a << std::endl;
auto b = thrust::tuple_cat(a,a);
std::cout << "Thankfully, we don't have to wait for C++" << thrust::tuple_size<decltype(b)>::value << " for tuples of length " << thrust::tuple_size<decltype(b)>::value << std::endl;
std::cout << "We already have them right now: " << b << std::endl;
return 0;
}
Gives output:
% clang++ -std=c++11 test.cpp -I/Users/acorriga/code/thrust
% ./a.out
C++11 enables tuples of length 11
For example: (1 2 3 4 5 6 7 8 9 10 11)
Thankfully we don't have to wait for C++22 for tuples of length 22
We already have them right now: (1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11)
Known issues:
__device__
code. class = typename std::enable_if<
(sizeof...(Types) == 1) &&
std::is_constructible<Type1,const UType1&>::value // XXX fill in the rest of these
>::type>
I have updated my variadic branch [1] using the latest non-recursive, variadic tuple code from [2].
Using the new variadic tuple, my code sped up by 50% when using the cpp backend with clang. I encourage other Thrust users to give this a try. You might be pleasantly surprised by a similar performance boost in your code if you use tuple a lot like my code does!
Known issues: The known issues mentioned previously are resolved. I compiled and ran with g++-4.8, AppleClang 6.0 on the CPU backend, and nvcc 7.0 rc on Mac, and the code works. A new known issue is that compiling on Linux, with GCC triggers an internal compiler error. I just submitted this to NVIDIA with bug-id 1608271:
thrust/thrust/detail/tuple/variadic_tuple.h(395): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r346/r346_00/drivers/compiler/edg/EDG_4.9/src/expr.c", line 3108
[1] https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/tree/variadic [2] https://github.com/jaredhoberock/tuple
When will this variadic template be merged into the thrust master branch?
I've been wondering the same thing for a long time. @jaredhoberock any word on a roadmap for C++11, including variadic tuple, support in Thrust?
If thrust::tuple was variadic, metaprogramming would be a lot easier. @andrewcorrigan I am currently using your variadic branch, it is really working great.
I should mention, I haven't updated the variadic branch since Feb. 5. I am pretty sure that there have been changes and bug fixes since then in my local copy of Thrust based on the latest https://github.com/jaredhoberock/tuple. I should get you the latest version. If you don't see an update from me in the next few days, please ping me.
@jaredhoberock Will you entertain a pull request for variadic tuple as well?
Variadic tuple is a huge, risky change. Is there any evidence that it is ready or that the compiler is ready for it?
Here are some questions I'd like to see answered before contemplating how to merge variadic tuple:
Quick response off the top of my head.
Variadic tuple is a huge, risky change.
What's risky about it? You already implemented variadic tuple, I already got it integrated. We can just leave it optional if you don't want it be the default behavior right away.
Is there any evidence that it is ready or that the compiler is ready for it?
Yes... See above. I demonstrated an example above. We are both aware of a compiler workaround to get this compiling with CUDA 7, and that the compiler bug is fixed in CUDA 7.5.
What bug fixes exist which have not been merged from jaredhoberock/tuple?
I already have them merged in my local copy. If you are interested in moving forward I will push the latest very soon (otherwise, I have limited time to work on stuff that you may or may not be interested in pursuing).
How does compilation performance change after enabling variadic tuple?
Compilation performance is almost exclusively a function of compiler frontend (clang is great, gcc is good, icpc and nvcc are terrible). Qualitatively: It's atrocious either way with nvcc. It's fine either way with clang or gcc on the host side.
How does performance of algorithms change?
My Thrust-based code sped up 50%. That is huge... I am guessing you want more fine-grained comparisons. As per my post here https://github.com/thrust/thrust/pull/488#issuecomment-71380706, myself and others I work with have tried to run the performance tests built-in to Thrust, but with no luck. Would you be willing to help with this? I can get you precise error messages and description of what we tried.
Do any unit tests fail?
As per my post here https://github.com/thrust/thrust/pull/488#issuecomment-71380706, I have not had success getting the performance / unit tests to actually run. Would you be willing to help with this? I can get you precise error messages and description of what we tried.
What is the delta in total lines of code? If the old version of tuple was replaced with the new, would there be a net decrease in code size?
Net decrease.
There is a lot of repetitive code in master due to the lack of variadic templates. See for example this commit: [https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/d000fcf7f1302cdaa7b615dc3c903fadb0cbc18f. Variadic tuple [https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/blob/variadic/thrust/detail/tuple/variadic_tuple.h] itself is ~700 lines. This replaces, for example [https://github.com/thrust/thrust/blob/master/thrust/detail/tuple.inl] which is larger.
If you would care to file a new issue for getting scons to work on Mac, we can discuss the problems you encountered with unit tests there.
Besides the functional correctness of the variadic tuple implementation, I am concerned about the compile time and runtime performance of algorithms which use tuples internally in their implementation. Tuple is ubiquitous within Thrust's implementation. That means that casual users of Thrust who are uninterested in using variadic tuple will be indirectly affected by them. For example, if introducing variadic tuple were to increase the compile time of, say, reduce_by_key
for no additional runtime performance benefit, then it is premature to do it.
In order to be convinced that it is safe to do this merge, we need to convince ourselves that this change does no harm.
I want to say that it's wonderful to finally get some form of feedback on all this. I was confused when I went through the trouble of getting this new significant functionality working, observed a significant performance boost on top of that, and yet there was complete silence in response.
std
, even when __TUPLE_NAMESPACE
is thrust
. When I integrated your variadic tuple with Thrust, I changed this so everything goes in thrust
, since the rest of Thrust assumes this. What approach do you want to take, put everything in Thrust? Or do you want to clean up the rest of Thrust, so that it uses, for example, std::get
instead of thrust::get
?_cpp03.h
and suffix the new stuff with _cpp11.h
.tuple
needs to go in namespace thrust
. We should just keep it simple for now and simply generalize Thrust's current implementation to use variadics without introducing any new functionality.Thanks. I will start using file suffixes to handle the differences. If there is anything else you already don't like that we haven't already discussed, or anything you discover if/when you look at the code, can you please let me know?
Aside from any possible performance issues putting things on hold, I want this code to be as ready-to-merge as possible.
I don't think we've discussed unit tests -- there will need to be new tests which test the functionality of tuples larger than 10 elements. These should go in the tuple-specific unit tests and also the tests for zip_iterator
.
What about tuple_io
and tuple_cat
? We either need to get the open pull request merged (less work on my end), or that also needs to be folded into one massive PR.
Let's keep the scope of this effort focused on variadic tuple for now in order to maximize the chance of its success.
So then are you going to merge the older PR first? If you don't merge that first, I have to redo a lot of variadic tuple work since variadic tuple was done according to the new directory structure you asked me to follow in the tuple_io PR.
In terms of duplicating things on a file-by-file basis, what about thrust/detail/type_traits.h
for example. That is a large file, and only the implementation of and_
and or_
needs to be done using variadic templates. Can that be done with an #ifdef
like was done to implement result_of
?
@dachziegel I just pushed to https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/tree/variadic, which includes the latest from https://github.com/jaredhoberock/tuple, as well as a workaround for a bug in CUDA 7.0. See https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commits/variadic for details. Please let me know if it works for you or not.
Thanks Andrew, I will try it.
Should this also work with THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP?
I tried this code: thrust_variadic_tuple_test.cpp
It fails to compile for me using
clang++-3.7 -I/tmp/thrust-variadic -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP -std=c++11 thrust_variadic_tuple_test.cpp
However, using std::tuple
(hacky #define thrust std
), it compiles.
Yes, I agree that integrating variadic tuple will require quite a bit of work. Nonetheless, I do not wish to condition this pull request on a separate tangentially related request.
The and_
& or_
traits are pretty simple -- why not create variadic versions of them which are local to the implementation of variadic tuple? Once something else requires them, they can be integrated into the type_traits.h
file at that time.
Can you please clarify what you want to do about the request you made in https://github.com/thrust/thrust/pull/488#issuecomment-42870381? Which directory structure do you want now?
If you still want the new directory structure, I am going to do a PR with just that reorganization. If not, then I am going to move the contents of detail/tuple
back into detail
. Either way, the impending variadic tuple PR needs to have a clean diff against master for it to be manageable.
That directory structure I suggested looks good to me. Please keep everything confined to this pull request. Thanks.
Thanks for testing it. It should absolutely work with THRUST_DEVICE_SYSTEM_CPP
. I was not able to reproduce the error you posted.
Here's what I did:
std::cout << call_functor(t) << std::endl;
since call_functor
is not defined.THRUST_VARIADIC_TUPLE
), except with Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)
, your test code compiled.clang++ -I/Users/acorriga/code/thrust -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP -std=c++11 test.cpp
-DTHRUST_VARIADIC_TUPLE
to the compilation, and that compiled as well.It's pretty hard for me to diagnose without reproducing the error. The only thing I can say is that you are getting errors in code that would be ignored if you had defined THRUST_VARIADIC_TUPLE
. What happens when you actually use variadic tuple?
Are you sure your test code is correct? See the error:
thrust_variadic_tuple_test.cpp:43:15: note: while substituting deduced template arguments into function template 'get' [with N = 3, HT = int, TT = thrust::detail::cons<int, thrust::detail::cons<int, thrust::null_type> >]
return test(thrust::get<Indices>(t)...);
It seems like your code is calling get<3>
on a cons list (aka tuple) of length 3, which is obviously wrong.
Sorry for the false alarm, I thought that setting -std=c+11
was already sufficient and that THRUST_VARIADIC_TUPLE
was not needed anymore.
When setting this define, my code compiles fine for me.
It seems to me that if it works with variadic tuple, then it's by accident. get<3> of a tuple of size 3 should be an error.
The get<3>
only occured because sizeof...
reported more than 3 tuple elements (due to the remaining elements of the non-variadic tuple being filled with null_type
) and compilation failed after that.
Sorry for asking, as I know all of this is a lot of work. But do you know when/if this variadic template will be merged into the thrust master branch?
so.... I'm guessing at this point this will never actually be merged into master, How would I get access to this functionality when using master? Is there another person's include that replaces thrust::tuple that I can use instead, or do I have to switch to someone elses branch entirely?
Thrust is currently under heavy reorganization, I believe, and so this is probably waiting until then.
This is a key feature that the Hydra library is built on, and that has it's own maintained Thrust fork (but heavily customized and renamed to avoid interfering with a system Thrust).
@henryiii
but if you're interested, come to GTC 2018!
And gtc 2018 has come and gone with no mention of any of this.
I saw your link on hydra, I'm only really concerned about variadic templates and not the additional functionality, I believe that is located here. I do have CUDA 9.1 however and it looks like from the discussion you linked it uses a different version of thrust from what hosted on github currently? so maybe it is already there aswell?
He didn't actually say Thrust would be at GTC, just their plans for CUDA...
The version built into CUDA 9.1 is mostly stripped down and split into parts (CUB, etc), and of course patched for CUDA 9. I don't think it has any new functionality.
I'm not sure how easy that is to extract from Hydra, but that is the best current version that I know of. You might be able to ask @AAAlvesJr for a recommendation for the fork he started from along with key changes necessary for CUDA 9.
Hey @cazadorro - I didn't give a talk at GTC, although I was around to chat with folks about Thrust.
This PR will likely be merged for the next release of CUDA that we are working on now (not CUDA 9.2, but the one after that).
@brycelelbach so in like 3 - 4 months ish?
@Cazadorro I can't commit firmly to a time frame for CUDA releases. For just Thrust, I think we will have an updated version on Github and this integrated in 2 - 4 months.
finally..
@andrewcorrigan I'm interested in this as well. I'm using cuda 10 but I can't find the implementation of variadic tuple. Did this ever make it to either CUDA release thrust or the github thrust?
Not that I'm aware of. I've been migrating my code away from relying on Thrust for this sort of thing anyway, so I haven't really kept up with more recent Thrust versions.
@andrewcorrigan How have you been able to do this? I've found thrust is convenient until it isn't (like in these scenarios). What do you do to migrate away?
It's still on our TODO list - probably for this winter.
Bryce Adelstein Lelbach aka wash ISO C++ LEWGI Chair CppCon and C++Now Program Chair Thrust Maintainer, HPX Developer CUDA Convert and Reformed AVX Junkie
From: Andrew Corrigan notifications@github.com Sent: Wednesday, December 19, 2018 10:19 PM To: thrust/thrust Cc: Bryce Lelbach; Mention Subject: Re: [thrust/thrust] variadic template implementation of thrust::tuple (#524)
Not that I'm aware of. I've been migrating my code away from relying on Thrust for this sort of thing anyway.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/thrust/thrust/issues/524#issuecomment-448845128, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAYTco9Ssul8F_yGCVI82dO3TTPUysF2ks5u6wG2gaJpZM4CL8QU.
@Cazadorro When thrust was initially developed it had to duplicate functionality from the (pre-standard) C++11 standard library and Boost to be __device__
aware. One of the big changes is that now constexpr
code is implicitly __device__
, so for example, thrust::tuple
can be replaced by std::tuple
. std::complex
can replace thrust::complex
. std::array
can be used too; AFAIK thrust::array
was never implemented anyway.
Another change is that much of Thrust has been standardized as part of the C++17 parallel algorithms library, so now there are alternatives like parallelstl that cover many of the parallel algorithms provided by Thrust.
@andrewcorrigan Oh I didn't realize that! so it would be better now for thrust to just not duplicate that functionality because it all works with standard. I knew it was odd when I was able to just use std::get on thrust::tuples. Is cuda an actual backend to parrallel algorithms? So far it seems GCC is not up to date with that part of the standard and has only experimentally merged intels parallel implementation.
There are indications [1] that there will be CUDA support in the parallel algorithms library implementation, but I don't know anything other than what I read on public mailing lists.
[1] http://lists.llvm.org/pipermail/llvm-dev/2018-October/126998.html
Hi @andrewcorrigan @brycelelbach @allisonvacanti Just checking in: Any progress for improving thrust::tuple
after NVIDIA/thrust#1310? I just realized that thrust::tuple
cannot contain more than 10 elements and there is no WAR.
(We can work around on our side so it's ok if this is not on the near-term roadmap, but it's still nice to have.)
#include <tuple>
#include <thrust/tuple.h>
#include <thrust/type_traits/integer_sequence.h>
template <typename F, size_t... Is>
auto gen_tuple_impl(F func, std::index_sequence<Is...> ) {
return std::make_tuple(func(Is)...);
}
template <size_t N, typename F>
auto gen_tuple(F func) {
return gen_tuple_impl(func, std::make_index_sequence<N>{} );
}
template <typename F, size_t... Is>
auto gen_thrust_tuple_impl(F func, thrust::index_sequence<Is...> ) {
return thrust::make_tuple(func(Is)...);
}
template <size_t N, typename F>
auto gen_thrust_tuple(F func) {
return gen_thrust_tuple_impl(func, thrust::make_index_sequence<N>{} );
}
int main() {
constexpr int NDIM = 11;
int data[NDIM];
auto tuple_std = gen_tuple<NDIM>([&](int i){ return data[i]; }); // this works
auto tuple_thrust = gen_thrust_tuple<NDIM>([&](int i){ return data[i]; }); // this does not
auto tuple_std2 = std::make_tuple(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11); // this works
auto tuple_thrust2 = thrust::make_tuple(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11); // this does not
return 0;
}
cc: @kmaehashi @asi1024
The only other info on this that I'm aware of is here: https://github.com/NVIDIA/cccl/issues/742. Of course, I don't speak for NVIDIA / Thrust.
Ah yes, thanks for pointer Andrew. I've seen it but forgot about it...
thrust::tuple does not use variadic templates. A variadic template implementation of thrust::tuple would not just make the internal implementation more elegant, there are serious practical issues with the current thrust::tuple implementation:
In consideration of the fact that C++11 support in nvcc is still in the release candidate, undocumented, and requires host compiler support, I don't expect C++03 support to be dropped anytime soon. Instead, can a user-supplied compile-time macro (e.g. THRUST_USE_CXX11) or auto-detection of __cplusplus >= 201103L please be considered to switch to an alternate implementation?
[1] http://www.jot.fm/issues/issue_2008_02/article2.pdf