Closed GoogleCodeExporter closed 9 years ago
And for the record, performing a transformed reduction yields correct results.
(ie transform to bool, reduce using maximum). This is my current workaround.
Original comment by ekel...@gmail.com
on 16 Jul 2010 at 12:26
Thanks for the report.
I can reproduce the bug with
$ nvcc anyofTest.cu -run
but the bug goes away with
$ nvcc anyofTest.cu -arch=sm_20 -run
I suspect this is a compiler bug. We'll investigate further.
Original comment by jaredhoberock
on 16 Jul 2010 at 5:07
[deleted comment]
Here's a simpler test case with the same pattern of true and false that
reproduces the error. Note that the test case has exactly 256 true values. If
the number of true values is changed to 255 or 257 then any_of returns the
correct result. I suspect that the true values are getting combined additively
instead of logically somewhere.
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <thrust/functional.h>
struct nonzeroPred : thrust::unary_function<float2, bool>{
__host__ __device__
bool operator()(float2 x) {
return ((fabs(x.x) > 0.0001f) || (fabs(x.y) > 0.0001f));
}
};
int main(void) {
thrust::host_vector<float2> input(1 << 16, make_float2(0.0f,0.0f));
for(size_t i = 0; i < 256; i++)
input[i + (1 << 15)] = make_float2(1.0f,1.0f);
thrust::device_vector<float2> inputDevice(input.begin(), input.end());
bool resultGPU = thrust::any_of(inputDevice.begin(), inputDevice.end(), nonzeroPred());
bool resultCPU = thrust::any_of(input.begin(), input.end(), nonzeroPred());
std::cout << "resultGPU: " << resultGPU << std::endl;
std::cout << "resultCPU: " << resultCPU << std::endl;
return 0;
}
Original comment by wnbell
on 16 Jul 2010 at 7:13
Even simpler now :)
I also determined that the number of CUDA blocks used in the reduction is
irrelevant. Even a single block produces the wrong answer.
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <thrust/functional.h>
int main(void)
{
thrust::host_vector<bool> input(1 << 16, false);
for(size_t i = 0; i < 255; i++)
input[i + (1 << 15)] = true;
thrust::device_vector<bool> inputDevice(input.begin(), input.end());
thrust::detail::equal_to_value<bool> pred(true);
bool resultGPU = thrust::any_of(inputDevice.begin(), inputDevice.end(), pred);
bool resultCPU = thrust::any_of(input.begin(), input.end(), pred);
std::cout << "resultGPU: " << resultGPU << std::endl;
std::cout << "resultCPU: " << resultCPU << std::endl;
return 0;
}
Original comment by wnbell
on 16 Jul 2010 at 7:40
Here's what I've learned so far:
1) The bug occurs even if you make the cuda::reduce_in_smem kernel completely
trivial (one thread processing the whole array in serial).
2) The problem had nothing to do with the 256-ness of the input. When
trivializing the reduction kernel as in 1) the problem was reproducible with
the sequence [false, true]. Basically, anything not beginning with true would
expose the problem. When using the normal cuda::reduce() the 256 case happened
to be one where no thread started with a true value.
3) The problem goes away if you pass device::reduce() an explicit sequence as
opposed to a zip_iterator (i.e. storing zipped sequence as tuples in memory)
4) The problem goes away if we change the body of any_of to the one-line
expression
return thrust::find_if(first, last, pred) != last;
The last point suggests that the (unnecessary) outermost transform_iterator(...
, pred) is causing a problem. This could be a miscompilation of the composite
iterator (which is nested like 4 deep) or some extremely subtle problem in
transform_iterator. Since the problem is not present when compiling with
arch=sm_20 I would suspect the former.
At this point I'm not very optimistic that we'll be able to root cause this
one, or that there is much value in doing so.
Whatever the underlying cause, we should replace the body of any_of() with
return thrust::find_if(first, last, pred) != last;
and replace the body of all_of() with
return thrust::find_if(first, last, thrust::detail::not1(pred)) != last;
since they are simpler than the current implementations.
Original comment by wnbell
on 16 Jul 2010 at 9:36
This issue was closed by revision bdb388604a.
Original comment by wnbell
on 21 Jul 2010 at 10:01
Original issue reported on code.google.com by
ekel...@gmail.com
on 16 Jul 2010 at 12:19Attachments: