allendaicool / thrust

Automatically exported from code.google.com/p/thrust
Apache License 2.0

Bug in any_of on Fermi hardware #179

Closed (GoogleCodeExporter closed this issue 9 years ago)

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run the attached code with the attached data file in the same directory.

What is the expected output? What do you see instead?
Both the CPU and GPU results should be 1. On Fermi hardware, the GPU result is 0.

What version of the product are you using? On what operating system?
The latest Mercurial version of Thrust, CUDA 3.1, 64-bit SUSE 11.1.

Please provide any additional information below.

Original issue reported on code.google.com by ekel...@gmail.com on 16 Jul 2010 at 12:19

Attachments:

GoogleCodeExporter commented 9 years ago
For the record, performing a transformed reduction (i.e., transform to bool, then reduce using maximum) yields correct results. This is my current workaround.

Original comment by ekel...@gmail.com on 16 Jul 2010 at 12:26

GoogleCodeExporter commented 9 years ago
Thanks for the report.

I can reproduce the bug with

$ nvcc anyofTest.cu -run

but the bug goes away with

$ nvcc anyofTest.cu -arch=sm_20 -run

I suspect this is a compiler bug.  We'll investigate further.

Original comment by jaredhoberock on 16 Jul 2010 at 5:07

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Here's a simpler test case with the same pattern of true and false values that reproduces the error. Note that the test case contains exactly 256 true values. If the number of true values is changed to 255 or 257, any_of returns the correct result. I suspect the true values are getting combined additively instead of logically somewhere.

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <thrust/functional.h>

struct nonzeroPred : thrust::unary_function<float2, bool>{
  __host__ __device__
  bool operator()(float2 x) {
    return ((fabs(x.x) > 0.0001f) || (fabs(x.y) > 0.0001f));
  }
};

int main(void) {
    thrust::host_vector<float2> input(1 << 16, make_float2(0.0f,0.0f));

    for(size_t i = 0; i < 256; i++)
        input[i + (1 << 15)] = make_float2(1.0f,1.0f);

    thrust::device_vector<float2> inputDevice(input.begin(), input.end());

    bool resultGPU = thrust::any_of(inputDevice.begin(), inputDevice.end(), nonzeroPred());
    bool resultCPU = thrust::any_of(input.begin(), input.end(), nonzeroPred());

    std::cout << "resultGPU: " << resultGPU << std::endl;
    std::cout << "resultCPU: " << resultCPU << std::endl;

    return 0;
}

Original comment by wnbell on 16 Jul 2010 at 7:13

GoogleCodeExporter commented 9 years ago
Even simpler now :)

I also determined that the number of CUDA blocks used in the reduction is 
irrelevant.  Even a single block produces the wrong answer.

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <thrust/functional.h>

int main(void)
{
    thrust::host_vector<bool> input(1 << 16, false);

    for(size_t i = 0; i < 255; i++)
        input[i + (1 << 15)] = true;

    thrust::device_vector<bool> inputDevice(input.begin(), input.end());

    thrust::detail::equal_to_value<bool> pred(true);

    bool resultGPU = thrust::any_of(inputDevice.begin(), inputDevice.end(), pred);
    bool resultCPU = thrust::any_of(input.begin(), input.end(), pred);

    std::cout << "resultGPU: " << resultGPU << std::endl;
    std::cout << "resultCPU: " << resultCPU << std::endl;

    return 0;
}

Original comment by wnbell on 16 Jul 2010 at 7:40

GoogleCodeExporter commented 9 years ago
Here's what I've learned so far:

1) The bug occurs even if you make the cuda::reduce_in_smem kernel completely 
trivial (one thread processing the whole array serially).
2) The problem had nothing to do with the 256-ness of the input. With the 
reduction kernel trivialized as in 1), the problem was reproducible with the 
sequence [false, true]; basically, any input not beginning with true would 
expose it. With the normal cuda::reduce(), the 256 case happened to be one 
where no thread started with a true value.
3) The problem goes away if you pass device::reduce() an explicit sequence 
instead of a zip_iterator (i.e., storing the zipped sequence as tuples in 
memory).
4) The problem goes away if we change the body of any_of to the one-line 
expression
    return thrust::find_if(first, last, pred) != last;

The last point suggests that the (unnecessary) outermost transform_iterator(... 
, pred) is causing the problem. This could be a miscompilation of the composite 
iterator (which is nested roughly four levels deep) or some extremely subtle 
problem in transform_iterator itself. Since the problem is not present when 
compiling with -arch=sm_20, I suspect the former.

At this point I'm not very optimistic that we'll be able to root-cause this 
one, or that there is much value in doing so.

Whatever the underlying cause, we should replace the body of any_of() with
    return thrust::find_if(first, last, pred) != last;
and the body of all_of() with
    return thrust::find_if(first, last, thrust::detail::not1(pred)) == last;
since these are simpler than the current implementations. (all_of holds when 
no element fails the predicate, hence the == last test.)

Original comment by wnbell on 16 Jul 2010 at 9:36

GoogleCodeExporter commented 9 years ago
This issue was closed by revision bdb388604a.

Original comment by wnbell on 21 Jul 2010 at 10:01