GPU implementation has high relative error for provided examples

Hi, I think the GPU implementation computes incorrect results.

I built everything with CUDA enabled, and the examples in the examples directory report high relative errors, which makes the output unusable. For comparison, I also built everything for the CPU with CUDA disabled, and those results have relative errors close to 0. Can you fix the CUDA computation errors?

Examples: Here are the CUDA-enabled GPU versions:

fmmtl/examples$ ./error_laplace
FMM in 0.216471 secs
FMM in 0.0459567 secs
FMM in 0.0447383 secs
Computing direct matvec...
Direct in 0.922225 secs
Vector  relative error: 0.623028
Average relative error: 0.444972
Maximum relative error: 1.20022

fmmtl/examples$ ./error_biot
FMM in 0.252116 secs
FMM in 0.0779735 secs
FMM in 0.0770103 secs
Computing direct matvec...
Direct in 1.33767 secs
Vector  relative error: 0.684924
Average relative error: 0.489844
Maximum relative error: 7.83254

fmmtl/examples$ ./error_barycentric
FMM in 0.152405 secs
FMM in 0.00364061 secs
FMM in 0.00367243 secs
Computing direct matvec...
Direct in 0.163656 secs
Vector  relative error: 0.999659
Average relative error: 3.72861
Maximum relative error: 13593

fmmtl/examples$ ./error_img
Initializing source and N = 1048576 targets...
Building the kernel matrix...
Performing the kernel matrix-vector mult...
Computing direct kernel matrix-vector mult...
Computing the errors...
Min log error: -16
Max log error: 0

On the other hand, kNN appears to be correct, which might mean there's just some kind of scaling error:

fmmtl/examples$ ./kNN
Construct: 0.000110769
Traverse: 0.00758732
Computing direct...
Direct: 0.0059448
Wrong counts: 0 of 1000
((0.00202416, 907), (0.00437583, 298), (0.00447187, 507), (0.00450509, 178), (0.00470155, 833))

In contrast, here are the errors from the CPU versions. I built these with make clean && make -j34 error_biot error_laplace error_img NO_CUDA=1, after adding -fPIC to CXXFLAGS. I couldn't compile error_barycentric, but I don't need it.

fmmtl/examples$ ./error_laplace
FMM in 0.0921008 secs
FMM in 0.0676527 secs
FMM in 0.067525 secs
Computing direct matvec...
Direct in 0.895511 secs
Vector  relative error: 3.36163e-05
Average relative error: 2.86249e-05
Maximum relative error: 0.000357396

fmmtl/examples$ ./error_biot
FMM in 0.135726 secs
FMM in 0.0971907 secs
FMM in 0.104078 secs
Computing direct matvec...
Direct in 1.3523 secs
Vector  relative error: 4.42274e-05
Average relative error: 5.30808e-05
Maximum relative error: 0.00247941

fmmtl/examples$ ./error_img
Initializing source and N = 1048576 targets...
Building the kernel matrix...
Performing the kernel matrix-vector mult...
Computing direct kernel matrix-vector mult...
Computing the errors...
Min log error: -16
Max log error: -1.90081
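
For reference, here is a minimal sketch of how the three reported metrics are commonly defined. This is an assumption about the usual definitions, not code taken from fmmtl, and ErrorReport and relative_errors are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the three metrics printed by the examples.
struct ErrorReport {
  double vector_rel;   // ||approx - exact||_2 / ||exact||_2
  double average_rel;  // mean over i of |approx_i - exact_i| / |exact_i|
  double maximum_rel;  // max  over i of |approx_i - exact_i| / |exact_i|
};

// Hypothetical helper: compares an approximate result (e.g. from the FMM)
// against a reference result (e.g. from the direct matvec).
// Assumes both vectors have the same nonzero length and no exact[i] is zero.
ErrorReport relative_errors(const std::vector<double>& exact,
                            const std::vector<double>& approx) {
  assert(exact.size() == approx.size() && !exact.empty());
  double diff2 = 0.0, norm2 = 0.0, sum_rel = 0.0, max_rel = 0.0;
  for (std::size_t i = 0; i < exact.size(); ++i) {
    const double d = std::abs(approx[i] - exact[i]);
    const double rel = d / std::abs(exact[i]);
    diff2 += d * d;                    // accumulate squared 2-norm of the difference
    norm2 += exact[i] * exact[i];      // accumulate squared 2-norm of the reference
    sum_rel += rel;
    max_rel = std::max(max_rel, rel);
  }
  return {std::sqrt(diff2 / norm2), sum_rel / exact.size(), max_rel};
}
```

Under these assumed definitions, the per-element maximum can be far larger than 1 whenever an exact entry is close to zero, which would be consistent with values such as 13593 in the GPU output above.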

I'll include more about the unit tests in the next reply.

Here is some info about my OS and hardware:

OS: Ubuntu 18.04
Compiled with: Boost 1.65, g++-4.8, and nvcc V10.0.130
GPU: Nvidia GeForce GTX 1080, driver 440.82, CUDA driver 10.2 / runtime 10.0, compute capability 6.1

I also tried replacing all doubles with floats in the codebase (and changing some of the epsilon tolerances accordingly), and tested error_biot. The CPU version matched the correct results above, and the GPU version produced the same erroneous results.
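
As an illustration of what adjusting the tolerances for a different floating-point type can look like, here is a minimal sketch that derives the tolerance from the type's machine epsilon instead of a hard-coded constant. The names tolerance and nearly_equal and the scale factor of 100 are hypothetical and are not taken from the fmmtl codebase.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: a tolerance that scales with the precision of T,
// so the same code works when double is replaced by float.
template <typename T>
T tolerance(T scale = T(100)) {
  assert(scale > T(0));
  return scale * std::numeric_limits<T>::epsilon();
}

// Hypothetical helper: compares two values with a tolerance that is
// relative for large magnitudes and absolute for magnitudes below 1.
template <typename T>
bool nearly_equal(T a, T b) {
  const T magnitude = std::max(std::abs(a), std::abs(b));
  return std::abs(a - b) <= tolerance<T>() * std::max(magnitude, T(1));
}
```

With this approach, a difference of about 1e-6 near 1 passes for float but fails for double, so a single-precision build is not held to double-precision tolerances.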

If you're interested, here's the result from my deviceQuery:

Device 0: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          10.2 / 10.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8118 MBytes (8512602112 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1734 MHz (1.73 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 79 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Please look into this and correct the GPU implementation!

Thanks, Ante

The CPU and GPU unit tests have very similar results, with the exception of correctness and dual_correctness: the CPU version is wrong on 0 of 10000, while the GPU version is wrong on 10000 of 10000.

Unit tests (GPU version):

compiles: success

correctness: wrong on 10000 of 10000; most errors range from 0.1 to 0.4.

dual_correctness: same results as correctness (wrong on 10000 of 10000), with similar errors.

multi_level:

fmmtl/unit_tests$ ./multi_level
WARNING: Expansion does not have a correct M2T!
WARNING: Expansion does not have a correct M2T!
rexact = 0.589133 -0.200385 -0.200385 -0.200385
rm2t1 = 0 0 0 0
    [-0.589133 0.200385 0.200385 0.200385]
rm2t2 = 0 0 0 0
    [-0.589133 0.200385 0.200385 0.200385]
rfmm = 0.589147 -0.200343 -0.200343 -0.200343
    [1.44709e-05 4.28269e-05 4.28269e-05 4.28269e-05]

single_level:

fmmtl/unit_tests$ ./single_level
WARNING: Expansion does not have a correct M2T!
DIST: (0.8, 0.8, 0.8) : 1.38564
rexact = 0.589133 -0.200385 -0.200385 -0.200385
rm2t = 0 0 0 0
    [-0.589133 0.200385 0.200385 0.200385]
rfmm = 0.589056 -0.201209 -0.201209 -0.201209
    [-7.64378e-05 -0.00082331 -0.00082331 -0.00082331]

test_balltree: looks fine

test_bbfmm:

fmmtl/unit_tests$ ./test_bbfmm
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
has_init_multipole: 1
has_init_local: 1
has_S2M: 1
  has_scalar_S2M: 0
  has_vector_S2M: 1
has_S2L: 0
  has_scalar_S2L: 0
  has_vector_S2L: 0
has_M2M: 1
has_M2L: 1
has_L2L: 1
has_M2T: 0
  has_scalar_M2T: 0
  has_vector_M2T: 0
has_L2T: 1
  has_scalar_L2T: 0
  has_vector_L2T: 1
has_dynamic_MAC: 0
FMM in 0.158716 secs
FMM in 0.0063701 secs
FMM in 0.0063367 secs
Computing direct matvec...
Direct in 0.226268 secs
Vector  relative error: 5.69574e-01
Average relative error: 3.63698e+00
Maximum relative error: 4.98108e+03

test_direct: fine

test_expansion: Lots of issues. Pasting the output for 16LaplaceSpherical as an example:

16LaplaceSpherical:
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
has_init_multipole: 1
has_init_local: 1
has_S2M: 1
  has_scalar_S2M: 1
  has_vector_S2M: 0
has_S2L: 0
  has_scalar_S2L: 0
  has_vector_S2L: 0
has_M2M: 1
has_M2L: 1
has_L2L: 1
has_M2T: 0
  has_scalar_M2T: 0
  has_vector_M2T: 0
has_L2T: 1
  has_scalar_L2T: 1
  has_vector_L2T: 0
has_dynamic_MAC: 0

test_gpu:

fmmtl/unit_tests$ ./test_gpu
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: no kernel image is available for execution on the device
Aborted (core dumped)
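
The "no kernel image is available for execution on the device" message is what the CUDA runtime reports when a binary contains no code compatible with the GPU it runs on. One possible cause, not verified against this repository's build files, is that the nvcc architecture flags do not cover compute capability 6.1. The sketch below shows how the matching nvcc option is formed from a compute capability; gencode_flag is a hypothetical helper for illustration, not part of fmmtl.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of fmmtl): builds the nvcc option that
// generates native code for one compute capability, e.g. 6.1 -> sm_61.
std::string gencode_flag(int major, int minor) {
  assert(major >= 1 && minor >= 0 && minor <= 9);
  // Compute capability "major.minor" is written without the dot, e.g. "61".
  const std::string cc = std::to_string(major) + std::to_string(minor);
  return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}
```

For the GTX 1080 above (capability 6.1), gencode_flag(6, 1) gives -gencode arch=compute_61,code=sm_61. Whether and where the fmmtl Makefile accepts such a flag is not shown by the output here. If kernels fail to launch for this reason, output buffers could stay at their initial values, which might also explain the relative errors of exactly 1 in the test_s2t output below.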

test_kdtree: looks reasonable

test_kernel: Just pasted a few:

10BiotSavart:
0 0 0
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0

14RosenheadMoore:
0 0 0
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0

test_ndtree: looks reasonable

test_s2t:

fmmtl/unit_tests$ ./test_s2t
CPU-GPU:
  Vector  relative error: 1
  Average relative error: 1
  Maximum relative error: 1

CPU-GPU Blocked:
  Vector  relative error: 1
  Average relative error: 1
  Maximum relative error: 1

test_vec:

fmmtl/unit_tests$ ./test_vec
Is POD: 0
Is trivial: 0
Is standard layout: 1
0 0 0
0 8 20
21.5407
8.24
4.38972
3.14
1 2.1 3.14 2

version:

fmmtl/unit_tests$ ./version
Using Thrust v1.9
