Open antequ opened 3 years ago
The CPU and GPU unit tests give very similar results, except for correctness and dual_correctness: the CPU version is wrong on 0 of 10000, while the GPU version is wrong on 10000 of 10000.
Unit tests (GPU version):
compiles: success
correctness: wrong on 10000 of 10000; most errors range from 0.1 to 0.4.
dual_correctness: same results as correctness (wrong on 10000 of 10000), with similar errors
multi_level:
fmmtl/unit_tests$ ./multi_level
WARNING: Expansion does not have a correct M2T!
WARNING: Expansion does not have a correct M2T!
rexact = 0.589133 -0.200385 -0.200385 -0.200385
rm2t1 = 0 0 0 0
[-0.589133 0.200385 0.200385 0.200385]
rm2t2 = 0 0 0 0
[-0.589133 0.200385 0.200385 0.200385]
rfmm = 0.589147 -0.200343 -0.200343 -0.200343
[1.44709e-05 4.28269e-05 4.28269e-05 4.28269e-05]
single_level:
fmmtl/unit_tests$ ./single_level
WARNING: Expansion does not have a correct M2T!
DIST: (0.8, 0.8, 0.8) : 1.38564
rexact = 0.589133 -0.200385 -0.200385 -0.200385
rm2t = 0 0 0 0
[-0.589133 0.200385 0.200385 0.200385]
rfmm = 0.589056 -0.201209 -0.201209 -0.201209
[-7.64378e-05 -0.00082331 -0.00082331 -0.00082331]
test_balltree: looks fine
test_bbfmm:
fmmtl/unit_tests$ ./test_bbfmm
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
has_init_multipole: 1
has_init_local: 1
has_S2M: 1
has_scalar_S2M: 0
has_vector_S2M: 1
has_S2L: 0
has_scalar_S2L: 0
has_vector_S2L: 0
has_M2M: 1
has_M2L: 1
has_L2L: 1
has_M2T: 0
has_scalar_M2T: 0
has_vector_M2T: 0
has_L2T: 1
has_scalar_L2T: 0
has_vector_L2T: 1
has_dynamic_MAC: 0
FMM in 0.158716 secs
FMM in 0.0063701 secs
FMM in 0.0063367 secs
Computing direct matvec...
Direct in 0.226268 secs
Vector relative error: 5.69574e-01
Average relative error: 3.63698e+00
Maximum relative error: 4.98108e+03
test_direct: fine
test_expansion: Lots of issues. Pasting the output for 16LaplaceSpherical as an example:
16LaplaceSpherical:
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
has_init_multipole: 1
has_init_local: 1
has_S2M: 1
has_scalar_S2M: 1
has_vector_S2M: 0
has_S2L: 0
has_scalar_S2L: 0
has_vector_S2L: 0
has_M2M: 1
has_M2L: 1
has_L2L: 1
has_M2T: 0
has_scalar_M2T: 0
has_vector_M2T: 0
has_L2T: 1
has_scalar_L2T: 1
has_vector_L2T: 0
has_dynamic_MAC: 0
test_gpu:
fmmtl/unit_tests$ ./test_gpu
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: no kernel image is available for execution on the device
Aborted (core dumped)
test_kdtree: looks reasonable
test_kernel: Just pasted a few:
10BiotSavart:
0 0 0
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
14RosenheadMoore:
0 0 0
has_eval_op: 1
has_transpose: 1
has_vector_S2T_symm: 0
has_vector_S2T_asymm: 0
test_ndtree: looks reasonable
test_s2t:
fmmtl/unit_tests$ ./test_s2t
CPU-GPU:
Vector relative error: 1
Average relative error: 1
Maximum relative error: 1
CPU-GPU Blocked:
Vector relative error: 1
Average relative error: 1
Maximum relative error: 1
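For reference, here is one way to read the three error figures. This is a minimal sketch that assumes the common definitions (vector error as a ratio of 2-norms, average and maximum taken over per-entry relative errors); how fmmtl actually computes them has not been checked here, and the function name relative_errors is hypothetical. Under these assumed definitions an all-zero result gives exactly 1 for every metric, which would match the test_s2t numbers if the GPU output were all zeros. That is a guess, not something the output confirms.

```python
import math

def relative_errors(approx, exact):
    """Return (vector, average, maximum) relative error of approx vs. exact.

    Assumed definitions, not taken from the fmmtl sources:
      vector  = ||approx - exact||_2 / ||exact||_2
      average = mean of |approx_i - exact_i| / |exact_i|
      maximum = max  of |approx_i - exact_i| / |exact_i|
    Requires equal lengths and nonzero exact entries.
    """
    diffs = [a - e for a, e in zip(approx, exact)]
    vector = math.sqrt(sum(d * d for d in diffs)) / math.sqrt(sum(e * e for e in exact))
    per_entry = [abs(d) / abs(e) for d, e in zip(diffs, exact)]
    return vector, sum(per_entry) / len(per_entry), max(per_entry)

# Values copied from the multi_level output in this report.
rexact = [0.589133, -0.200385, -0.200385, -0.200385]
rfmm = [0.589147, -0.200343, -0.200343, -0.200343]

# Errors of the FMM result against the exact result (small).
fmm_errors = relative_errors(rfmm, rexact)

# Errors of a hypothetical all-zero result against the exact result.
zero_errors = relative_errors([0.0] * len(rexact), rexact)
```

Comparing fmm_errors with zero_errors shows the difference between a result that is slightly off and one that was never written at all.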
test_vec:
fmmtl/unit_tests$ ./test_vec
Is POD: 0
Is trivial: 0
Is standard layout: 1
0 0 0
0 8 20
21.5407
8.24
4.38972
3.14
1 2.1 3.14 2
version:
fmmtl/unit_tests$ ./version
Using Thrust v1.9
Hi, I think the GPU implementation has computation correctness errors.
I built everything with CUDA enabled, and the results I get from running the examples in the examples directory have high relative error, meaning the output is unusable. To compare, I also built everything for the CPU, with CUDA disabled, and those results have relative errors close to 0. Can you fix the CUDA computation errors?
Examples: Here are the CUDA-enabled GPU versions:
On the other hand, kNN appears to be correct, implying there's just some kind of scale error:
In contrast, here are the errors from the CPU versions. To build them I ran
make clean && make -j34 error_biot error_laplace error_img NO_CUDA=1
after adding -fPIC to CXXFLAGS. I couldn't compile error_barycentric, but I don't need it.
I'll include more about the unit tests in the next reply.
Here is some info about my OS and hardware:
OS: Ubuntu 18.04
Compiled with Boost 1.65, g++-4.8, and nvcc V10.0.130.
GPU: NVIDIA GeForce GTX 1080, Driver 440.82, CUDA Driver 10.2 / Runtime 10.0, compute capability 6.1.
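A note on the test_gpu failure: "no kernel image is available for execution on the device" is the CUDA runtime error reported when the binary contains no compiled kernel that matches the GPU's architecture. Whether that is connected to the wrong results in the examples is not established here. A minimal sketch of the nvcc flag that targets a given capability such as the 6.1 listed here; the helper name gencode_flag is hypothetical, and where fmmtl's Makefile sets its architecture flags is not shown in this report.

```python
def gencode_flag(compute_capability):
    """Build an nvcc -gencode flag for a capability string such as '6.1'.

    gencode_flag is a hypothetical helper for illustration only;
    the flag syntax itself is standard nvcc.
    """
    major, minor = compute_capability.split(".")
    arch = f"{int(major)}{int(minor)}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"

# Capability 6.1 is the value reported for the GTX 1080.
flag = gencode_flag("6.1")  # "-gencode arch=compute_61,code=sm_61"
```

If the build targets a different architecture than the one the card reports, rebuilding with the matching flag would be a cheap thing to rule out before digging into the kernels themselves.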
I also tried replacing all doubles with floats in the codebase (and changing some of the epsilon tolerances accordingly), and tested error_biot. The CPU version matched the correct results above, and the GPU version produced the same erroneous results.
If you're interested, here's the result from my deviceQuery:
Please look into this and correct the GPU implementation!
Thanks, Ante