Open blackwer opened 2 months ago
If you read https://finufft.readthedocs.io/en/latest/c_gpu.html you'll see meth>2 is unsupported.
I just verified meth=1,2 run fine for this test:
(base) ahb@workergpu094 /mnt/home/ahb/numerics/finufft/build> perftest/cuda/cuperftest --prec f --n_runs 5 --method 1 --sort 1 --N1 256 --N2 256 --N3 256 --kerevalmethod 1 --M 1E8 --tol 1E-1
# prec = f
# type = 1
# n_runs = 5
# N1 = 256
# N2 = 256
# N3 = 256
# M = 100000000
# ntransf = 1
# method = 1
# kerevalmethod = 1
# sort = 1
# tol = 0.1
event,count,tot(ms),mean(ms),std(ms),nupts/s,ns/nupt
host_to_device,1,433.910034,433.910034,0.000000,0.0,0.0
makeplan,1,17.662464,17.662464,0.000000,0.0,0.0
setpts,5,278.904816,55.780964,1.838613,1.79273e+09,0.557810
execute,5,634.794006,126.958801,0.327698,7.87657e+08,1.269588
device_to_host,1,28.293217,28.293217,0.000000,0.0,0.0
amortized,1,1393.605591,1393.605591,0.000000,3.58782e+08,2.787211
(base) ahb@workergpu094 /mnt/home/ahb/numerics/finufft/build> perftest/cuda/cuperftest --prec f --n_runs 5 --method 2 --sort 1 --N1 256 --N2 256 --N3 256 --kerevalmethod 1 --M 1E8 --tol 1E-1
# prec = f
# type = 1
# n_runs = 5
# N1 = 256
# N2 = 256
# N3 = 256
# M = 100000000
# ntransf = 1
# method = 2
# kerevalmethod = 1
# sort = 1
# tol = 0.1
event,count,tot(ms),mean(ms),std(ms),nupts/s,ns/nupt
host_to_device,1,436.325928,436.325928,0.000000,0.0,0.0
makeplan,1,18.078913,18.078913,0.000000,0.0,0.0
setpts,5,277.110779,55.422157,0.739676,1.80433e+09,0.554222
execute,5,585.890808,117.178162,0.604296,8.53401e+08,1.171782
device_to_host,1,29.232191,29.232191,0.000000,0.0,0.0
amortized,1,1346.681641,1346.681641,0.000000,3.71283e+08,2.693363
So I propose we ignore this. Unless someone wants to invest time benchmarking and debugging meth=4 (is it much faster?).
@DiamonDinoia Has method 4 shown to be faster in any of the tests? Wondering if we should commit this method to the graveyard or if we think it's salvageable and potentially worthwhile.
BlockGather can be faster than SM (optimized):
tolerance | BlockGather | SM |
---|---|---|
1E-1 | 2.949568e+08 | 2.973108e+09 |
1E-2 | 2.503649e+08 | 9.414288e+08 |
1E-3 | 2.113974e+08 | 4.285620e+08 |
1E-4 | 1.643028e+08 | 1.251205e+08 |
1E-5 | 1.260375e+08 | 7.427078e+07 |
1E-6 | 8.393369e+07 | 3.152199e+07 |
@DiamonDinoia pointed this out to me.
will exit out during the
spread_3d_block_gather
routine due to a memory access violation. This failure seems to be due to numerous integer overflow problems resulting in a wrongidxnupts
.https://github.com/flatironinstitute/finufft/blob/22c4f9adf57d4eb4c72afbe58253f841f09314be/src/cuda/3d/spreadinterp3d.cuh#L330-L347
The code consistently crashes at
cnow = c[ii];
above, becauseii
can become negative.Given the pervasive use of
int
in this code for various indexes and intermediate calculations, it's hard to nail down a specific line of code as the problem. The fix, usingint64
everywhere, may have performance problems. In the short, it might be useful to pin down exactly which routine is causing the overflow and check for it, returning if there's an issue with a more useful error message.