flatironinstitute / finufft

Non-uniform fast Fourier transform library of types 1,2,3 in dimensions 1,2,3

Benchmark blockgather (method 4) #309

Open. janden opened this issue 1 year ago

janden commented 1 year ago

Currently, we have two methods for dimensions one and two, GM (non-uniform point-driven) and SM (subproblems), while dimension three has a third method, blockgather. We should benchmark it against subproblems to see where it is useful.

ImanHosseini commented 1 year ago

For M = 2000000 (on an RTX 3060 Ti):

```
# for different n1/n2/n3 and m \in {1,2,4}
./cuperftest --N1=n1 --N2=n2 --N3=n3 --method=m
```

[figure: plot_2M] m=4 is much slower. 'spread_3d_block_gather' achieves very low (< 3%) memory throughput.

janden commented 1 year ago

@ImanHosseini Thanks for looking at this. Interesting that even at small sizes it performs much worse.

@MelodyShih Do you recall what the purpose of the block gather method was? Is there some other benefit to using it?

ahbarnett commented 1 year ago

Hi Iman,

Thanks for testing this - do you have an application in mind? I think method=4 is experimental (not discussed in our paper) and not recommended - can others comment?

I'm looping in @blackwer, who has recently improved cufinufft and wrote cuperftest. However, I'm alarmed that meth=2 is 5x slower than meth=1, when it is supposed to be faster.

If you study your graphs, there are two effects you are seeing:

i) M-dependent spreading cost (the intercept of the graphs). This appears to be 0.03s for meth=1,2, consistent with about 7e7 NU pts/s (if you are timing the average over 10 runs - are you?). Compare our results in the paper (Fig. 4): https://arxiv.org/abs/2102.08463 The default opts for cuperftest are single-prec (FP32), for which the V100 (our test) and your 3060 Ti appear to be similar on paper (16 Tflops). Our Fig. 4 (top-right panel) then gives around 8 ns (for SM, ie meth=2) or 14 ns (for GM-sort, ie meth=1) per NU pt, at tol=1e-5. In either case that's around 1e8 NU pt/sec, at N=100^3. So you are at 70% of this speed. Maybe you want to verify your timings vs our paper?

But the 0.4s intercept for meth=4 is clearly unusably slow per NU pt.

ii) linear-in-N growth, presumably due to cuFFT calls on a grid of size upsampfac^3 * N = 8N, where N := N1*N2*N3 (# uniform pts). This should be independent of method, but you can see it is not. This is a real mystery to me - can others comment?

Taking meth=1, slope is about 0.07s for 8N=1.3e8, ie 2e9 pts/sec for the FFT. This is for 3D complex fp32 of size 2^27. That seems about right. But I find it very hard to find wall-clock timings (not TFlops!) for 3D cufft online for, say, V100 - Melody?

meth=2 is 5x slower, but I don't get why, since meth doesn't affect the cufft call, right?

I encourage you also to run, eg:

```
cufinufft3d_test 1 1 256 256 256 2000000 1e-5 2e-4 f
```

Robert, maybe you can comment on the use of cuperftest? (Does it report an average time or a total time for the default n_runs = 10?) I see kerevalmeth=0 is set as the default here: https://github.com/flatironinstitute/finufft/blob/0e2371e7ad692eaf6c40fe31993b8dc3cdbdf54d/perftest/cuda/cuperftest.cu#L91 Should that be 1 (Horner)? (Although that shouldn't affect the N-growth of ii) above.)

Best, Alex

lu1and10 commented 1 year ago

But I find it very hard to find wall-clock timings (not TFlops!) for 3D cufft online for, say, V100 - Melody? [...] meth=2 is 5x slower, but I don't get why, since meth doesn't affect the cufft call, right?

I might have some idea. cuFFT 3D single-precision complex-to-complex on an H100 for a 256^3 problem size takes about 4 ms for exec only. For type 1, both method 1 and method 2 should spend the same amount of time on FFT, deconvolve, and shuffle.

The time difference between method 1 and method 2 should come from the spread kernel. Here the number of NU pts per cell (M/(upsampfac^3 * N)) makes the difference.

Loosely speaking, for method 1 we can think of each CUDA thread as representing an NU point. If the number of NU pts per cell is large, many threads contend for atomic updates of the same global-memory locations. As N grows, the number of NU pts per cell decreases, so there is less atomic contention.

On the other hand, for method 2 we can approximately think of each CUDA thread as representing a cell. As N grows, the number of NU pts per cell decreases; once it drops below 1, some threads (cells) have no point to process, and thread divergence appears, both intra-warp and inter-warp (since we use shared memory and synchronize the CUDA block).

I would guess that in the uniform random points case, if the number of NU pts per cell is > 2, method 2 wins; if it is < 1, method 1 wins. In the plot, with M=2e6 and N1=N2=N3=256, the number of NU pts per cell is much less than 1, so there is much more thread divergence for method 2.

I guess that if N1=N2=N3=10, there will be a lot of global-memory atomic-update contention for method 1, as each thread (NU point) tries to atomically update cells, while method 2 has much less atomic contention since each thread represents a cell, and shared-memory atomic updates should also be faster than global-memory ones.
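
To make the two pictures concrete, here is a toy 1D sketch of the two strategies (my illustration, not cufinufft's actual kernels): kernel width, bin sorting, and periodic wrap-around are assumed to be handled elsewhere, and strengths are real rather than complex.

```
#include <cuda_runtime.h>

// Method-1 style ("GM", point-driven): one thread per NU point, atomicAdd
// straight into the global fine grid. Contention grows with points per cell.
__global__ void spread_point_driven(int M, const float *x, const float *c,
                                    float *fine, int nf)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= M) return;
    int cell = min(nf - 1, max(0, (int)floorf(x[i])));  // nearest-cell "kernel"
    atomicAdd(&fine[cell], c[i]);
}

// Method-2 style ("SM", subproblem-driven): one block per contiguous chunk of
// the fine grid, with points pre-sorted into bins. Accumulation uses
// shared-memory atomics; a block whose bin is nearly empty has little work,
// which is the thread-divergence effect described above.
__global__ void spread_subprob(const int *bin_start, const int *bin_count,
                               const float *x, const float *c,
                               float *fine, int cells_per_bin)
{
    extern __shared__ float local[];                 // cells_per_bin floats
    int bin = blockIdx.x, off = bin * cells_per_bin;
    for (int j = threadIdx.x; j < cells_per_bin; j += blockDim.x) local[j] = 0.f;
    __syncthreads();
    for (int k = threadIdx.x; k < bin_count[bin]; k += blockDim.x) {
        int i = bin_start[bin] + k;                  // k-th point of this bin
        int cell = (int)floorf(x[i]) - off;          // local cell index
        atomicAdd(&local[cell], c[i]);               // shared-memory atomic
    }
    __syncthreads();
    for (int j = threadIdx.x; j < cells_per_bin; j += blockDim.x)
        atomicAdd(&fine[off + j], local[j]);         // one write-back per cell
}
```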

ImanHosseini commented 1 year ago

I do not have an application in mind, I just wanted to help with the issue. The numbers are totals over 10 runs (which is what the test prints out as "total"). The difference between methods 1 and 2 comes down to the 'spread_3d_subprob' kernel, as @lu1and10 suggested, and not the cufft calls. I have checked that indeed for high M, method 2 (the 'spread_3d_subprob' kernel) fares much better with 8x8x8 than with 256x256x256, directly due to less time spent waiting on barriers. In fact, looking at Warp States, at N=8^3 only 1.2% of the time were warps stalled on a barrier; this jumps to 76% for N=256^3, tanking performance.

I ran new tests in the small-N regime (N = 4^3 to 20^3) for M = 2K, 20K, 200K, 2M: [figures: image (1), image (2)]. At high N, the time is dominated by 'spread_3d_subprob', hence the worse slope compared to method 1. Method 1 is dominated by the 'spread_3d_nupts_driven' kernel, which has much better memory throughput (27% at N=256^3) since it has no syncs.

blackwer commented 1 year ago

m=4 is much slower. 'spread_3d_block_gather' achieves very low (< 3%) memory throughput.

Method 4 should be viewed as experimental, but I can provide some insight here. This only happens when using naive evaluation via kerevalmethod=0, rather than Horner's rule, kerevalmethod=1. Looking at the code, this is because, as written, the kernel evaluations are not being memoized in the 0 case. I.e. in the cube of width n around the point you are spreading, n^3 kernel evaluations are being made, whereas the Horner method is rightfully doing only n evaluations per dimension. I've updated the default to be 1, as it should be. There is little reason other than testing to use 0.
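
To illustrate the memoization point, here is a sketch (not the code on master; the "exp-sqrt" kernel shape and node layout are only illustrative): exploiting separability costs 3*ns expensive 1D evaluations plus cheap multiplies, instead of a fresh kernel evaluation at each of the ns^3 offsets in the cube.

```
#include <math.h>

__host__ __device__ inline float es_kernel(float z, float beta)
{
    // illustrative "exp-sqrt" kernel shape; zero outside |z| < 1
    return (fabsf(z) < 1.f) ? expf(beta * (sqrtf(1.f - z * z) - 1.f)) : 0.f;
}

// Separable / memoized version: 3*ns kernel evaluations, then ns^3 multiplies.
// dx, dy, dz are illustrative offsets of the NU point within the spreading cube.
__host__ __device__ inline void cube_weights(float dx, float dy, float dz,
                                             int ns, float beta,
                                             float *w /* size ns^3 */)
{
    float kx[16], ky[16], kz[16];                     // assumes ns <= 16
    for (int i = 0; i < ns; ++i) {
        kx[i] = es_kernel((2.f * i - dx) / ns, beta); // illustrative node layout
        ky[i] = es_kernel((2.f * i - dy) / ns, beta);
        kz[i] = es_kernel((2.f * i - dz) / ns, beta);
    }
    for (int k = 0; k < ns; ++k)
        for (int j = 0; j < ns; ++j)
            for (int i = 0; i < ns; ++i)
                w[(k * ns + j) * ns + i] = kx[i] * ky[j] * kz[k];
    // The un-memoized path effectively evaluated es_kernel inside this triple
    // loop, i.e. ns^3 exp/sqrt calls per NU point instead of 3*ns.
}
```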

I'll investigate the rest as the day goes on.

ImanHosseini commented 1 year ago

Using Horner, it's now better than method 2: [figure: plot_h]. Concentrating more on the low-N regime for fixed M: [figure: plot_hLR]. And for scaled M (M = 2*N1*N2*N3): [figure: plot_hLRsM]. An interesting view is this 2D map (of 40x40 data points) showing the best method for each <N, M>: [figure: cmap_40].

ahbarnett commented 1 year ago

Some responses:

Robert: thanks for setting default to horner (1). The ns^3 ker evals was an old bug of Melody's that should be totally removed - I thought there was no way to trigger that any more. (ie even kereval=0 should do 3*ns evals in 3D...)

Libin: Thanks for the timing. So a 512^3 (fine grid FFT) should be 32ms? Seems about right. Also for the detailed explanation - I agree, except for the fact that "shared memory" (on each core, used in meth 2, "SM" in paper) is supposed to be much faster than global memory (meth 1, "GM-sort" in paper). In our paper this is why SM was faster even though the density (another word for your NU pts per grid pt = M/8N) was around 1, not that high.

Iman: Thanks for the new timings - looks like meth=2 only helps at N << M? That was not our finding on the V100, where in the paper meth=2 was much faster. You will want to make sure kereval=1 is always used, for a fair comparison - it should now be the default as of https://github.com/flatironinstitute/finufft/commit/cf516fb33ec4df35db43dac7f239156f9fa99bfe

It seems that if you're reporting the total rather than the mean time from cuperftest, that is a huge 10x speedup relative to what I was saying in my last post to this issue. Ie, you are getting 6e8 NU pts/sec?

Do you agree with Libin's analysis?

However, as you can see from this confusion re total vs mean, it would be more useful for us (& users) to phrase your timings in terms of throughput (NU pts/sec), as in my last post, for various densities (M/(2^d N)). For many applications the density is around 0.1-10. But just giving the timing in secs for a problem is less useful.

I think in order to resolve the above we need to split out the "execute" timings for spread/interp, vs fft, vs deconvolve, vs H2D and D2H transfer. Over in the legacy repo there are spread{1,2,3}d testers which I found useful. Maybe we could bring them over and tweak them for the new interface?
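
A minimal way to get such a breakdown without nsys is to bracket each phase with CUDA events; a sketch (not cuperftest code; the phases in the usage comment are placeholders for wherever the copies, spread, cuFFT, and deconvolve calls sit in a driver):

```
#include <cuda_runtime.h>
#include <cstdio>

struct GpuTimer {
    cudaEvent_t t0, t1;
    GpuTimer()  { cudaEventCreate(&t0); cudaEventCreate(&t1); }
    ~GpuTimer() { cudaEventDestroy(t0); cudaEventDestroy(t1); }
    void start() { cudaEventRecord(t0); }
    float stop_ms() {                      // ms elapsed for the bracketed work
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, t0, t1);
        return ms;
    }
};

// usage sketch inside a test driver:
//   GpuTimer t;
//   t.start(); /* H2D copies    */ printf("H2D        %7.3f ms\n", t.stop_ms());
//   t.start(); /* spread kernel */ printf("spread     %7.3f ms\n", t.stop_ms());
//   t.start(); /* cufftExecC2C  */ printf("fft        %7.3f ms\n", t.stop_ms());
//   t.start(); /* deconvolve    */ printf("deconvolve %7.3f ms\n", t.stop_ms());
//   t.start(); /* D2H copy      */ printf("D2H        %7.3f ms\n", t.stop_ms());
```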

Thanks for your ongoing work on this library, Best, Alex

lu1and10 commented 1 year ago

Libin: Thanks for the timing. So a 512^3 (fine grid FFT) should be 32ms?

Yes, I tried it once on an H100 card.

Also for the detailed explanation - I agree, except for the fact that "shared memory" (on each core, used in meth 2, "SM" in paper) is supposed to be much faster than global memory (meth 1, "GM-sort" in paper). In our paper this is why SM was faster even though the density (another word for your NU pts per grid pt = M/8N) was around 1, not that high.

Yes, I agree that shared memory is much faster than global memory. What I mean is that when M/8N is much smaller than 1, method 2 has much more thread divergence, which causes the slowdown. When M/8N is much smaller than 1, method 1's global-memory atomic writes are still slow, but there is much less atomic contention (an atomic write is very slow if many threads try to update the same memory location; if threads are updating different locations, an atomic write should be the same speed as an ordinary global-memory write for method 1). I'm curious whether, for some medium-size N, say 128*128*128, there is a critical ratio M/8N at which method 2 becomes slower than method 1, though I'm not sure whether this ratio is GPU-device dependent or not.

lu1and10 commented 1 year ago

So a 512^3 (fine grid FFT) should be 32ms? Seems about right.

Sorry, I missed the upsampling factor; I was timing the FFT inside the cufinufft code. To correct: cuFFT 3D single-precision complex-to-complex on an H100 takes about 4 ms for a 512^3 size and about 32 ms for 1024^3.
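
For reference, numbers like these can be reproduced with a standalone cuFFT timing; a minimal sketch (not cufinufft code), here for a 512^3 single-precision C2C transform timed with CUDA events:

```
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int n = 512;                              // fine-grid size per dim
    size_t bytes = sizeof(cufftComplex) * (size_t)n * n * n;
    cufftComplex *data;
    cudaMalloc(&data, bytes);
    cudaMemset(data, 0, bytes);

    cufftHandle plan;
    cufftPlan3d(&plan, n, n, n, CUFFT_C2C);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // warm-up

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // the timed exec
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("512^3 FP32 C2C exec: %.2f ms\n", ms);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```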

ahbarnett commented 12 months ago

wow, that's 33 G pts/sec for complex single-prec cuFFT. nice. But H100 is a recent card and costs $30k :)

In general we need to set up some more informative benchmark code (ie, report NU pts/sec for some std problem sizes), and explain (eg in docs/devnotes.rst) how to run them via cmake.
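
As a sketch of what such a report could print (plain host code, assuming upsampfac=2; nothing like this exists in the repo yet):

```
#include <cstdio>
#include <cstdint>

// Convert a measured execute time into NU pts/sec, reporting the density
// M / (2^d * N) alongside it, where N = N1*...*Nd is the number of modes.
void report_throughput(int dim, int64_t M, int64_t N, double exec_seconds)
{
    double fine = (double)N;
    for (int d = 0; d < dim; ++d) fine *= 2.0;      // upsampfac = 2 per dim
    printf("M=%lld  N=%lld  density=%.3g  throughput=%.3g NU pts/s\n",
           (long long)M, (long long)N, (double)M / fine, (double)M / exec_seconds);
}
```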

blackwer commented 12 months ago

Robert: thanks for setting default to horner (1). The ns^3 ker evals was an old bug of Melody's that should be totally removed - I thought there was no way to trigger that any more. (ie even kereval=0 should do 3*ns evals in 3D...)

Yep. The fix for this is already on master now. I actually merged the two codepaths, so the spread/interp uses the exact same code except for the kernel evaluator. I've also merged in some improvements to cuperftest, though it still doesn't break down the execute step. I use nsys for those kinds of breakdowns:

```
rm -f report*; nsys profile ./perftest/cuda/cuperftest --N1=128 --N2=128 --N3=128 --M=$((10 * 128**3)) --m=4; nsys stats report1.nsys-rep --report gpukernsum --timeunit milliseconds
```

there is a critical ratio M/8N at which method 2 becomes slower than method 1, though I'm not sure whether this ratio is GPU-device dependent or not.

I was finding on my A6000 a crossover at around M = 5N iirc

ImanHosseini commented 12 months ago

I guess that (a) the crossover is GPU-dependent and (b) it also depends on N. Here is what I have from 40x40 data points for various N and M/8N (on the 3060 Ti): [figure: heatmap_Rdata_r40_mm16777216]. On the V100 it seems to be more in line with what you see on the A6000: [figure: heatmap_Rdata_r40_mm16777216_V100]

blackwer commented 12 months ago

I guess that (a) the crossover is GPU-dependent and (b) it also depends on N. Here is what I have from 40x40 data points for various N and M/8N (on the 3060 Ti):

Thanks for this! It could be useful as groundwork for a heuristic method selector. I have one point though: if you're using the total field of the output, this can be problematic at small N, since it includes the makeplan call. Most users interested in high throughput will only call makeplan once, and makeplan is also subject to considerably more noise in timings. You might want to consider just setpts+execute, or just execute, as both of those are more common workflows.
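
A sketch of what such a heuristic selector could look like (my illustration, not a cufinufft option; the crossover value is a per-GPU tunable, with ~5 NU points per mode echoing the A6000 observation above):

```
#include <cstdint>

// Pick the spread method from the point density M/N (NU points per mode):
// 2 (SM) for dense problems, 1 (GM) for sparse ones.
int choose_spread_method(int64_t M, int64_t N1, int64_t N2, int64_t N3,
                         double crossover = 5.0 /* GPU-dependent */)
{
    double density = (double)M / ((double)N1 * N2 * N3);
    return (density > crossover) ? 2 : 1;
}
```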

janden commented 11 months ago

@ImanHosseini Thanks for these timings. Perhaps there is something to be gained from a smarter selection of method here (between 1 and 2). That being said, it does seem very GPU-dependent, so the question here is whether we would need some sort of calibration step. But when would we run this? Or should we have something akin to FFTW_MEASURE here? Sounds like it could get quite complicated, and it might just be better to let the user make a (well-informed) decision about these options.
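
If a measure-style calibration were ever added, it could be as simple as timing one representative transform per candidate method and keeping the winner; a sketch, where run_once is a hypothetical user-supplied callback that does a setpts+execute with the given method:

```
#include <chrono>
#include <functional>

int measure_best_method(const std::function<void(int)> &run_once)
{
    int best = 1;
    double best_s = 1e300;
    for (int method : {1, 2, 4}) {                  // candidate spread methods
        auto a = std::chrono::steady_clock::now();
        run_once(method);                           // one trial transform
        auto b = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(b - a).count();
        if (s < best_s) { best_s = s; best = method; }
    }
    return best;
}
```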

Getting back to the main topic here, method 4. From what I can tell, it's (a) experimental and (b) not necessarily faster than the standard methods (1 and 2). The natural thing would be just to remove it in that case (same as we did with method 3).

blackwer commented 11 months ago

Getting back to the main topic here, method 4. From what I can tell, it's (a) experimental and (b) not necessarily faster than the standard methods (1 and 2). The natural thing would be just to remove it in that case (same as we did with method 3).

It is actually faster in many instances, so we shouldn't remove it. I've also since fixed the codepath issues with it. I think it could stand to have a bit more rigorous testing before the next release though.

Or should we have something akin to FFTW_MEASURE here?

I'm not against it -- I think it's a fine idea -- but should probably be put on the backburner for a while. Seems like a fun "rainy day" project.

janden commented 11 months ago

It is actually faster in many instances, so we shouldn't remove it. I've also since fixed the codepath issues with it. I think it could stand to have a bit more rigorous testing before the next release though.

Ok got it. Would be interesting to see the numbers on this when you have time.

I'm not against it -- I think it's a fine idea -- but should probably be put on the backburner for a while. Seems like a fun "rainy day" project.

Agreed.

ahbarnett commented 2 weeks ago

blockgather can be 2-3x faster than either SM or GM; see the recent tests in the discussion of #498.