flatironinstitute / finufft

Non-uniform fast Fourier transform library of types 1,2,3 in dimensions 1,2,3

Different result when transforming stack vs singleton #363

Closed: j-c-c closed this issue 10 months ago

j-c-c commented 10 months ago

Using the python interface on Linux, I find that I'm getting different results when I transform a volume stack compared to transforming the volumes of the stack individually. I only see this behavior when using single precision.

I've tested this out on the latest stable release as well as a build from master. Here is a test script I'm using:

import finufft
import numpy as np

singles = np.float32, np.complex64
doubles = np.float64, np.complex128

def run_finufft(dtypes):
    # Generate a stack of volumes.
    dtype, complex_dtype = dtypes
    res = 23
    N = 3  # number of volumes
    vols = np.arange(1, 1 + N * res**3, dtype=dtype).reshape(N, res, res, res)
    # np.random.seed(1)
    # vols = np.random.randn(N, res, res, res)
    vols = np.array(vols, copy=False, dtype=complex_dtype, order="C")

    # Generate set of fourier grid points.
    pts_1d = np.linspace(-np.pi, np.pi, res, dtype=dtype)
    x, y, z = np.meshgrid(pts_1d, pts_1d, pts_1d)
    fourier_pts = np.vstack([x.flatten(), y.flatten(), z.flatten()])

    # Make finufft plan for stack of volumes.
    eps = 1e-8
    if np.finfo(dtype).eps > eps:
        eps = np.finfo(dtype).eps
    transform_plan = finufft.Plan(
        nufft_type=2, n_modes_or_dim=(res,) * 3, eps=eps, n_trans=3, dtype=dtype
    )
    transform_plan.setpts(*fourier_pts)

    # Execute transform on volume stack.
    f_vols = transform_plan.execute(vols)
    f_vols = f_vols.reshape(-1, res, res, res)

    # Make finufft plan for single volumes.
    transform_plan = finufft.Plan(
        nufft_type=2, n_modes_or_dim=(res,) * 3, eps=eps, n_trans=1, dtype=dtype
    )
    transform_plan.setpts(*fourier_pts)

    # Execute transform on single volumes.
    f_vol_0 = transform_plan.execute(vols[0]).reshape(res, res, res)
    f_vol_1 = transform_plan.execute(vols[1]).reshape(res, res, res)
    f_vol_2 = transform_plan.execute(
        vols[2],
    ).reshape(res, res, res)

    # Test that we get the same transforms with stack and singletons.
    print(dtypes)
    print(
        f"allclose: {np.allclose(f_vols[0], f_vol_0)}, ",
        f"max_abs_diff: {np.max(abs(f_vols[0] - f_vol_0))}, ",
        f"rmse: {np.sqrt(np.mean(abs(f_vols[0] - f_vol_0)**2))}",
    )
    print(
        f"allclose: {np.allclose(f_vols[1], f_vol_1)}, ",
        f"max_abs_diff: {np.max(abs(f_vols[1] - f_vol_1))}, ",
        f"rmse: {np.sqrt(np.mean(abs(f_vols[1] - f_vol_1)**2))}",
    )
    print(
        f"allclose: {np.allclose(f_vols[2], f_vol_2)}, ",
        f"max_abs_diff: {np.max(abs(f_vols[2] - f_vol_2))}, ",
        f"rmse: {np.sqrt(np.mean(abs(f_vols[2] - f_vol_2)**2))}",
    )

run_finufft(doubles)
run_finufft(singles)

Below is the output of the script. The stack is allclose to the singletons in double precision, but we're pretty far off in single precision.

(<class 'numpy.float64'>, <class 'numpy.complex128'>)
allclose: True,  max_abs_diff: 5.587935447692871e-09,  rmse: 1.3931081848390707e-10
allclose: True,  max_abs_diff: 7.4744308697948e-09,  rmse: 3.2122794922856015e-10
allclose: True,  max_abs_diff: 1.626484952547865e-08,  rmse: 4.898117388072168e-10
(<class 'numpy.float32'>, <class 'numpy.complex64'>)
allclose: False,  max_abs_diff: 2.0,  rmse: 0.055754758417606354
allclose: False,  max_abs_diff: 16.03447914123535,  rmse: 0.21659733355045319
allclose: False,  max_abs_diff: 8.000244140625,  rmse: 0.24948608875274658

Admittedly, the volume is a little contrived, but I get the failure on a random volume (commented out in the script) as well.

ahbarnett commented 10 months ago

I cannot reproduce your results. Pasting your script into a file and running it, I get:

(<class 'numpy.float64'>, <class 'numpy.complex128'>)
allclose: True,  max_abs_diff: 0.0,  rmse: 0.0
allclose: True,  max_abs_diff: 0.0,  rmse: 0.0
allclose: True,  max_abs_diff: 0.0,  rmse: 0.0
(<class 'numpy.float32'>, <class 'numpy.complex64'>)
allclose: True,  max_abs_diff: 0.0,  rmse: 0.0
allclose: True,  max_abs_diff: 0.0,  rmse: 0.0
allclose: True,  max_abs_diff: 0.0,  rmse: 0.0

I am on finufft 2.2.0.dev0 (the current pip install version), python 3.10, numpy 1.22.2. Same for 1-thread or 16-threads. As expected there is zero difference. Can you give version, OS, CPU details? (OMP_NUM_THREADS, etc).
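To make setup comparisons like this easier, here is a generic sketch (my own, not part of finufft) that dumps the relevant version and threading details; the finufft import is guarded in case it isn't installed:

```python
import os
import platform
import sys

import numpy as np

def env_report():
    """Collect version and threading details relevant to reproducing this issue."""
    try:
        import finufft
        finufft_ver = getattr(finufft, "__version__", "unknown")
    except ImportError:
        finufft_ver = "not installed"
    return {
        "python": sys.version.split()[0],
        "numpy": np.__version__,
        "finufft": finufft_ver,
        "platform": platform.platform(),
        "cpus": os.cpu_count(),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS", "<unset>"),
    }

for key, val in env_report().items():
    print(f"{key}: {val}")
```

Including output like this alongside a report makes thread-count-dependent bugs much easier to triangulate.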

ahbarnett commented 10 months ago

I get same result on master using local pip install as per Issue #340. So, we'd be interested in your setup to cause this error. Thanks! Alex

PS I changed ntrans=3 to ntrans=N and removed the trailing comma on the f_vol_2 line. We also prefer dtype=complex_dtype in the plan call now. Results are the same.

lu1and10 commented 10 months ago

Same here, the difference is zero. I used the pypi hosted wheel to install 2.1.0 released version, finufft-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

It would be good to see the failing setup for debugging.

j-c-c commented 10 months ago

Apologies, I should have provided some more detail.

I am using the finufft version 2.2.0.dev0 cloned from master (getting the same behavior from 2.1.0), numpy 1.24.4, python 3.8, on a dual socket machine with an Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz chip. For what it's worth we are unable to reproduce this behavior on another machine.

After setting OMP_NUM_THREADS I'm seeing that we are only failing for greater than 24 threads.
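One way to hunt for the boundary is to sweep OMP_NUM_THREADS per run; a minimal sketch (repro.py is a placeholder name for the script above):

```shell
# Sweep thread counts around the suspected boundary (24 threads for res=23).
# Replace the inline python command with `python3 repro.py` (placeholder name)
# to run the actual reproducer at each thread count.
for t in 1 8 16 24 25 32; do
  OMP_NUM_THREADS=$t python3 -c 'import os; print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])'
done
```

Setting the variable per invocation, rather than exporting it once, keeps each run's thread count explicit in the output.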

lu1and10 commented 10 months ago

Thanks, I'm now also seeing the failure for more than 24 threads.

lu1and10 commented 10 months ago

For res=23, it breaks at 24 threads. For res=m, it breaks at OMP_NUM_THREADS=m+1 in single precision. It seems to be a problem when the number of threads exceeds the problem size.

ahbarnett commented 10 months ago

Fascinating. I don't think anyone has run such small problems on so many threads before - a new edge case! (In fact, you would probably be faster on 1 or just a few threads here, because of the small problem size, as we explain here: https://finufft.readthedocs.io/en/latest/trouble.html ) We will look into it!


ahbarnett commented 10 months ago

FWIW, I can't get it to fail with res=7 on 8-core Ryzen2 laptop running ubuntu 22.10, 8 or 16 threads, etc.


j-c-c commented 10 months ago

Fascinating. I don't think anyone has run such small problems on so many threads before - a new edge case! (In fact, you would probably be faster on 1 or just a few threads here, because of the small problem size, as we explain here: https://finufft.readthedocs.io/en/latest/trouble.html ) We will look into it!

Yeah, definitely a small problem for multiple threads! This started out as a slightly larger problem (res=41) in one of our unit tests. We found it odd that it was failing only on the one machine, so figured it would be good to report.

@ahbarnett and @lu1and10, thanks for the feedback and for looking into it!

lu1and10 commented 10 months ago

It seems to be associated with the number of threads, the number of physical cores, and the problem size. I can't get res=7 to break with any number of cores using srun to request a given number of cores on the FI cluster. On my 24-core desktop, res=23 works with 1 to 8 threads, breaks with 9 to 16, works with 17 to 24, breaks with 25 to 48, works with 49 to 72, and breaks with 73 and beyond. Double precision always works; only single precision breaks. Interesting... Need to reproduce in a C++ test and debug more.


lu1and10 commented 10 months ago

Thanks for catching this; we need to look into it. Let us know if you have any findings.

ahbarnett commented 10 months ago

My response to your findings about thread count, Libin, can only be a polite "WTF!" :)

I suggest we look into FFTW (just in case!) and the batching steps in src/finufft.cpp ...

lu1and10 commented 10 months ago

Your intuition about FFTW seems to be correct: changing FFTW_ESTIMATE to FFTW_MEASURE makes it break at a different number of threads. Also, linking against Intel MKL (MKL provides an FFT with the same interface as FFTW), I don't see the breakdown. Need to make a minimal reproducible example that only runs FFTW to confirm.


blackwer commented 10 months ago

I've done a fair amount of investigating on this. Initially I thought this was an FFTW bug, but I'm starting to think it's more of a feature. Here are my basic findings:

  1. The max relative error is always at worst ~5E-5 for floats and ~1E-13 for doubles. This is typical of ordering/cancellation issues.
  2. While MKL doesn't naively reproduce the relative errors as the script is written, using a different number of threads for the single transforms vs the batch DOES produce similar relative errors. I suspect MKL threads inside each individual transform rather than across the batch.
  3. finufft stores every plan with the number of threads it first encounters. I consider this a bug; it should use whatever the current number of threads is. The fix is trivial from #354 and would allow people to use nice libraries like threadpoolctl to set different thread counts across different calls.

lu1and10 commented 10 months ago

Thanks! Your analysis convinces me that it is a feature of FFTW: the relative differences between FFTW plans are within the single/double precision tolerance.

@j-c-c if you check your code's relative error instead of the absolute error, it seems to be within the single/double precision tolerance. What do you think?
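To illustrate the point, here is a small numpy-only sketch of a relative-error check (the helper name is my own, not from the script above): a difference that looks large in absolute terms can still be a ~1-ulp relative disagreement in float32.

```python
import numpy as np

def max_rel_error(a, b):
    """Maximum elementwise relative error, guarding against zero denominators."""
    a, b = np.asarray(a), np.asarray(b)
    denom = np.maximum(np.abs(a), np.abs(b))
    denom = np.where(denom == 0, 1.0, denom)
    return float(np.max(np.abs(a - b) / denom))

x = np.float32(16384.0)
y = np.float32(16385.0)                       # absolute difference of 1.0
print("abs diff:", abs(float(x) - float(y)))  # 1.0, looks alarming
print("rel diff:", max_rel_error(x, y))       # ~6e-5, modest in float32 terms
```

Judged this way, the max_abs_diff values of 2.0 to 16.0 in the original report may well correspond to small relative errors on large-magnitude Fourier coefficients.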

blackwer commented 10 months ago

I'm still trying to isolate this with a pure FFTW case, so I wouldn't consider this closed just yet, but that's where the evidence seems to be pointing heavily. Edit: the FFT is a red herring. The input to the FFT varies depending on the number of threads provided. Will dig in tomorrow to find the culprit.

blackwer commented 10 months ago

@j-c-c After a bunch of analysis, this should be marked more of a "feature" than a bug. finufft only guarantees that your answer is correct to some tolerance, but not that the result will always be the same. I can't speak for @ahbarnett but... unfortunately, due to how fftw handles parallelism, I don't think we can promise consistency with ntransf for any number of threads but 1, since its results vary with the number of threads depending on batch size.
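The ordering sensitivity behind this is easy to see with plain numpy (a generic illustration, not finufft code): summing the same float32 data in two different orders typically gives slightly different answers, both within tolerance of the exact sum.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000).astype(np.float32)

s_pairwise = np.float64(np.sum(x))   # numpy uses pairwise summation internally
s_seq = np.float32(0.0)
for v in x:                          # strict left-to-right accumulation
    s_seq = np.float32(s_seq + v)
s_seq = np.float64(s_seq)

s_exact = np.sum(x.astype(np.float64))  # much more accurate reference
print("pairwise:", s_pairwise, " sequential:", s_seq)
print("difference between orders:", abs(s_pairwise - s_seq))
# Both orders agree with the float64 reference to roughly float32 precision,
# yet need not agree bit-for-bit with each other.
```

Changing the thread count changes how partial sums are grouped, which is exactly this effect at work inside the transform.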

@ahbarnett should we just mark this as wontfix and close?

j-c-c commented 10 months ago

@blackwer Thank you for the deep dive here. Thanks @lu1and10 and @ahbarnett as well. For our sake it is enough to know what guarantees to expect.

Best, Josh

blackwer commented 10 months ago

Closing as wontfix due to the performance implications and limitations of FFTW.