Closed j-c-c closed 10 months ago
I cannot reproduce your results. Pasting your script into a file and running it, I get:
(<class 'numpy.float64'>, <class 'numpy.complex128'>)
allclose: True, max_abs_diff: 0.0, rmse: 0.0
allclose: True, max_abs_diff: 0.0, rmse: 0.0
allclose: True, max_abs_diff: 0.0, rmse: 0.0
(<class 'numpy.float32'>, <class 'numpy.complex64'>)
allclose: True, max_abs_diff: 0.0, rmse: 0.0
allclose: True, max_abs_diff: 0.0, rmse: 0.0
allclose: True, max_abs_diff: 0.0, rmse: 0.0
I am on finufft 2.2.0.dev0 (the current pip install version), python 3.10, numpy 1.22.2. Same for 1 thread or 16 threads. As expected, there is zero difference. Can you give version, OS, and CPU details (OMP_NUM_THREADS, etc.)?
I get the same result on master using a local pip install as per Issue #340. So we'd be interested in the setup that causes this error. Thanks! Alex
PS I changed ntrans=3 to ntrans=N and removed a comma on the line for f_vol_2. We also prefer dtype=complex_dtype in the plan call now. Results are the same.
Same here, the difference is zero. I used the PyPI-hosted wheel to install the released 2.1.0 version, finufft-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
It would be good to see the failing setup for debugging.
Apologies, I should have provided some more detail.
I am using finufft version 2.2.0.dev0 cloned from master (getting the same behavior from 2.1.0), numpy 1.24.4, python 3.8, on a dual-socket machine with an Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz. For what it's worth, we are unable to reproduce this behavior on another machine.
After setting OMP_NUM_THREADS, I'm seeing that it only fails for more than 24 threads.
Thanks, I'm also getting failures for more than 24 threads now.
For res=23, 24 threads breaks. More generally, for res=m, OMP_NUM_THREADS=m+1 breaks for single precision. It seems to be a problem when the number of threads exceeds the problem size.
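For reference, the kind of thread-count sweep described above can be scripted. This is only a sketch: repro.py is a hypothetical stand-in for the test script from this issue, not part of finufft.

```python
# Hypothetical sweep over OMP_NUM_THREADS; "repro.py" stands in for the
# reproducer script attached to this issue.
import os
import subprocess

for nthreads in range(1, 49):
    env = dict(os.environ, OMP_NUM_THREADS=str(nthreads))
    result = subprocess.run(
        ["python", "repro.py"], env=env, capture_output=True, text=True
    )
    # Report the thread count alongside whatever diff the script prints.
    print(nthreads, result.stdout.strip())
```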
Fascinating. I don't think anyone has run such small problems on so many threads before - a new edge case! (In fact, you would probably be faster on 1 or just a few threads here, because of the small problem size, as we explain here: https://finufft.readthedocs.io/en/latest/trouble.html ) We will look into it!
FWIW, I can't get it to fail with res=7 on an 8-core Ryzen2 laptop running Ubuntu 22.10, with 8 or 16 threads, etc.
Yeah, definitely a small problem for multiple threads! This started out as a slightly larger problem (res=41) in one of our unit tests. We found it odd that it was failing only on the one machine, so figured it would be good to report.
@ahbarnett and @lu1and10, thanks for the feedback and for looking into it!
It seems to be associated with the number of threads, the number of physical cores, and the problem size. I can't get res=7 to break with any number of cores when using srun to request a designated number of cores on the FI cluster.
On my desktop with 24 cores, res=23 works with 1 to 8 threads, breaks with 9 to 16 threads, works with 17 to 24 threads, breaks with 25 to 48 threads, works with 49 to 72 threads, and breaks with 73 threads and beyond.
All double precision cases work; only single precision breaks. Interesting...
Need to reproduce in a C++ test and debug more.
Thanks for catching this; we need to look into it. Let us know if you have any findings.
My response to your findings about thread count, Libin, can only be a polite "WTF!" :)
I suggest we look into FFTW (just in case!), and the batching steps in src/finufft.cpp ...
Your intuition about FFTW seems to be correct: after changing FFTW_ESTIMATE to FFTW_MEASURE, it breaks at a different number of threads. Also, when linking against Intel MKL (MKL provides an FFT with the same interface as FFTW), I don't see the breakdown in the difference. I need to make a minimal reproducible example that runs only FFTW to confirm.
I've done a fair amount of investigating on this. Initially I thought this was an FFTW bug, but I'm starting to think it's more of a feature. Here are my basic findings:
- The max relative error is always at worst ~5E-5 for floats and ~1E-13 for doubles. This is typical for a lot of ordering/cancellation issues.
- While MKL doesn't naively reproduce the relative errors as the script is written, using a different number of threads for the single transforms vs the batch DOES have similar relative errors. I suspect MKL is threading inside of each individual transform rather than around the batch.
- finufft stores all plans with the number of threads it first encounters. I consider this a bug; it should use whatever the current number of threads is. The fix is trivial from #354 (bugfix/enhance: change all FFTW critical to mutex, nix FFTW_PLAN_SAFE) and would allow people to use nice libraries like threadpoolctl to set different thread counts across different calls (see the sketch after this list).
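As a rough illustration of that last point, this is the kind of per-call thread control the proposed fix would enable. It is a sketch only: it assumes the threadpoolctl package is installed and that finufft would pick up the active OpenMP limit at call time, which, per this discussion, it currently does not.

```python
# Sketch: vary the OpenMP thread limit across calls via threadpoolctl.
# This only behaves as intended if finufft reads the current thread count at
# execute time rather than latching the count seen when the plan was made.
import numpy as np
import finufft
from threadpoolctl import threadpool_limits

M = 10000
rng = np.random.default_rng(0)
x = 2 * np.pi * rng.random(M) - np.pi
c = rng.random(M) + 1j * rng.random(M)

with threadpool_limits(limits=1, user_api="openmp"):
    f1 = finufft.nufft1d1(x, c, 64, eps=1e-9)   # intended: 1 thread

with threadpool_limits(limits=8, user_api="openmp"):
    f8 = finufft.nufft1d1(x, c, 64, eps=1e-9)   # intended: 8 threads

print(np.max(np.abs(f1 - f8)))
```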
Thanks! Your analysis convinces me that it is a feature of FFTW: the relative differences between FFTW results from different plans are within the single/double precision tolerance.
@j-c-c, if you check your code's relative error instead of the absolute error, it seems to be within the single/double precision tolerance. What do you think?
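To make the relative-error suggestion concrete, here is a tiny sketch; f_single and f_batch are placeholders for the per-volume and stacked results from the original script, emulated here with synthetic data.

```python
# Sketch: compare relative rather than absolute error.  f_single / f_batch
# stand in for the per-volume and stacked outputs of the original script.
import numpy as np

def rel_err(a, b):
    # Max elementwise difference, scaled by the magnitude of the reference.
    return np.max(np.abs(a - b)) / np.max(np.abs(b))

rng = np.random.default_rng(0)
f_single = rng.standard_normal(1000).astype(np.float32)
f_batch = f_single + 1e-5 * f_single.max()  # emulate a float32-level discrepancy

# Single precision should land around 1e-5 and double precision around 1e-13,
# per the measurements reported above.
print(rel_err(f_batch, f_single))
```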
I'm still trying to isolate this with a pure fftw case, so I wouldn't consider this closed just yet, but that's where the evidence seems to be pointing heavily.
edit: the fft is a red herring. The input to the fft varies depending on the number of threads provided. Will dig in tomorrow to find the culprit.
@j-c-c After a bunch of analysis, this should be marked more of a "feature" than a bug. finufft only guarantees that your answer is correct to some tolerance, but not that the result will always be the same. I can't speak for @ahbarnett but... unfortunately, due to how fftw handles parallelism, I don't think we can promise consistency with ntransf for any number of threads but 1, since its results vary with the number of threads depending on batch size.
@ahbarnett should we just mark this as wontfix and close?
@blackwer Thank you for the deep dive here. Thanks @lu1and10 and @ahbarnett as well. For our sake it is enough to know what guarantees to expect.
Best, Josh
Closing as wontfix due to performance implications and limitations of fftw.
Using the python interface on Linux, I find that I'm getting different results when I transform a volume stack compared to transforming the volumes of the stack individually. I only see this behavior when using single precision.
I've tested this out on the latest stable release as well as a build from master. Here is a test script I'm using:
Below is the output of the script. The stack is allclose to singletons in double precision, but we're pretty far off in singles. Admittedly, the volume is a little contrived, but I get the failure on a random volume (commented out in the script) as well.