flatironinstitute / cufinufft

Nonuniform fast Fourier transforms of types 1 and 2, in 1D, 2D, and 3D, on the GPU

Help with possible pts dependent bug #144

Closed: garrettwrong closed this issue 1 year ago

garrettwrong commented 2 years ago

Hi, I've been debugging an issue I initially found in ASPIRE while running our test suite via cufinufft. (We're trying to revisit our GPU efforts and get them running continuously in CI...)

I was able to reduce it to a problem that only requires two calls to cufinufft using one of our Fourier point sets. Using a different set of points (commented out below) does not seem to induce the behavior, as far as I can tell.

I am currently trying to debug it, but thought it might be worth floating here as well while I do that...

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from cufinufft import cufinufft

# ====== Initial Example Data Load

# Our Fourier Points
freq = np.load('freq.npy')

# # Dummy grid, seems to work
# freq = np.mgrid[-np.pi:np.pi:16j, -np.pi:np.pi:8j].astype(np.float32).reshape(2,-1)

# mod doesn't seem to matter, so commenting out
# print('min max freqs:', np.min(freq), np.max(freq), freq)
# pts = np.mod(freq + np.pi, 2 * np.pi) - np.pi
# print('min max pts:',np.min(pts), np.max(pts), pts)
pts = freq

# Signal
signal = np.ones((1,128), dtype=np.complex64)

# Sizes
sz = (8, 8)
ntransforms = signal.shape[0]
res_shape = (ntransforms, *sz)

# ===== Make plans and arrays

# Plan details
eps = 1e-7

plan32 = cufinufft(1, sz, ntransforms, eps, 1, dtype=np.float32)
plan64 = cufinufft(1, sz, ntransforms, eps, 1, dtype=np.float64)

# Result Device Arrays
#_result_gpu32 = gpuarray.GPUArray(res_shape, dtype=np.complex64)
#_result_gpu64 = gpuarray.GPUArray(res_shape, dtype=np.complex128)
# Just in case... initialize
_result_gpu32 = gpuarray.to_gpu(np.zeros(res_shape, dtype=np.complex64))
_result_gpu64 = gpuarray.to_gpu(np.zeros(res_shape, dtype=np.complex128))

# HtoD
_pts_gpu32 = gpuarray.to_gpu(pts)
_pts_gpu64 = gpuarray.to_gpu(pts.astype(np.float64))

# Set the points
plan32.set_pts(*_pts_gpu32)
plan64.set_pts(*_pts_gpu64)

# ===== Execution

_signal_gpu32 = gpuarray.to_gpu(signal)
_signal_gpu64 = gpuarray.to_gpu(signal.astype(np.complex128))

#input("<enter>")

# the second execute will be junk
plan32.execute(_signal_gpu32, _result_gpu32)
plan64.execute(_signal_gpu64, _result_gpu64)

# DtoH
_result32 = _result_gpu32.get()
_result64 = _result_gpu64.get()

print(abs(_result32))
print(abs(_result64))   # will be junk....

I've attached a tarball with the freq array to this post.

Right now I'm going to inspect the points right before the exec stage. I'm considering whether some state or variables might be scoped incorrectly; I haven't found any yet.

Let me know if you have any ideas. Thanks!

issue_repro.tgz
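While digging, a cheap cross-check is a brute-force type-1 sum on the CPU; for small sizes like this it's exact enough to spot junk immediately. A sketch (with random stand-in points, since the `freq.npy` array isn't embedded in this post; `nudft2d_type1` is a hypothetical helper, and the mode ordering may need a transpose to match cufinufft's layout):

```python
import numpy as np

def nudft2d_type1(x, y, c, n1, n2):
    """Brute-force 2D type-1 NUDFT with isign=+1:
    f[k1, k2] = sum_j c[j] * exp(i * (k1*x[j] + k2*y[j]))."""
    k1 = np.arange(-(n1 // 2), (n1 - 1) // 2 + 1)
    k2 = np.arange(-(n2 // 2), (n2 - 1) // 2 + 1)
    # Broadcast to shape (n1, n2, M), then sum over the points axis.
    phase = np.exp(1j * (k1[:, None, None] * x + k2[None, :, None] * y))
    return (phase * c).sum(axis=-1)

# Stand-in points; the real check would use the attached freq array.
rng = np.random.default_rng(0)
x, y = rng.uniform(-np.pi, np.pi, (2, 128))
c = np.ones(128, dtype=np.complex64)
fk = nudft2d_type1(x, y, c, 8, 8)

# Sanity check: the zero mode (k1 = k2 = 0, index n//2 in each axis)
# must equal sum(c) = 128 regardless of the point set.
assert abs(fk[4, 4] - 128) < 1e-9
```

Comparing this array elementwise against both the float32 and float64 GPU results should make it obvious which one is junk.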

ahbarnett commented 2 years ago

The first thing to look for would be global state in cuFFT. We had an issue like this with FFTW in FINUFFT (not involving 32- vs 64-bit types, just different threads). Maybe @MelodyShih @janden @JBlaschke have ideas.

PS @blackwer has started to integrate cufinufft into finufft, first at the repo level, but we plan to integrate as much code as possible, for long-term maintainability. There will be many decisions there, so I hope we can ping you about them or have a meeting.

Good to see you the other day. Best, Alex

garrettwrong commented 2 years ago

That makes sense. I did check that the plans have distinct cufft plans, but nothing further than that. I'll inspect some more there. The fact that it only fails for certain point sets is causing me some cognitive dissonance.

Yes that integration sounds great; big project. Was a pleasure, thanks!

garrettwrong commented 2 years ago

I think I've narrowed the corruption down to something occurring around Step 3 of the second 2d1 execute; after returning from CUDECONVOLVE2D I see the junk... Continuing...

garrettwrong commented 2 years ago

Getting back into debugging this. I'm seeing some strange values for fw, but haven't been able to track it down yet. I am able to tickle this in single precision alone (without mixing data types or having multiple cufinufft plans). Using an A100 with the CUDA 11.7 driver and toolkit.

blackwer commented 2 years ago

Thanks Garrett! Let me know if there's anything I can do to help. There have been so many changes that it's hard to track this stuff down, so that kind of debugging should fall on me somewhat. I'd like to understand it better anyway.

Edit: lol, I thought this was on my PR. I'm still going to check it out.


garrettwrong commented 2 years ago

~Hi! Today I created a CUDA-only case using the point set from the earlier Python script, hoping to totally remove Python from the equation and make better use of my debugger time. It's basically just the shipped 2D API example, changed to use my points and problem size. (Attached standalone CUDA cufinufft example.)~

~It seems I can get bad behavior in singles all by itself for 2D pretty reliably with this. Doubles don't seem to have any trouble that I've seen. The test (of a single value) that ships with the code seems to succeed, but collectively the values under singles look unwell for my test case. Example below. Maybe if you have a chance you could confirm whether you can reproduce?~

~The non-deterministic behavior is a little concerning to me. I haven't been able to reproduce it under cuda-memcheck yet. (Interesting.)~

garrettwrong commented 2 years ago

There was an error in the points and printing for my last tarball, so please disregard. Will try to make another standalone test. Sorry about that.

garrettwrong commented 2 years ago

It's easier to just stash work in a branch on my fork. Hopefully I'm making fewer mistakes now. I probably should have gone straight to CUDA earlier, when I was fresher, but I was initially inclined to think it was a bug in ASPIRE.

https://github.com/flatironinstitute/cufinufft/compare/master...garrettwrong:cufinufft:bug_example

ahbarnett commented 1 year ago

Hi Garrett, can I close this issue, or is there still a bug, IYHO? :) Best, Alex

garrettwrong commented 1 year ago

Hrmm. I don't believe the bug was ever addressed directly. I haven't tried to reproduce it with the recent (final) version. I will try to reproduce it tomorrow and update the issue :).

garrettwrong commented 1 year ago

Hi again. I am still able to reproduce this issue's example using the v1.3 code built on our machine.

Linux caf.math.princeton.edu.private 3.10.0-1160.92.1.el7.x86_64 #1 SMP Tue Jun 6 10:04:24 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

rh/devtoolset/8

Cuda compilation tools, release 11.7, V11.7.64 Build cuda_11.7.r11.7/compiler.31294372_0


(base) ➜  cuf_13_bug_repro_144 git clone -b bug_example git@github.com:garrettwrong/cufinufft.git    
Cloning into 'cufinufft'...
Warning: the ECDSA host key for 'github.com' differs from the key for the IP address '140.82.114.3'
Offending key for IP in /u/gbwright/.ssh/known_hosts:11
Matching host key in /u/gbwright/.ssh/known_hosts:21
Are you sure you want to continue connecting (yes/no)? yes
remote: Enumerating objects: 4970, done.
remote: Counting objects: 100% (1284/1284), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 4970 (delta 1177), reused 1146 (delta 1136), pack-reused 3686
Receiving objects: 100% (4970/4970), 1.72 MiB | 0 bytes/s, done.
Resolving deltas: 100% (3579/3579), done.
(base) ➜  cuf_13_bug_repro_144 cd cufinufft 
(base) ➜  cufinufft git:(bug_example) git remote add upstream git@github.com:flatironinstitute/cufinufft.git
(base) ➜  cufinufft git:(bug_example) git fetch upstream

# <snip>

From github.com:flatironinstitute/cufinufft
 * [new tag]         v1.3       -> v1.3
(base) ➜  cufinufft git:(bug_example) git rebase v1.3      
First, rewinding head to replay your work on top of it...
Applying: Add site for tigergpu at Princeton
Applying: Update README with Princeton sites
Applying: add debug example cases
Applying: Add PACM site
Applying: debug symbols seems to tickle much more often...
(base) ➜  cufinufft git:(bug_example) module load  cudatoolkit/11.7 rh/devtoolset/8
(base) ➜  cufinufft git:(bug_example) make all site=PACM -j

<snip>

(base) ➜  cufinufft git:(bug_example) bin/example_1_32 
[gpu ] one targ: rel err in c[64] is 9.3e+04
fk[0:8]: -16466.515625, -6339.429688, 2939.932373, -18116.044922, -37109.199219, -18116.048828, 2939.932617, -6339.421387, -5856.102539, -11939.065430, 
(base) ➜  cufinufft git:(bug_example) bin/example_1_64
[gpu ] one targ: rel err in c[64] is 1.71e+03
fk[0:8]: 14.591308, 19.666844, 23.890922, 21.967780, 20.733888, 21.967777, 23.890929, 19.666834, 19.666833, 22.993997, 
(base) ➜  cufinufft git:(bug_example)    
blackwer commented 1 year ago

Hi @garrettwrong. Significant changes have been made across the board in the current version of this code in the finufft repo. I'll try reproducing it there instead and link it in this issue if it's reproducible. If you'd rather do the lifting, let me know. I'll probably start working on it in 10 minutes or so :)

blackwer commented 1 year ago

@garrettwrong I'm unable to reproduce this on the current master branch of finufft (https://github.com/flatironinstitute/finufft/commit/710f6b602634b9ffa96b1e919784d0bf5bd90437). GCC 11.4.0, cuda toolkit 11.8.0, on A6000 with: Driver Version: 535.104.05 and CUDA Version: 12.2. Can you try to reproduce this in your environment? CLI recipe and source are below. Make sure to update the cuda architecture in the cmake command.

Edit: the double and the float now give the same results, though the relative error is still very high. I'm not sure whether the claim is that this is part of the issue.

#include <iostream>
#include <iomanip>
#include <math.h>
#include <helper_cuda.h>
#include <complex>

#include <cufinufft.h>

template <typename T>
T infnorm(int n, std::complex<T> *a) {
    T nrm = 0.0;
    for (int m = 0; m < n; ++m) {
        T aa = real(conj(a[m]) * a[m]);
        if (aa > nrm)
            nrm = aa;
    }
    return sqrt(nrm);
}

using namespace std;

/* The following represent a point set saved from ASPIRE. */
float freqs1[128] = {
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01216906, 0.062310982,
0.14539924, 0.25023413, 0.3626602, 0.46749508, 0.55058336, 0.6007253, 0.023870474, 0.12222739, 0.28521088, 0.49085197,
0.7113836, 0.9170247, 1.0800081, 1.178365, 0.034654554, 0.17744668, 0.41406205, 0.71260655, 1.0327688, 1.3313134,
1.5679288, 1.7107208, 0.04410688, 0.22584677, 0.52700096, 0.9069761, 1.3144654, 1.6944405, 1.9955949, 2.1773348,
0.051864203, 0.26556772, 0.6196876, 1.0664911, 1.5456479, 1.9924514, 2.3465714, 2.5602748, 0.05762841, 0.29508302,
0.68855995, 1.1850214, 1.7174318, 2.2138932, 2.6073701, 2.8448248, 0.061178003, 0.31325847, 0.7309714, 1.258012,
1.823216, 2.3502564, 2.7679696, 3.02005, 0.062376548, 0.31939557, 0.74529195, 1.2826579, 1.8589348, 2.3963008,
2.8221972, 3.0792162, 0.061177995, 0.31325844, 0.73097134, 1.2580119, 1.8232158, 2.3502564, 2.7679694, 3.0200498,
0.05762841, 0.29508302, 0.68855995, 1.1850214, 1.7174318, 2.2138932, 2.6073701, 2.8448248, 0.0518642, 0.2655677,
0.61968756, 1.0664909, 1.5456476, 1.9924511, 2.346571, 2.5602746, 0.04410688, 0.22584677, 0.52700096, 0.9069761,
1.3144654, 1.6944405, 1.9955949, 2.1773348, 0.03465455, 0.17744665, 0.414062, 0.7126065, 1.0327687, 1.3313133,
1.5679287, 1.7107207, 0.02387046, 0.12222735, 0.2852108, 0.49085173, 0.7113832, 0.9170242, 1.0800077, 1.1783645,
0.012169059, 0.06231098, 0.14539923, 0.2502341, 0.36266017, 0.46749502, 0.55058336, 0.60072523
};

float freqs2[128] = {
0.062376548, 0.31939557, 0.74529195, 1.2826579, 1.8589348, 2.3963008, 2.8221972, 3.0792162, 0.061177995, 0.31325844,
0.73097134, 1.2580119, 1.8232158, 2.3502564, 2.7679694, 3.0200498, 0.05762841, 0.29508302, 0.68855995, 1.1850214,
1.7174318, 2.2138932, 2.6073701, 2.8448248, 0.051864203, 0.2655677, 0.61968756, 1.066491, 1.5456477, 1.9924512,
2.3465712, 2.5602746, 0.04410688, 0.22584677, 0.52700096, 0.9069761, 1.3144654, 1.6944405, 1.9955949, 2.1773348,
0.03465455, 0.17744665, 0.414062, 0.7126065, 1.0327687, 1.3313133, 1.5679287, 1.7107207, 0.023870474, 0.12222739,
0.28521088, 0.49085197, 0.7113836, 0.9170247, 1.0800081, 1.178365, 0.012169055, 0.062310956, 0.14539917, 0.250234,
0.36266002, 0.46749485, 0.5505831, 0.600725, -2.7265654e-09, -1.3961223e-08, -3.2577745e-08, -5.6066757e-08, -8.1256616e-08, -1.04745624e-07,
-1.2336216e-07, -1.3459682e-07, -0.01216906, -0.062310982, -0.14539924, -0.25023413, -0.3626602, -0.46749508, -0.55058336, -0.6007253,
-0.023870474, -0.12222741, -0.2852109, -0.490852, -0.71138364, -0.9170247, -1.0800083, -1.1783651, -0.03465456, -0.17744671,
-0.41406208, -0.7126067, -1.0327691, -1.3313136, -1.5679291, -1.7107211, -0.04410688, -0.22584677, -0.52700096, -0.9069761,
-1.3144654, -1.6944405, -1.9955949, -2.1773348, -0.051864203, -0.26556772, -0.6196876, -1.0664911, -1.5456479, -1.9924514,
-2.3465714, -2.5602748, -0.05762842, -0.29508305, -0.68856, -1.1850214, -1.717432, -2.2138934, -2.6073704, -2.844825,
-0.061178003, -0.31325847, -0.7309714, -1.258012, -1.823216, -2.3502564, -2.7679696, -3.02005
};

int main(int argc, char* argv[])
{
  int N1 = 8;
  int N2 = 8;
  int M = 128;

  double tol=1e-7;

  int iflag=1;

  cout<<scientific<<setprecision(3);
  int ier;

  // malloc host arrays
  float *x, *y;
  complex<float> *c, *fk;
  checkCudaErrors(cudaMallocHost(&x, M*sizeof(float)));
  checkCudaErrors(cudaMallocHost(&y, M*sizeof(float)));
  checkCudaErrors(cudaMallocHost(&c, M*sizeof(complex<float>)));
  checkCudaErrors(cudaMallocHost(&fk, N1*N2*sizeof(complex<float>)));

  // malloc device arrays
  float *d_x, *d_y;
  cuFloatComplex *d_c, *d_fk;
  checkCudaErrors(cudaMalloc(&d_x, M*sizeof(float)));
  checkCudaErrors(cudaMalloc(&d_y, M*sizeof(float)));
  checkCudaErrors(cudaMalloc(&d_c, M*sizeof(cuFloatComplex)));
  checkCudaErrors(cudaMalloc(&d_fk, N1*N2*sizeof(cuFloatComplex)));

  // Making data
  for (int i = 0; i < M; i++) {
    x[i] = freqs1[i];
    y[i] = freqs2[i];
  }
  for(int i=0; i<N1*N2; i++){
    fk[i].real(0);
    fk[i].imag(0);
  }
  for(int i=0; i<M; i++){
    c[i].real(1);
    c[i].imag(0);
  }

  // Copy data to device memory, real users might just populate in memory.
  checkCudaErrors(cudaMemcpy(d_x, x, M*sizeof(float),cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(d_y, y, M*sizeof(float),cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(d_fk, fk, N1*N2*sizeof(complex<float>), cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(d_c, c, M*sizeof(complex<float>), cudaMemcpyHostToDevice));

  // construct plan
  cufinufftf_plan dplan;
  int dim = 2;
  int type = 1;

  int64_t nmodes[3];
  int ntransf = 1;
  nmodes[0] = N1;
  nmodes[1] = N2;
  nmodes[2] = 1;

  // Make Plan
  ier = cufinufftf_makeplan(type, dim, nmodes, iflag, ntransf, tol, &dplan, NULL);

  if (ier!=0){
    printf("err: makeplan\n");
    return ier;
  }

  // Set Non uniform points
  ier=cufinufftf_setpts(dplan, M, d_x, d_y, NULL, 0, NULL, NULL, NULL);
  if (ier!=0){
    printf("err: setpts\n");
    return ier;
  }

  // Execute the plan on the data
  ier=cufinufftf_execute(dplan, d_c, d_fk);

  if (ier!=0){
    printf("err: exec\n");
    return ier;
  }

  // Destroy the plan when done processing
  ier=cufinufftf_destroy(dplan);
  if (ier!=0){
    printf("err: destroy\n");
    return ier;
  }

  // Copy test data back to host and compare
  checkCudaErrors(cudaMemcpy(fk, d_fk, N1*N2*sizeof(complex<float>), cudaMemcpyDeviceToHost));

  complex<float> J = complex<float>(0,1)*(float)iflag;
  complex<float> ct = complex<float>(0,0);
  int m=0;
  int jt = M/2;          // check arbitrary choice of one targ pt
  for (int m2=-(N2/2); m2<=(N2-1)/2; ++m2)  // loop in correct order over F
    for (int m1=-(N1/2); m1<=(N1-1)/2; ++m1)
      ct += fk[m++] * exp(J*(m1*x[jt] + m2*y[jt]));   // crude direct
  printf("[gpu ] one targ: rel err in c[%d] is %.3g\n", jt, abs(c[jt]-ct)/infnorm(M,c));

  /* Look at some other data */
  int cm1;
  printf("fk[0:%d]: ", N1);
  for(cm1=0; cm1<10; cm1++){
    printf("%f, ", (double)fk[cm1].real());
  };
  printf("\n");

  // Cleanup
  checkCudaErrors(cudaFreeHost(x));
  checkCudaErrors(cudaFreeHost(y));
  checkCudaErrors(cudaFreeHost(c));
  checkCudaErrors(cudaFreeHost(fk));
  checkCudaErrors(cudaFree(d_x));
  checkCudaErrors(cudaFree(d_y));
  checkCudaErrors(cudaFree(d_c));
  checkCudaErrors(cudaFree(d_fk));

  return 0;
}
% mkdir -p build && cd build
% cmake .. -DFINUFFT_USE_CUDA=on -DCMAKE_CUDA_ARCHITECTURES="86" -DCMAKE_BUILD_TYPE=relwithdebinfo
% make -j
% nvcc float_bug.cu -I../include -I../include/cufinufft/contrib/cuda_samples libcufinufft_static.a -lcuda -lcufft
% ./a.out
setup_spreader: warning, increasing tol=1e-07 to eps_mach=1.19e-07.
[gpu ] one targ: rel err in c[64] is 1.71e+03
fk[0:8]: 14.591318, 19.666874, 23.890945, 21.967796, 20.733913, 21.967796, 23.890951, 19.666851, 19.666849, 22.994013,
garrettwrong commented 1 year ago

Hi, using the finufft master code does seem to resolve the issue on our platform as well.

I'm not worried about the error. The issue was the junk results in floats (oftentimes yielding NaNs as well). Right now we have to cast everything to doubles to run with cufinufft. It will be nice to resolve that when finufft is released and we update things.
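For anyone else hitting this on v1.3, our interim workaround is just to upcast everything before planning; a rough sketch (array names mirror the repro script at the top of the thread, and the plan call is shown commented since it needs a GPU):

```python
import numpy as np

# Single-precision inputs as they arrive from the application.
pts = np.random.uniform(-np.pi, np.pi, (2, 128)).astype(np.float32)
signal = np.ones((1, 128), dtype=np.complex64)

# Upcast points and strengths, then plan with dtype=np.float64.
pts64 = np.ascontiguousarray(pts, dtype=np.float64)
signal128 = np.ascontiguousarray(signal, dtype=np.complex128)

# plan = cufinufft(1, (8, 8), 1, 1e-7, 1, dtype=np.float64)  # as in the repro
# plan.set_pts(*gpuarray.to_gpu(pts64))
# plan.execute(gpuarray.to_gpu(signal128), result_gpu64)
# result32 = result_gpu64.get().astype(np.complex64)  # downcast at the end
```

This roughly doubles memory traffic, but the double path has been reliable for us.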

@ahbarnett , it seems this can be resolved as fixed upstream in finufft, thanks for following up. Closing.