ROCm / rocFFT

Next generation FFT implementation for ROCm
https://rocm.docs.amd.com/projects/rocFFT/en/latest/
Other
175 stars 84 forks source link

[Bug]: incorrect results in 3d complex-complex inplace FFT #464

Closed t1nux closed 8 months ago

t1nux commented 8 months ago

Problem Description

Hello everyone

First of all, I am using Arch Linux, and I know it is not an officially supported distro. But I cannot just take down one of the 3 Arch system I tested this with. I have example code that can easily be executed, so please, do not simply discard this for that reason.

I am working with a potentially large complex 3D array (time t, space y, space x), and I need to apply different operations to it. These operations are mostly applied in

As is turns out, the 3D FFT, but also a 2D+1D FFT, does not produce the correct result for all sizes of the complex 3D array.

Example code to demo the bug can be found here.

I have already posted this on community.amd.com, but there was no answer. There is more info though.

Next to the specified system, I also tried this on 2 other systems with a Radeon VII Pro and an RX 6900 XT, respectively. The bug was also reproducible there.

Please let me know if there is anything else you need. Thanks!

Operating System

Arch Linux

CPU

AMD Ryzen 5 3600X 6-Core Processor

GPU

AMD Radeon Pro VII, AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

rocFFT

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 3600X 6-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 3600X 6-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3800
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65764048(0x3eb7ad0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65764048(0x3eb7ad0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65764048(0x3eb7ad0) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-d0d26da8a5f0688c
  Marketing Name:          AMD Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2482
  BDFID:                   12032
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 550
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response

evetsso commented 8 months ago

Hi @t1nux , thanks for the bug report. Thanks especially for providing a standalone test program, which is very helpful.

Your test procedure appears to be going through a plotting step and comparing the plots for accuracy. But are you able to instead compare data directly in your reproducer? That would help confirm that the problem is indeed in the rocFFT output as opposed to at some later step.

I've modified your reproducer to compare the z3d and z21d output directly: https://github.com/evetsso/roc_fft_bug/commit/be19acf2afd6666a0beea72d17b17b0fa8dbfe56

This compares the two results in a similar way to to what our rocfft-test accuracy test does when we check against FFTW as a reference implementation. I'm computing and printing the L2 and L-infinity norms of the difference between the two buffers. I've also removed the rocThrust dependency for simplicity, since this small reproducer does not really need it.

My observed results:

l2 difference: 6.952651e-12
l-inf difference: 8.747490e-13

Which I think is a tolerable difference - the two FFTs are doing operations in a different order so we wouldn't expect exactly the same results between the two.

Can you confirm that you observe similar results when directly comparing the data in the reproducer?

t1nux commented 8 months ago

Hi @evetsso, thank you for your reply and for the direct comparison code.

I should have been more clear when things actually fail, it was only written in the .ipynb notebook. Sorry for that! I will quickly mention a case that will show an issue also when I run your code. If you change int Nt = 1<<4; to int Nt = 1<<8; in line 55, it is a case where the 2D+1D FFT gets the correct result, but the 3D FFT does not. For that case, your code gives me:

268435456 values
4 GB
l2 difference: 8.384832e+01
l-inf difference: 1.597270e+00

Also, here is a quick list of what happens for rocfft in comparison to NumnPy's 3D FFT:

# not working for 3d FFT
Nt, Ny, Nx = 2**4, 2**9, 2**13
Nt, Ny, Nx = 2**4, 2**13, 2**9
Nt, Ny, Nx = 2**8, 2**8, 2**12

# not working for both 3d and 2D+1D
Nt, Ny, Nx = 2**8, 2**10, 2**10
Nt, Ny, Nx = 2**8, 2**12, 2**8

# working
Nt, Ny, Nx = 2**4, 2**9, 2**9
Nt, Ny, Nx = 2**4, 2**9, 2**10
Nt, Ny, Nx = 2**4, 2**9, 2**11
Nt, Ny, Nx = 2**4, 2**9, 2**12
Nt, Ny, Nx = 2**4, 2**12, 2**9
Nt, Ny, Nx = 2**4, 2**10, 2**10
Nt, Ny, Nx = 2**8, 2**9, 2**9
Nt, Ny, Nx = 2**8, 2**9, 2**10
Nt, Ny, Nx = 2**8, 2**8, 2**11
Nt, Ny, Nx = 2**4, 2**12, 2**8

I do not think this list is exhaustive. Also, in case of both not working, we have to compare with FFTW, NumPy, or something else. That's what I will do now, write something that directly compares with FFTW. I should have done this in the first place, but I'm really quite slow in C, so don't hold your breath. I'll be back...

EDIT: Corrected typos.

t1nux commented 8 months ago

OK, as promised, here is the direct comparison with FFTW. It should only be necessary to change Nt, Ny, and Nx to observe the issue.

Here are some examples

Please also look at the table in my previous post with more failure examples. I am sure there are different cases, too.

@evetsso Thanks for the code to compare the arrays.

EDIT: I tested those 3 cases above on the systems with the RX 6900 XT and the Radeon VII Pro, with the same behavior. I don't have access to the system with the RX 7900 XTX from home, but I'll test tomorrow.

EDIT2: Same on the RX 7900 XTX.

t1nux commented 8 months ago

For the case Nt = 2^8, Ny = 2^12, Nx = 2^8, I've added a plot to illustrate what is happening qualitatively.

What you see in the image below is the following:

Since I'm starting with a 3D step function, I should get a 3D sinc (sin(x)/x) in the FFT, and the images should show 2D sinc functions. This seems correct in the case of FFTW (bottom row). Because of the large Ny, the left and right images in the bottom row look weird, but they're fine when zooming in. image

evetsso commented 8 months ago

Ok, I see what's going on. When you're computing the maximal work buffer size for all of the plans, you're casting the inputs to int. But the number of bytes required is > 4 GiB, and the value is truncated before it reaches std::max.

Redoing the size computation like this fixes the problem:

    rocfft_plan_get_work_buffer_size(plan3d3d_ip_f, &work_buffer_size);
    rocfft_buffer_size = std::max<size_t>(rocfft_buffer_size, work_buffer_size);
    rocfft_plan_get_work_buffer_size(plan3d3d_ip_b, &work_buffer_size);
    rocfft_buffer_size = std::max<size_t>(rocfft_buffer_size, work_buffer_size);
    rocfft_plan_get_work_buffer_size(plan3d2d_ip_f, &work_buffer_size);
    rocfft_buffer_size = std::max<size_t>(rocfft_buffer_size, work_buffer_size);
    rocfft_plan_get_work_buffer_size(plan3d2d_ip_b, &work_buffer_size);
    rocfft_buffer_size = std::max<size_t>(rocfft_buffer_size, work_buffer_size);
    rocfft_plan_get_work_buffer_size(plan3d1d_ip_f, &work_buffer_size);
    rocfft_buffer_size = std::max<size_t>(rocfft_buffer_size, work_buffer_size);
    rocfft_plan_get_work_buffer_size(plan3d1d_ip_b, &work_buffer_size);
    rocfft_buffer_size = std::max<size_t>(rocfft_buffer_size, work_buffer_size);

Note that rocfft_execute is actually failing in your example, but your example code is not checking it.

I believe this resolves your problem, so I'm closing this issue. Please feel free to comment and/or reopen if you need anything else.

t1nux commented 8 months ago

@evetsso Thank you so much! This indeed fixes the issue, and I see why this is failing now. Thank you for your explanations.

So, next to your changed code for rocfft_buffer_size, I added

#define ROCFFT_ASSERT(x) (assert((x) == rocfft_status_success))

and I'm using it whenever I call rocfft_execute() like this:

ROCFFT_ASSERT(rocfft_execute(plan3d3d_ip_f, (void**) &z_d, nullptr, rocfft_info));

My apologies for posting this as a bug while it was actually an error in my code!