DTolm / VkFFT

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library
MIT License

Metal backend #91

Open philipturner opened 2 years ago

philipturner commented 2 years ago

I've been working on making GROMACS faster on the Apple GPU, and I recently came across a need to work with VkFFT. The software package was using clFFT, which couldn't compile on Apple platforms, so certain work couldn't be offloaded to the GPU. I replaced clFFT with VkFFT and your software works great - very easy to integrate and robust. I tried tuning the performance settings, but they make no difference for my use case (3D FFT, R2C/C2R).

While profiling GROMACS, I've seen very dismal results for GPU acceleration. On small simulations (~10-100k atoms), offloading most of the work to the GPU takes as long as the CPU, or is sometimes slower. I only saw a clear advantage at ~2 million atoms, where the GPU was twice as fast. I suspect this is because of (a) the CPU-side overhead of rapidly communicating with the GPU and (b) the fact that Apple's OpenCL doesn't support simdgroup reductions. In theory, the GPU should be 10-100x faster than the CPU.

To fix this, I've been researching a Metal backend for hipSYCL, which transpiles SYCL C++ code into Apple AIR but would use Metal for API calls instead of OpenCL. This may decrease CPU-side overhead by an order of magnitude. On Apple platforms, VkFFT only supports OpenCL, so marshalling data between Metal and OpenCL may cause a severe bottleneck.

Would you accept a contribution where I make a Metal backend for VkFFT? Faster GPU-side performance isn't why I'm doing this, but I'm still curious: would utilizing subgroup/simdgroup reductions (e.g. simd_sum, WaveActiveSum) measurably boost performance?

DTolm commented 2 years ago

Hello,

I actually got an M1 MacBook last week, shortly before this issue, and have already ported VkFFT to Metal. There are still some things not implemented or not working, but besides binary caching, all the main functionality is working, performant, and tested (on the implemented tests).

I will make some big Metal/OpenCL benchmark runs soon.

As for subgroup/simdgroup reductions: FFT algorithms do not use reductions, but for small FFTs subgroups can be used to exchange data between threads slightly more efficiently than shared (threadgroup) memory. Threadgroup memory bandwidth turned out to be the main bottleneck of M1 SoC, so this may help (but probably not much, as other vendors' simd operations still go through shared memory).

If you find some bugs/have some improvements - feel free to submit them!

Best regards, Dmitrii

philipturner commented 2 years ago

No way! Why did you choose to go with Metal? I can definitely help you out with any issues you face.

Also, a possible solution for double precision: https://github.com/philipturner/metal-float64. I'm not sure whether I will ever complete that library though.

I did not implement MTLLibrary binary caching, as the only way I found to get the binaries was to save them to disk with a URL from a dynamic library. I would rather not perform any disk writes inside the library, but I guess this is the only API available from Apple.

I did some extensive profiling once, and there is a Metal system cache that maps shader source code to compiled binaries. When you compile a shader from source, Metal often serves it from that cache, so you usually don't need to compile binaries yourself. I tried using an MTLBinaryArchive but found it mostly unhelpful. To generate an actual binary .metallib, you have to use the Metal command-line tools, which could enable an optimization on macOS that isn't available on iOS.

However, as long as you generate the exact same source character-for-character, you can get good performance just "compiling it from source" every time. I found that shaders took very little time to load from the system Metal shader cache, though I can't recall the exact latencies (hundreds of microseconds to a few milliseconds).
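For illustration, here is a minimal metal-cpp sketch of that "compile from source every time" approach (an assumption-level example, not VkFFT's actual code; it relies on the OS shader cache described above):

#include <Metal/Metal.hpp>

// Minimal sketch: compile a kernel from a source string with metal-cpp. If the
// source is byte-for-byte identical to a previous run, macOS typically serves
// the binary from its system shader cache, so this returns much faster than a
// full compile.
MTL::Library* compileFromSource(MTL::Device* device, const char* src) {
    NS::Error* error = nullptr;
    MTL::CompileOptions* options = MTL::CompileOptions::alloc()->init();
    NS::String* source = NS::String::string(src, NS::UTF8StringEncoding);
    MTL::Library* library = device->newLibrary(source, options, &error);
    options->release();
    if (library == nullptr && error != nullptr) {
        // error->localizedDescription()->utf8String() holds the compiler log.
        return nullptr;
    }
    return library;
}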

There is no way to properly get the max number of threads in code before generating the ComputePipelineState. So while Metal can allow up to 1024 threads, I artificially limit it to 256 (like in OpenCL), as there are kernels in VkFFT that actually get maxTotalThreadsPerThreadgroup reported as lower than 1024.

I can show you the full specification of how threadgroup size scales with register usage; I've basically reverse-engineered the entire Apple GPU architecture. This might help with tuning: https://gitlab.com/gromacs/gromacs/-/issues/4615.

TL;DR - from https://rosenzweig.io/blog/asahi-gpu-part-3.html: Number of threads should theoretically be constant for the same compute kernel, except when you pre-allocate different amounts of threadgroup memory not statically declared in the shader. If variable threadgroup memory is causing issues, I can reverse-engineer the M1 GPU's algorithm for changing threadgroup size based on threadgroup memory.

| (16-bit) Registers | Threads |
|---|---|
| <= 104 | 1024 |
| 112 | 896 |
| 120, 128 | 832 |
| 136 | 768 |
| 144 | 704 |
| 152, 160 | 640 |
| 168-184 | 576 |
| 192-208 | 512 |
| 216-232 | 448 |
| 240-256 | 384 |

In general, Rosenzweig provided a lot of helpful insight into the A14/M1 GPU architecture. My own experimentation built on her work, to the point that I've almost fully reverse-engineered the entire chip.

philipturner commented 2 years ago

Threadgroup memory bandwidth turned out to be the main bottleneck of M1 SoC, so this may help

This must be why the A13 and earlier performed terribly with matrix multiplications for machine learning (25% ALU utilization). Apple introduced a dedicated hardware instruction (simdgroup_matrix) to coordinate communication inside a simdgroup using threadgroup memory, boosting ALU utilization to 80% and making the M1 usable for TensorFlow acceleration.

For reference, AMD reaches 50% utilization without a specialized hardware instruction. This suggests threadgroup memory on M1 is half as fast as on other GPUs.

philipturner commented 2 years ago

One thing I'm looking for is the ability to take an MTLComputeCommandEncoder in the launch params and encode VkFFT into that encoder. This would help with an optimization I'm considering for GROMACS, which uses .concurrent Metal command encoders. You can accomplish this by swapping the blit pass for a compute shader that writes 16-byte chunks of memory per GPU thread: https://github.com/philipturner/metal-experiment-1/blob/fdcf8e6fa9ce722c2e10e1bdc214606d12ebe8f5/Sources/MetalExperiment1/Shaders/elementwise_f32_i32.metal#L11-L19

DTolm commented 2 years ago

Why did you choose to go with Metal?

There were only three APIs people use that weren't implemented: DirectX, OpenGL, and Metal. More people asked for Metal, and there was a good deal on an M1 Pro machine at a local store.

Metal system cache maps shader source code files to binaries.

I figured that out by measuring the timings of the code. Still, it would be nice to have a standardized solution, as with the other APIs. Maybe one day Apple will add an option to extract a binary from an MTLLibrary into a string.

I can show you the full specification of how threadgroup size scales with register usage

This can be useful, but I do not expect significant gains (probably ~10-20% at most, and only for rare systems). VkFFT can scale the code to any number of available threads, which is why the register count will differ depending on how many threads are available. It was just an unexpected discovery I made when I tried to get the max number of threads by compiling an empty kernel and using that value.

A13 and earlier performed terribly with matrix multiplications

Shared memory takes a lot of energy, which is why it is probably not as fast as on discrete GPUs. I get almost linear performance scaling between algorithms: regular Stockham, Rader FFT, and Bluestein have roughly 1:2:4 shared memory transfers, and that is what I see in the achieved timings. This is also why the multiplicative Rader algorithm, which performs the convolution by direct multiplication with a kernel stored in threadgroup memory, works really badly - it reaches almost 100% shared memory utilization even on Nvidia GPUs, which have the fastest shared memory.

threadgroup memory on M1 is half as fast as on other GPUs

Probably even less. How does simdgroup shuffling work on Apple hardware - does it still use threadgroup memory like other vendors?

One thing I'm looking for is the ability to take a MTLComputeCommandEncoder as launch params, and encode VkFFT into that encoder

This is exactly how VkFFT works now.

philipturner commented 2 years ago

This is exactly how VkFFT works now.

I was just skimming through your code and found this: https://github.com/DTolm/VkFFT/commit/ba7001c30cd458d2196c0bdabf5caa52681016fc#r86062493. But now I see: https://github.com/DTolm/VkFFT/commit/ba7001c30cd458d2196c0bdabf5caa52681016fc#r86062979.

Probably even less. How does simdgroup shuffling work on Apple hardware - does it still use threadgroup memory like other vendors?

If other vendors use threadgroup memory for it, then I don't know whether Apple does. However, I will say that it's the fastest way to communicate between threads. The Apple7 (A14/M1) GPU architecture was the first generation where any Apple GPU could perform SIMD-scoped reductions. It was also the generation that added simdgroup_matrix, so I suspect they made both hardware instructions incredibly fast. One generation prior (Apple6), they first added SIMD-scoped permute. Have you read the Metal feature set tables?

philipturner commented 2 years ago

Shared memory takes a lot of energy, which is why it is probably not as fast as on discrete GPUs.

If you've read the GitLab thread I linked, you'll see Apple went to extremes for power efficiency on this GPU. They sacrificed maximum FLOPS by allocating roughly 4x as many transistors to the register file as other vendors, which lets them completely saturate ALU utilization. Could you elaborate a bit more on how power-hungry the threadgroup/shared memory is? If possible, on the transistor/hardware level.

More people asked for Metal

Did anyone else besides me ask for Metal support? How in-demand was it?

there was a good deal on an M1 Pro machine at a local store.

How many GPU cores? I presume it's 14, if CPU is 8-core. I can add a 32 GPU core statistic to your benchmarks.

DTolm commented 2 years ago

I was just skimming through your code and found this: https://github.com/DTolm/VkFFT/commit/ba7001c30cd458d2196c0bdabf5caa52681016fc#r86062493

The transferData functions are only used for lookup table initialization during plan initialization. During execution, VkFFT just appends premade ComputePipelineStates to the ComputeCommandEncoder.
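For readers less familiar with Metal's encoding model, here is a rough metal-cpp sketch of that pattern (illustrative only; the function and argument layout are made up, not VkFFT's internals):

// Illustrative only: appending a prebuilt pipeline state to a caller-owned
// compute command encoder.
void encodeOnePass(MTL::ComputeCommandEncoder* encoder,
                   MTL::ComputePipelineState* pipeline,  // built once at plan init
                   MTL::Buffer* data,
                   MTL::Size grid, MTL::Size threadsPerGroup) {
    encoder->setComputePipelineState(pipeline);
    encoder->setBuffer(data, 0, 0);              // offset 0, buffer index 0
    encoder->dispatchThreadgroups(grid, threadsPerGroup);
    // No endEncoding() here: the caller keeps the encoder open and can append
    // further passes (or its own kernels) to the same encoder.
}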

I don't know whether Apple does

Nvidia is very misleading on this as well. Their profiler says that it uses shared memory.

Have you read the Metal feature set tables?

Some of them. Anyway, I don't use simd instructions in VkFFT, but I have other projects with similar structures that can test it. I will update on this later.

If possible, on the transistor/hardware level

L1/Shared memory has a bank structure that can serve multiple threads at the same time. I would expect the circuitry needed to do so at high clocks also makes it the most power-consuming. It is also the most expensive on-chip memory besides the register file.

Did anyone else besides me ask for Metal support? How in-demand was it?

There were a few more people and I wanted to play with this SoC myself. No one has asked for OpenGL/DirectX so far.

How many GPU cores? I presume it's 14, if CPU is 8-core. I can add a 32 GPU core statistic to your benchmarks.

14, the low-end M1 Pro. Benchmark 1000 tests all sequences up to 4096, similar to the A100 and MI250 plots in the repo. It would be interesting to see how the 32-core GPU compares to the 14-core one.

philipturner commented 2 years ago

From the README:

Test mobile GPUs (they should work)

I can now test this on iOS GPUs! OpenCL doesn't exist on iOS, but Metal does. I have an A15 (5 GPU cores) and an A10X (ancient) and could try testing both.

DTolm commented 2 years ago

I can now test this on iOS GPUs!

There have been reports from people using VkFFT on Android and I have not heard about bugs so far. I just haven't done the tests from the test suite myself.

philipturner commented 2 years ago

Do you have an iPhone or iPad? If so, I could show you how to set up an Xcode project to test it regularly. If not, I'd be happy to test it for you regularly. You could also use an iOS simulator, which should give results very similar to a real device, because the M1 shares the same CPU/GPU architecture as recent iOS chips.

philipturner commented 2 years ago

14, the low-end M1 Pro. Benchmark 1000 tests all sequences up to 4096, similar to the A100 and MI250 plots in the repo. It would be interesting to see how the 32-core GPU compares to the 14-core one.

From the command-line interface, could you provide me with the exact commands you use to benchmark, verbatim (e.g. ./Vulkan_FFT -option -option -option 1000)? Then I can profile and send results. Also, is there any way to run these benchmarks through straight C++ code, so I can test them on iOS?

DTolm commented 2 years ago

Setting up CI is an ongoing process, but I will first set it up for desktop GPUs, and it will have to wait until I actually finish the publication about VkFFT.

From the command-line interface, could you provide me with the exact commands you use to benchmark, verbatim?

./Vulkan_FFT -vkfft 1000

or ./Vulkan_FFT -vkfft 1000 -o output.txt to save the results to a file.

Also, any way to run these benchmarks through straight C++ code, so I can test them on iOS?

Hm, replacing lines 1003-1025 in Vulkan_FFT.cpp with

VkFFTResult resFFT = launchVkFFT(&vkGPU, 1000, file_output, output, 0);

will run the test at launch and save results to result.txt

philipturner commented 2 years ago

Currently running the benchmarks (fans at full blast, 72 Celsius), but I'm thinking about making this usable from Swift for iOS developers (Swift can't import C++ headers yet, just C). All we need is for this to be importable from C code. If we can pass a raw void pointer in place of the MTL::Buffer in LaunchParams etc., that should be all that's necessary. Perhaps you could add a compile option that switches to void* or the Objective-C API when targeting straight C code? Then add a section to the Docs PDF describing how to use it from Swift. I could validate that this works by creating a Swift package that uses VkFFT.

This can be useful, but I do not expect significant gains (probably ~10-20% at most, and only for rare systems). VkFFT can scale the code to any number of available threads, which is why the register count will differ depending on how many threads are available. It was just an unexpected discovery I made when I tried to get the max number of threads by compiling an empty kernel and using that value.

Does 3D-FFT R2C/C2R count as one of these rare systems? FFT size: 32-256 in all dimensions.

DTolm commented 2 years ago

VkFFT is C code for all other backends; I just couldn't find pure C bindings for Metal - only C++ and Objective-C. I went with C++ as it looked the easiest and closest to what I already have. I guess switching to Objective-C might be an option, but I have no idea, as I have not coded for Apple platforms before.

Does 3D-FFT R2C/C2R count as one of these rare systems? FFT size: 32-256 in all dimensions.

Maybe, so far I only know about the 2^22 system.

philipturner commented 2 years ago

I’ve been running the benchmark for almost an hour; how long should it take? Also, does the Metal backend support half-precision FFTs?

image

Temp shot up to 80 Celsius and fans suddenly blew super loud when it switched to “32768 Buffer”.

I can help you use Objective-C. I'm not a fan of ObjC, but that might be the easiest way to go about this. It would also eliminate the extra dependency on metal-cpp. The biggest concern is whether you can explicitly cast metal-cpp pointers to their ObjC counterparts.

philipturner commented 2 years ago

Maybe, so far I only know about the 2^22 system.

If you can’t squeeze that extra 10-20%, no big deal. FFTs are only a fraction of the computational load in molecular simulations anyway.

DTolm commented 2 years ago

I’ve been running the benchmark for almost an hour; how long should it take? Also, does the Metal backend support half-precision FFTs?

It will run until the first number column reaches 4096. I have made a full run on the 14-core M1 Pro:

M1_pro

For sequences from 2048 to 4096, the Rader algorithm gets more computationally expensive than Bluestein's algorithm, which switches to two uploads there. I will modify this in the next update.

metal-cpp

From what I see, metal-cpp is itself just a header-only collection of bindings.

does the Metal backend support half-precision FFTs?

I have not enabled half-precision FFTs anywhere outside the Vulkan backend. I'm not sure half precision is useful for FFTs, but I will update the other backends for it some time in the future.

philipturner commented 2 years ago

The easiest method for long-term maintenance might be to store the Metal API types as Objective-C types in the public interface, but internally reinterpret-cast them to metal-cpp types. That keeps the maintainability of using C++ bindings internally, while the public interface stays compatible with Swift, ObjC, and C++. GROMACS could use your single-header library and include metal-cpp in their own fashion. I'll make a PR once I've laid out and tested this change.

Also, I'll make a Swift package repo that copies the VkFFT and metal-cpp headers, with instructions that users should fork it and replace them with the most recent version of each header. Would you be okay with linking that as the "Swift bindings" on the VkFFT README?

Also, does the benchmark dynamically scale the number of iterations based on GPU size? I worry about small iOS GPUs taking 6 times as long as my M1 Max.

Are you using the metal::precise namespace for more precise versions of division and transcendental functions? The Metal Shading Language specification lists exact error bounds for the fast and precise versions; fast math is used by default. BTW, the M1 has hardware acceleration for sine/cosine, recip, and rsqrt. FMA seems to take 2 clock cycles. Use 16-bit integers instead of 32-bit where possible, because they have double the throughput; avoid 8-bit integers.

philipturner commented 2 years ago

Benchmark results: 32-core M1 Max in high-power mode, AC power, no external displays connected, starting from cold but fans turn on in ~1-5 minutes: output.txt

Swift bindings are much harder to generate than I thought. Using metal-cpp is a non-trivial task, and it won't compile in Xcode in a project targeting Swift through Objective-C headers. I tried rewriting all of the internals in Objective-C, but it's too tedious for me. There are also issues with reference counting, where in Objective-C you need an NSMutableArray* while in C++ you can have a MTL::Buffer**.

I don't think I have the patience to fix this. But if you decide to, my recommendation: make a separate compile path that makes the headers compatible with Swift. You only need to change each MTL::Buffer** in the public interface to void**, then cast back to MTL::Buffer** internally. On the Swift side, someone can write Unmanaged<MTLBuffer>.passUnretained(...).toOpaque() and pass in the raw C pointer. Next, fix the compilation errors with the metal-cpp dependency. You wouldn't have to change much internal code - just create an extra macro and document that to use this through Objective-C or Swift, you need to define that macro.
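As a rough sketch of that compile path (the macro and struct names here are made up, not part of VkFFT today):

// Hypothetical sketch of the macro-gated header described above. With the macro
// defined, the public struct exposes plain void** so the header can be imported
// from C or Swift; the C++ implementation casts back to the metal-cpp type.
#ifdef VKFFT_SWIFT_INTERFACE
typedef struct { void** buffer; } ExampleLaunchParams;
#else
typedef struct { MTL::Buffer** buffer; } ExampleLaunchParams;
#endif

// Inside the C++ implementation file:
static inline MTL::Buffer** exampleGetBuffer(ExampleLaunchParams* params) {
    return reinterpret_cast<MTL::Buffer**>(params->buffer);
}

On the Swift side, the raw pointer produced by Unmanaged<MTLBuffer>.passUnretained(...).toOpaque() would be stored in that void** slot, as described above.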

Either Swift or Objective-C is necessary to test on iOS. I can create a Swift iOS app if you make VkFFT usable from Swift.

DTolm commented 2 years ago

Benchmark results: 32-core M1 Max in high-power mode, AC power, no external displays connected, starting from cold but fans turn on in ~1-5 minutes: output.txt

M1_max

All results scale approximately 2x from M1 Pro, which is expected.

Either Swift or Objective-C is necessary to test on iOS. I can create a Swift iOS app if you make VkFFT usable from Swift.

I will do this at some point, but not right now.

philipturner commented 2 years ago

Weird that max bandwidth is 340 GB/s, given that the chip supports 400 GB/s. However, your M1 Pro maxed out at 170 GB/s, which is exactly half.

I will do this at some point, but not right now.

When you do, would you mind pinging me in this thread?

DTolm commented 2 years ago

Weird that max bandwidth is 340 GB/s, given that the chip supports 400 GB/s. However, your M1 Pro maxed out at 170 GB/s, which is exactly half.

The actual bandwidth is often lower than reported in the specification. The same is true for AMD/Nvidia - they have 1.3 TB/s real bandwidth, while the specification of HBM2 memory says it is 1.5 TB/s. Maybe the specification assumes higher clocks than are reached in practice, I don't know.

DTolm commented 2 years ago

Also if you check systems that can fit in L2 cache, like 128^3 (./Vulkan_FFT -vkfft 3), you will see that they can operate at even higher bandwidths - I get 320GB/s for 128^3 due to cache reuse between iterations.

DTolm commented 2 years ago

When you do, would you mind pinging me in this thread?

Sure, but no promises on when, as I know nothing about Swift - maybe someone else will do it faster.

philipturner commented 1 year ago

For a reason unrelated to VkFFT, it seems like I will finish the FP64 emulation library for Apple silicon. I also discovered that you can access hardware FP64 on AMD GPUs through AIR. I'm prototyping a modern OpenCL 3.0 driver for macOS, MoltenCL, which uses both mechanisms to enable double in OpenCL C.

Since Apple's OpenCL 1.2 driver stays deprecated while MoltenCL gets updated, my driver may see quite extensive use in the future. Therefore, we may want to anticipate testing VkFFT's OpenCL backend against MoltenCL. However, in the near term, is it possible to embed a final-product FP64 emulation library into the VkFFT Metal backend?

DTolm commented 1 year ago

is it possible to embed a final-product FP64 emulation library into the VkFFT Metal backend

This will depend on how it is implemented. Most math - multiplications, FMAs, etc. - is abstracted away from the FFT algorithms, so replacing the code will be fairly easy. There are some plain additions/assignments that are not abstracted, but I can fix them. If the driver understands operations written like dp_a = dp_b_cmplx.x + dp_b_cmplx.y, then it will require almost no work.

philipturner commented 1 year ago

For Metal, it will be a drop-in library defining a custom struct, which behaves like a double. For example:

struct metal_double {
  // ...
};

metal_double a = metal_double(1) + metal_double(2);
metal_double b = a * a;

For OpenCL, you can use the straight double keyword instead. Every FP64 operation defaults to a GPU-side function call, which you can force-inline depending on performance.
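To make the example above concrete, here is a minimal self-contained sketch of the wrapper pattern (not MetalFloat64's real interface; the real type would store packed 32-bit words and call emulation routines in each operator):

// Placeholder sketch of the operator-overloading pattern. Here the "emulated"
// type just wraps a host double so the example compiles anywhere; on the GPU,
// each operator body would call the FP64 emulation library instead.
struct metal_double {
    double v;  // placeholder storage; the real type would hold two 32-bit words
    explicit metal_double(double x) : v(x) {}
    metal_double operator+(metal_double other) const { return metal_double(v + other.v); }
    metal_double operator*(metal_double other) const { return metal_double(v * other.v); }
};

// Mirrors the usage shown above:
//   metal_double a = metal_double(1) + metal_double(2);
//   metal_double b = a * a;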

philipturner commented 1 year ago

I recall that you made certain optimizations for M1 after implementing the Metal backend. Were those optimizations backported to OpenCL? For reference, I'm creating MoltenCL and may use VkFFT through its OpenCL runtime. I want the same performance as if I used Metal directly.

DTolm commented 1 year ago

I have not yet made any M1-specific optimizations - currently they are the usual optimizations derived from the hardware specifications, so OpenCL should be similar in performance for now. I will make M1 optimizations in the future for balancing the prime factor algorithms, but this requires more testing.

philipturner commented 1 year ago

L1/Shared memory has a bank structure that can serve multiple threads at the same time. I would expect the circuitry needed to do so at high clocks also makes it the most power-consuming. It is also the most expensive on-chip memory besides the register file.

I've been reading about the structure of generic GPU architectures, and learned that shared/threadgroup memory uses banks. Did your investigations leave any clues into the bank size/stride on M1, or other μarch details? If not, I may need to test that myself.

Also, what is the arithmetic intensity and/or ALU utilization of VkFFT kernels? I'm making progress on MetalFloat64 and expect a performance ratio of ~1:30-40 between FP32 and FP64, for FMA. What is the distribution of operation types - mostly FMA, or many divisions/roots/transcendentals? I worry about under-optimizing the latter.

Is VkFFT double precision currently useful on consumer Nvidia GPUs with 1:32? I am seeking to use VkFFT double precision to run INQ on the M1 GPU.

DTolm commented 1 year ago

I've been reading about the structure of generic GPU architectures, and learned that shared/threadgroup memory uses banks. Did your investigations leave any clues into the bank size/stride on M1, or other μarch details? If not, I may need to test that myself.

Most GPUs use a 32-bank x 4-byte shared memory structure, so I use this number. Not all APIs provide this information, so it is a hard-coded constant for now - this works because the main target of bank conflict optimization is to break power-of-2 strides (and it is safe to assume that the number of ports is a power of 2).
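To illustrate why breaking power-of-2 strides helps, here is a generic sketch of the stride-padding idea (not VkFFT's generated code):

// Generic illustration of the stride-padding trick. With 32 banks x 4 bytes,
// column accesses at a stride of 32 floats hit the same bank for every thread
// in a warp/simdgroup; padding each row by one element spreads them out.
const int ROW = 32;   // logical row length (power of 2 -> bank conflicts)
const int PAD = 1;    // one extra element per row breaks the pattern
float sharedTile[(ROW + PAD) * ROW];

inline int tileIndex(int row, int col) {
    return row * (ROW + PAD) + col;   // effective stride of 33 instead of 32
}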

Also, what is the arithmetic intensity and/or ALU utilization of VkFFT kernels? I'm making progress on MetalFloat64 and expect a performance ratio of ~1:30-40 between FP32 and FP64, for FMA. What is the distribution of operation types - mostly FMA, or many divisions/roots/transcendentals? I worry about under-optimizing the latter.

ALU utilization is GPU dependent - for powers of 2 it is really low (and then decreasing as you add more different primes), so at least in this case FP64 should be useful. There are no operations other than FMA in VkFFT (or plain ADD, SUB and MUL); in double precision, all transcendentals are handled with a LUT.

Is VkFFT double precision currently useful on consumer Nvidia GPUs with 1:32? I am seeking to use VkFFT double precision to run INQ on the M1 GPU.

Yes, it is not that bad: for small lengths (up to 100), radix decomposition runs at 1/2 or 1/3 of peak bandwidth on a 3070, and Rader is 2x-3x slower than that. For bigger power-of-2 sequences it falls to 1/4 of peak bandwidth.

philipturner commented 1 year ago

I'm getting some more concrete estimates of FP32:FP64 FMA performance. I thought of some intermediate precisions with fewer mantissa bits but drastically higher performance. Any chance VkFFT could use float59_t for intermediate sums, converting back to float64_t before writing to memory?

| Precision | Exponent/Mantissa | GFLOPS FMA | GFLOPS ADD |
|---|---|---|---|
| FP32 | 8/24 | 10400 | 5200 |
| FP43 | 15/32 | ~350 (1:30) | ~1000 (1:10) |
| FP59 | 15/48 | ~250 (1:40) | ~700 (1:15) |
| FP64 | 11/53 | ~150 (1:70) | ~400 (1:25) |

These numbers are for the M1 Max, which has 400 GB/s = 50 billion 64-bit words per second of bandwidth.

Also how can I determine a more fine-grained distribution of the number of ADD vs MUL vs FMA instructions? Do you have performance data about GPUs with similar bandwidth, but different FP64 compute power? I could extrapolate the performance delta based on predicted FP64 emulation performance.

DTolm commented 1 year ago

Any chance VkFFT could use float59_t for intermediate sums, converting back to float64_t before writing to memory?

The support for different memory/calculation precision is already implemented in VkFFT, so this can be enabled.

Also how can I determine a more fine-grained distribution of the number of ADD vs MUL vs FMA instructions?

The easiest way is to use Nvidia ncu profiler.

Do you have performance data about GPUs with similar bandwidth, but different FP64 compute power?

image

There is this old benchmark plot (the current VkFFT version will have better results), but performance differs a lot between sequences. The best way will be to just try the emulation and see how it performs.

philipturner commented 1 year ago

After some thorough analysis of matrix multiplication kernels, I found the M1's shared memory bandwidth: 64 B/cycle, half that of AMD and NVIDIA. The global/L1/L2/SLC memory bandwidth is also 64 B/cycle, issued in smaller transactions. Before M1, Apple's architecture heavily emphasized 16-bit data types: if one entire warp writes 16-bit values to memory, that takes 64 bytes. AMD and Nvidia do not have 16-bit granularity (AMD uses 2x16 vectors, Nvidia is entirely 32-bit).

The memory cache line seems to be 128 B. L1D and L1I are both 12 KB, ~1/3 that of modern AMD/NVIDIA GPUs. The register file is 384 KB, giving the largest register:shared-memory ratio of any GPU. However, the threadgroup bandwidth means paging to the register file won't boost performance. I strongly recommend using 32-bit SIMD shuffles to boost bandwidth by 4x (256 B/cycle, faster than Nvidia).

philipturner commented 1 year ago

More statistics to help you optimize threadgroup size (sorry for writing so much). You can override the threadgroup size, but doing so just pages more or fewer registers to the stack. These are native numbers; Metal reports slightly different ones - for example, it wouldn't restrict the max threadgroup size at 72 simds/core.

| 32-bit Registers | Occupancy | Threadgroup Size | Threadgroup Memory |
|---|---|---|---|
| 32 | 96 simds | 1024 | 20 KB |
| 36 | 84 simds | 896 | 20 KB |
| 40 | 72 simds | 768 | 20 KB |
| 48 | 64 simds | 1024 | 32 KB |
| 52 | 56 simds | 896 | 32 KB |
| 56 | 52 simds | 832 | 32 KB |
| 64 | 48 simds | 768 | 32 KB |
| 68 | 44 simds | 704 | 32 KB |
| 76 | 40 simds | 640 | 32 KB |
| 84 | 36 simds | 576 | 32 KB |
| 96 | 32 simds | 512 | 32 KB |
| 108 | 28 simds | 448 | 32 KB |
| 128 | 24 simds | 384 | 32 KB |

DTolm commented 1 year ago

Thanks for this useful information! The SIMD shuffle version of VkFFT is planned, but first I need to complete a big restructuring of the code that has been planned for a long time.

philipturner commented 1 year ago

Also, if you're multiplying by prime integers to address into threadgroup memory, that may harm performance. Left shifts (<<1...4) have 1-cycle throughput (Int16: ILP ~1.5; Int32: ILP 4) because they're a fast path for aligned addressing. Generic IMUL32 and BITSHIFT have 4-cycle throughput. With SIMD shuffles you can reach across lanes without any multiplication.

If you're optimizing VkFFT for M1, the instruction throughput tables on metal-benchmarks are your go-to guide. Also read up on the ALU layout. You may want to fork that repo to preserve the information!

philipturner commented 1 year ago

You don't need Metal to access the simdgroup instructions from OpenCL. I'm making a dedicated header, although you could also just extract the few needed function declarations and manually copy them into shader code. Any chance the M1 OpenCL backend could incorporate the header? Make sure it's not exposed on x86 macOS.

DTolm commented 1 year ago

I still have not come up with a general algorithm for simdgroup data sharing for FFT - I know how to do it for powers of two, but these cases are already not computationally expensive. So this is not the current priority.

philipturner commented 1 year ago

What if you find prime numbers that fit inside the SIMD width? For example, you might subdivide an M1 SIMD-group as follows:

| Factor | Number of Subdivisions | Subdivision Size | Active Threads in Subdivision |
|---|---|---|---|
| 2 | 16 | 2 | 2 |
| 3 | 8 | 4 | 3 |
| 4 | 8 | 4 | 4 |
| 5 | 4 | 8 | 5 |
| 6 | 4 | 8 | 6 |
| 7 | 4 | 8 | 7 |
| 8 | 4 | 8 | 8 |
| 9 | 2 | 16 | 9 |
| 16 | 2 | 16 | 16 |
| 17 | 1 | 32 | 17 |
| 32 | 1 | 32 | 32 |

This way, you could trivially coordinate actions among threads. It might help if certain indexing operations require knowing where the first thread of a subdivision is: you simply AND the lane ID with the subdivision size minus one (see the sketch after the second table below). If integer modulus is not a bottleneck, you might instead do:

| Factor | Number of Subdivisions | Subdivision Size | Idle Threads at the Right End of the SIMD |
|---|---|---|---|
| 2 | 16 | 2 | 0 |
| 3 | 10 | 3 | 2 |
| 4 | 8 | 4 | 0 |
| 5 | 6 | 5 | 2 |
| 6 | 5 | 6 | 2 |
| 7 | 4 | 7 | 4 |
| 8 | 4 | 8 | 0 |
| 9 | 3 | 9 | 5 |
| 16 | 2 | 16 | 0 |
| 17 | 1 | 17 | 15 |
| 32 | 1 | 32 | 0 |
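For the power-of-two layout in the first table, the bookkeeping really is just bit masking - a generic sketch, not tied to VkFFT:

// Generic sketch: with power-of-two subdivision sizes, a thread finds its
// position inside its subdivision and the subdivision's first lane with masks.
unsigned positionInSubdivision(unsigned laneID, unsigned subdivSize) {
    return laneID & (subdivSize - 1);      // e.g. size 8: values 0-7 repeat
}
unsigned firstLaneOfSubdivision(unsigned laneID, unsigned subdivSize) {
    return laneID & ~(subdivSize - 1);     // base lane of this thread's group
}
// For the non-power-of-two layout in the second table, the same quantities need
// integer modulo instead: laneID % subdivSize, and laneID - (laneID % subdivSize).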

I still have not come up with a general algorithm for simdgroup data sharing for FFT

What exactly is the problem? What conditions must such a general algorithm satisfy?

gsgou commented 1 year ago

@philipturner do you have any sample showing how to use vkFFT in an iOS app?

philipturner commented 1 year ago

vkFFT's API uses the metal-cpp types for Metal buffers and Metal devices, but it should be compatible with iOS. You should be able to use C++ code from an Xcode project, and bridge it to Swift with some C bindings.

https://developer.apple.com/metal/cpp/

You will likely need to use UnsafeMutableRawPointer and Unmanaged, to send/receive object references through the C API.

https://developer.apple.com/documentation/swift/unmanaged

I recommend profiling the Accelerate vDSP FFT against vkFFT, and making sure vkFFT is faster. There are many cases where GPU acceleration (or even running on multiple CPU cores) makes an app slower.

https://developer.apple.com/documentation/accelerate/fast_fourier_transforms