Porting SIMD plugins - GitHub Issues

yuygfgg commented 3 months ago

I'm also interested in porting VS plugins to macOS and Linux, especially macOS on Apple Silicon. However, I've had great difficulty with plugins that hard-code SIMD, which fail to compile on non-x86 platforms. Currently I have to modify the code by hand to remove all of these SIMD optimizations. Do you have any ideas?

Here's an example of the ported plugin: https://github.com/yuygfgg/neo_f3kdb_crossplatform

Stefan-Olt commented 3 months ago

Great to hear that you want to port plugins!

For Linux and macOS on x86_64, the SIMD optimizations can almost certainly be used as they are (the only exception: pure assembly files that hard-code the Windows-style calling convention; those would have to be adapted). I was able to compile the original plugin on Linux without any problem.

For SSE intrinsics like the ones in this plugin there is a simple solution for ARM processors such as Apple Silicon: sse2neon (https://github.com/DLTcollab/sse2neon). It translates the SSE SIMD instructions to NEON SIMD instructions. It's of course not the optimal solution: not every SSE instruction maps directly to NEON, so some need multiple instructions, and NEON instructions that could improve speed but have no SSE equivalent go unused. But I still noticed a massive speed improvement. It is used in mvtools, for example (note that mvtools also uses hand-written aarch64 assembly taken from x264).

All that's needed is to include the sse2neon.h file; nothing needs to be installed. I would do that with the pre-processor: on an ARM/aarch64 platform include sse2neon, otherwise include the x86 SSE headers. In general I would not remove any code, since the pre-processor can disable code in certain scenarios. This way there is a chance that you can create a patch that is accepted upstream, and you don't have to maintain a separate plugin.
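A minimal sketch of the pre-processor approach described above. The function `add_saturate_u8` and the macro `HAVE_SSE2_API` are made up for illustration and do not come from any plugin; the sketch also assumes that `sse2neon.h` has been copied into the include path on ARM, and it falls back to plain C if the header is not found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the SIMD header per platform instead of deleting the SSE code. */
#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>            /* native SSE2 on x86 */
  #define HAVE_SSE2_API 1           /* hypothetical macro, for this sketch only */
#elif defined(__aarch64__) || defined(_M_ARM64)
  #if defined(__has_include)
    #if __has_include("sse2neon.h")
      #include "sse2neon.h"         /* same intrinsics, implemented with NEON */
      #define HAVE_SSE2_API 1
    #endif
  #endif
#endif

/* Saturating add of two 8-bit planes (hypothetical example kernel). */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#ifdef HAVE_SSE2_API
    /* Unchanged SSE2 code: 16 pixels per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Scalar tail, and the full fallback when no SIMD header is available. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The SSE2 loop itself is untouched; only the include block changes, which is what keeps such a patch small enough to propose upstream.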

yuygfgg commented 3 months ago

Great to hear that. I'll give it a try right away.

By the way, I'm now aiming to port all frequently used plugins to ARM macOS. It would be great if you had documentation for this project so that others like me could contribute more easily.

Stefan-Olt commented 3 months ago

This is currently experimental. My goals are to:

1. Get my Linux/macOS builds included in vsrepo (I already submitted a patch for Linux/macOS support, but it has not been merged yet).
2. Make the process a bit more automatic: currently I create the JSON build definitions by hand. I want a script that can update them automatically and create them by analyzing which build system is used (most likely not perfectly, so minor adjustments will still have to be made).
3. Improve the documentation to help people get more plugins working on Linux/macOS that are currently focused on Windows or x86_64.

Please note that most plugins don't really need porting: if they don't include any SIMD, they will either compile directly or need only minor fixes in the build system. I would strongly encourage you not to create forks, but rather fixes that can be merged into the plugin's repo, so that there is one plugin that compiles on many platforms. If the plugin is not actively maintained, or the author doesn't want to merge the fix, my build tool can apply a patch.

yuygfgg commented 3 months ago

That's exactly what I want. For now I keep my notes here, simply pasting every command I run.

~~And for sse2neon, I see it can directly replace *mmintrin.h, but how can I replace intrin.h without prefix and x86intrin.h?~~ solved myself

Stefan-Olt commented 3 months ago

I've seen that you tried my vsrepo fork, and it seems to work perfectly for you. Its database does not yet include builds for anything other than Windows, so it will tell you that there is no binary available for your platform, which is (unfortunately) fully correct. But you can already use it to install platform-independent scripts like havsfunc. Maybe you can test that and comment on the pull request I opened, in the hope that it gets merged once more people have tested it: https://github.com/vapoursynth/vsrepo/pull/224

yuygfgg commented 3 months ago

It's true that I can install scripts like havsfunc, but I still need to manually compile the hundreds of dependencies. I'm still working on that.

Stefan-Olt commented 3 months ago

Yes, some (very few) of them you can download from the releases here

yuygfgg commented 3 months ago

I've just observed 2 strange things.

1. Some of the x86-SIMD-only plugins seem to compile without any modification on ARM, especially those using meson and ninja.
2. For the plugins mentioned above, manually porting with sse2neon provides little performance improvement. For example, AddGain runs at 645.99fps with sse2neon, which is lower than the 671.52fps without sse2neon.

Also, sse2neon neo_f3kdb runs at 743.19fps, while the non-SIMD one got 776.95fps

Stefan-Olt commented 3 months ago

1. Some of the x86 SIMD only plugins seem to compile without any modification on Arm, especially those using meson and ninja.

That's not strange at all. It's good practice to ensure that assembly code is only used on the correct platform, and it's not difficult to do: usually you write the C code first, then you figure out which parts take the most time and could be optimized, and you write alternative implementations of those functions in assembly. You use the C pre-processor to enable the optimized code only on the correct platform. At runtime (for x86) you detect the processor features and select the best implementation to use.
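The three steps above can be sketched roughly as follows. All names (`invert_c`, `invert_sse2`, `select_invert`, `HAVE_INVERT_SSE2`) are made up for illustration, and the runtime check relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`; real plugins often ship their own CPU detection instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*invert_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Step 1: portable C reference implementation, always compiled. */
static void invert_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)(255 - src[i]);
}

/* Step 2: optimized variant, only compiled on the platform it targets. */
#if defined(__x86_64__) && defined(__GNUC__)
#include <emmintrin.h>
#define HAVE_INVERT_SSE2 1   /* hypothetical macro, for this sketch only */
static void invert_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
    const __m128i ones = _mm_set1_epi8(-1);   /* all bits set */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(v, ones));
    }
    invert_c(dst + i, src + i, n - i);        /* scalar tail */
}
#endif

/* Step 3: pick the best available implementation at runtime. */
invert_fn select_invert(void)
{
#ifdef HAVE_INVERT_SSE2
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return invert_sse2;
#endif
    return invert_c;   /* every other platform ends up here */
}

void invert(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    select_invert()(dst, src, n);
}
```

A plugin structured like this compiles on ARM without changes, because the SSE2 function is never seen by the compiler there and the selector simply returns the C version.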

2. For these plugins I mentioned above, manually porting with sse2neon provides little performance improvement. For example, AddGain runs at 645.99fps with sse2neon, which is lower than the 671.52fps without sse2neon.

That is strange indeed. This is the result of my znedi3 experiment:

macOS 14 on Apple M1 Max:
nnedi3:                 66 fps
znedi3:                 16 fps
znedi3 (with sse2neon): 68 fps

Ubuntu 22.04 on Ryzen 9 5900X:
nnedi3:                 123 fps
znedi3:                 196 fps

As you can tell, sse2neon gave a massive improvement. nnedi3 does have native ARM NEON assembly, so it is already fast on Apple Silicon. nnedi3 does not have AVX assembly; I assume that's the main reason why znedi3 is faster than nnedi3 on x86.

Are you sure the SSE functions are actually used? Most likely you have compiled the SSE parts but don't use them: at the point where the implementation is chosen at runtime, the code falls back to the C implementation because it assumes that SSE is not available on ARM. Those are the places where you have to modify the code.
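A sketch of the kind of change meant here, under stated assumptions: `PLUGIN_USE_SSE2NEON` is a hypothetical build macro that a port would define when the SSE code is compiled through sse2neon, and `choose_impl` / `impl_name` are illustrative names, not functions from any plugin. The x86 branch uses the GCC/Clang builtin `__builtin_cpu_supports`.

```c
#include <assert.h>

typedef enum { IMPL_C = 0, IMPL_SSE2 = 1 } impl_kind;

/* Decide which code path the plugin's dispatcher should take. */
impl_kind choose_impl(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* Original behaviour: ask the CPU. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? IMPL_SSE2 : IMPL_C;
#elif defined(__aarch64__) && defined(PLUGIN_USE_SSE2NEON)
    /* Added for the port: the "SSE2" functions are really NEON code here,
     * and NEON is part of the AArch64 baseline that compilers target,
     * so this path can be selected unconditionally. */
    return IMPL_SSE2;
#else
    return IMPL_C;
#endif
}

/* Human-readable name, handy for a one-time log line at plugin load. */
const char *impl_name(impl_kind k)
{
    assert(k == IMPL_C || k == IMPL_SSE2);
    return k == IMPL_SSE2 ? "sse2" : "c";
}
```

Printing `impl_name(choose_impl())` once when the plugin loads is a cheap way to confirm that a benchmark is really measuring the SIMD path.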

yuygfgg commented 3 months ago

I finally found out what happened. My test script set the input video as output 0 and the processed one as output 1, so vspipe simply output the raw video and the plugins weren't even called!

yuygfgg commented 3 months ago

I found that using -Ofast -ftree-vectorize -fopenmp gives a much larger speedup than sse2neon. Of course, using both is even better.

-O0 + sse2neon: 40fps
-Ofast -ftree-vectorize: 400fps
-Ofast -ftree-vectorize + sse2neon: 590fps

yuygfgg commented 2 months ago

I'm now hosting my macOS ARM plugins at https://github.com/yuygfgg/Macos_vapoursynth_plugins

Stefan-Olt commented 1 month ago

I would not compile with -Ofast: it enables -ffast-math, which allows the compiler to reorder floating-point calculations. That can change rounding, let rounding errors propagate, and reduce quality. -O0 is of course bad: it means no optimization at all, but fast compilation. I would recommend -O3 (which includes -ftree-vectorize), as it's the highest optimization level that is still standards-conformant:

-O0: No optimization at all, very fast compilation, good for debugging
-O1: Optimizations that only slightly increase compile time
-O2: All optimizations from -O1, additionally those that can increase compile time a lot more, but not increase size of output binary
-O3: All optimizations from -O2, additionally those that could increase size of output binary
-Ofast: All optimizations from -O3, additionally those that violate language specifications (like reordering math operations resulting in different results due to rounding)

The difference between -O3 and -Ofast is probably also not that big (in general the differences get smaller at higher optimization levels), have you tried that?
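To illustrate why reordering math changes results, here is a small self-contained sketch (the function names are made up). It adds the same three doubles in two different orders using ordinary IEEE arithmetic: the values 1e100, -1e100 and 1.0 sum to 1.0 front to back, but to 0.0 back to front, because 1.0 is lost when it is added to -1e100 first. A compiler running with -ffast-math is allowed to make this kind of reordering on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Sum front to back: ((v[0] + v[1]) + v[2]) ... */
double sum_forward(const double *v, size_t n)
{
    double s = 0.0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* Sum back to front: same numbers, different association order. */
double sum_backward(const double *v, size_t n)
{
    double s = 0.0;
    assert(v != NULL || n == 0);
    for (size_t i = n; i-- > 0; )
        s += v[i];
    return s;
}
```

Pixel data rarely spans such extreme magnitudes, so real differences are usually confined to the last bits, but they can accumulate across filter stages. That is why it is worth comparing the output of an -Ofast build against an -O3 build before relying on it.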

yuygfgg commented 1 month ago

Yeah, I have realized that.

I now use -Ofast only after testing.

-O3 is often a tiny bit slower (for BM3D, 8.56fps with -Ofast vs 8.13fps with -O3 on an M2 Pro).

By the way, BM3D always produces different output with NEON than with C, for both the sse2neon version (with all precision flags on) and my handwritten NEON version. The difference is larger than the one between SSE and C.

Stefan-Olt / vs-plugin-build

Porting SIMD plugins #1