Stefan-Olt / vs-plugin-build

VapourSynth plugin build system for Linux/macOS (experimental!)
GNU Lesser General Public License v2.1

Porting SIMD plugins #1

Open yuygfgg opened 4 weeks ago

yuygfgg commented 4 weeks ago

I'm also interested in porting VS plugins to macOS and Linux, especially macOS on Apple Silicon. However, I ran into great difficulty with plugins that hard-code SIMD, which fail to compile on non-x86 platforms. Currently I have to manually modify the code to remove all these SIMD optimizations. Do you have any ideas?

Here's an example of the ported plugin: https://github.com/yuygfgg/neo_f3kdb_crossplatform

Stefan-Olt commented 3 weeks ago

Great to hear that you want to port plugins!

For Linux and macOS on x86_64, the SIMD optimizations can almost certainly be used as they are (only exception: pure assembly files that hard-code the Windows-style calling convention; those would have to be adapted). I was able to compile the original plugin on Linux without any problem.

For SSE intrinsics like in this plugin there is a simple solution for ARM processors like Apple Silicon: sse2neon: https://github.com/DLTcollab/sse2neon It can convert all the SSE SIMD instructions to NEON SIMD instructions. It's of course not the optimal solution: not all SSE instructions can be mapped directly to NEON, so some need multiple instructions, while NEON instructions that have no SSE equivalent and could improve speed aren't used. But I still noticed a massive speed improvement. It is used, for example, in mvtools (note that mvtools also uses hand-written aarch64 assembly taken from x264).

All that's needed is to include the sse2neon.h file; nothing needs to be installed. I would do that using the pre-processor: if it's an ARM/aarch64 platform, include sse2neon, otherwise include the x86 SSE headers. In general I would not remove any code; the pre-processor can disable code in certain scenarios. This way there is a chance that you can create a patch that is accepted upstream and you don't have to maintain a separate plugin.
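The header switch described above could look roughly like the following. This is a minimal sketch, not code from any of the plugins mentioned: the macro `HAVE_SSE2_API` and the function `add_sat_u8` are hypothetical names chosen for illustration, and it assumes `sse2neon.h` has been copied next to the source file. If the header is missing on ARM, the sketch falls back to the plain C loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the header that provides the SSE2 intrinsics for this target. */
#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>            /* native SSE2 on x86 */
  #define HAVE_SSE2_API 1           /* hypothetical macro name */
#elif defined(__aarch64__) || defined(_M_ARM64) || defined(__ARM_NEON)
  #if defined(__has_include)
    #if __has_include("sse2neon.h")
      #include "sse2neon.h"         /* SSE intrinsics implemented on top of NEON */
      #define HAVE_SSE2_API 1
    #endif
  #endif
#endif

/* Saturating add of two 8-bit buffers; the SIMD body is unchanged SSE2 code. */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#ifdef HAVE_SSE2_API
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Scalar tail, and the full loop when no SSE2 API is available. */
    for (; i < n; ++i) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}
```

Because only the include changes, the intrinsic code itself stays identical on both platforms, which is what keeps such a change small enough to propose upstream.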

yuygfgg commented 3 weeks ago

Great to hear that. I would give it a try right away.

By the way, I'm now aiming to port all frequently used plugins to ARM macOS. It would be great if this project had documentation so others like me can contribute more easily.

Stefan-Olt commented 3 weeks ago

This is currently experimental, my goal is to

  1. Get my Linux/macOS builds included in vsrepo (I already submitted a patch for Linux/macOS support, but it is not yet merged)
  2. Make the process a bit more automatic: currently I create the JSON build definitions by hand; my goal is to have a script that can update them automatically and create them by analyzing which build system is used (most likely not perfectly, so minor adjustments will still have to be made)
  3. Improve documentation to help people get more plugins working on Linux/macOS that are currently Windows or x86_64 focused

Please note that for most plugins porting is not really needed: if they don't include any SIMD, they will either compile directly or need some minor fixes in the build system. I would highly encourage you not to create forks, but rather fixes that can be merged into the plugin repo, so that one plugin compiles on many platforms. In case the plugin is not actively maintained or the author doesn't want to merge the fix, my build tool has the ability to apply a patch.

yuygfgg commented 3 weeks ago

That's exactly what I want. For now I keep these here, simply pasting every command I ran.

And for sse2neon, I see it can directly replace *mmintrin.h, but how can I replace x86intrin.h and intrin.h (the one without a prefix)? Edit: solved it myself.

Stefan-Olt commented 3 weeks ago

I've seen that you tried my vsrepo fork: it seems that it works perfectly for you. Its database does not yet include builds for anything other than Windows, so it will tell you that there is no binary available for your platform, which is (unfortunately) fully correct. But you can already use it to install platform-independent scripts like havsfunc. Maybe you can test that and comment on the pull request I opened, in the hope that it gets merged once more people have tested it: https://github.com/vapoursynth/vsrepo/pull/224

yuygfgg commented 3 weeks ago

It's true that I can install scripts like havsfunc, but I still need to manually compile the hundreds of dependencies. I'm still working on that.

Stefan-Olt commented 3 weeks ago

Yes, some (very few) of them you can download from the releases here

yuygfgg commented 3 weeks ago

I've just observed 2 strange things.

  1. Some of the x86 SIMD only plugins seem to compile without any modification on Arm, especially those using meson and ninja.
  2. For these plugins I mentioned above, manually porting with sse2neon provides little performance improvement. For example, AddGain runs at 645.99fps with sse2neon, which is lower than the 671.52fps without sse2neon.

Also, sse2neon neo_f3kdb runs at 743.19fps, while the non-SIMD one got 776.95fps

Stefan-Olt commented 3 weeks ago

  1. Some of the x86 SIMD only plugins seem to compile without any modification on Arm, especially those using meson and ninja.

That's not strange at all. It's good practice to ensure assembly code is only used on the correct platform. It's also not difficult to do that: usually you write the C code first, then you figure out which parts take the most time and could be optimized, and you write alternative implementations of those functions in assembly. You use the C pre-processor to enable the optimized code only on the correct platform. At runtime (for x86) you detect the processor features and select the best implementation to use.
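The pattern described above can be sketched as follows. This is an illustration only, assuming GCC or Clang on x86_64 (where `__builtin_cpu_supports` is available); the names `invert_c`, `invert_sse2`, `select_invert` and `HAVE_X86_SSE2_PATH` are hypothetical and not taken from any plugin. On every other platform the optimized function is never compiled, so the file builds unmodified and silently uses the C version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*invert_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Step 1: portable reference implementation, always compiled. */
static void invert_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)(255 - src[i]);
}

/* Step 2: optimized variant, compiled only where it can work. */
#if defined(__x86_64__) && defined(__GNUC__)
#include <emmintrin.h>
#define HAVE_X86_SSE2_PATH 1        /* hypothetical macro name */
static void invert_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
    const __m128i ones = _mm_set1_epi8(-1);   /* all bits set */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(v, ones));
    }
    invert_c(dst + i, src + i, n - i);        /* scalar tail */
}
#endif

/* Step 3: runtime selection; falls back to C when nothing better exists. */
invert_fn select_invert(void)
{
#ifdef HAVE_X86_SSE2_PATH
    if (__builtin_cpu_supports("sse2"))
        return invert_sse2;
#endif
    return invert_c;
}
```

A caller would typically run `select_invert()` once at plugin initialization and store the returned pointer, rather than probing the CPU on every frame.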

2. For these plugins I mentioned above, manually porting with sse2neon provides little performance improvement. For example, AddGain runs at 645.99fps with sse2neon, which is lower than the 671.52fps without sse2neon.

That is strange indeed. This is the result of my znedi3 experiment:

macOS 14 on Apple M1 Max:
nnedi3:                 66 fps
znedi3:                 16 fps
znedi3 (with sse2neon): 68 fps

Ubuntu 22.04 on Ryzen 9 5900X:
nnedi3:                 123 fps
znedi3:                 196 fps

As you can tell, sse2neon gave a massive improvement. nnedi3 does have native ARM NEON assembly, therefore it is already fast on Apple Silicon. nnedi3 does not have AVX assembly; I assume that's the main reason why znedi3 is faster than nnedi3 on x86.

Are you sure the SSE functions are used? Most likely you have compiled the SSE parts but don't use them, because at the point where the implementation is chosen at runtime, the code will pick the C implementation on the assumption that SSE is not available on ARM. Those are the points where you have to modify the code.
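One way to check for this pitfall is to make the selection point explicit and print which implementation was picked. The sketch below is hypothetical: `USE_SSE2NEON` stands for whatever macro a port defines when it includes sse2neon.h, and the function names are invented for illustration. The point is the middle branch: on an ARM build with sse2neon, the "SSE2" path has to be reported as usable, because an x86 CPU feature check would never say so.

```c
#include <assert.h>
#include <stdio.h>

enum impl { IMPL_C, IMPL_SSE2 };

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
/* x86: ask the CPU at runtime. */
static int sse2_path_usable(void) { return __builtin_cpu_supports("sse2") != 0; }
#elif defined(USE_SSE2NEON)
/* ARM port (hypothetical macro): the translated intrinsics always work on NEON. */
static int sse2_path_usable(void) { return 1; }
#else
/* Anything else: only the C implementation exists. */
static int sse2_path_usable(void) { return 0; }
#endif

enum impl select_impl(void)
{
    return sse2_path_usable() ? IMPL_SSE2 : IMPL_C;
}

const char *impl_name(enum impl i)
{
    assert(i == IMPL_C || i == IMPL_SSE2);
    return i == IMPL_SSE2 ? "sse2" : "c";
}

/* Temporary diagnostic: call once at startup to confirm the chosen path. */
void report_impl(void)
{
    fprintf(stderr, "selected implementation: %s\n", impl_name(select_impl()));
}
```

If the diagnostic prints `c` on an sse2neon build, the SIMD code was compiled but is not being called, and a benchmark would only be measuring the C path.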

yuygfgg commented 3 weeks ago

I finally found out what happened. My test script set the input video as output 0 and the processed one as output 1, so vspipe simply output the raw video and the plugins weren't even called!

yuygfgg commented 3 weeks ago

I found that using -Ofast -ftree-vectorize -fopenmp gives a much larger speedup than sse2neon alone. Of course, using both is even better.

-O0 + sse2neon:                      40fps
-Ofast -ftree-vectorize:             400fps
-Ofast -ftree-vectorize + sse2neon:  590fps
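For context, this is the kind of loop a compiler's auto-vectorizer can usually turn into NEON or SSE code on its own at higher optimization levels. It is a hypothetical kernel written for illustration (not AddGain's or neo_f3kdb's actual code): the `restrict` qualifiers tell the compiler the buffers do not overlap, which is often what makes vectorization possible. Note that in GCC `-ftree-vectorize` is already implied by `-O3`, and `-Ofast` additionally turns on `-ffast-math`, which can change floating-point results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: add a constant gain to 8-bit samples, clamped to 0..255. */
void add_gain_u8(uint8_t *restrict dst, const uint8_t *restrict src,
                 size_t n, int gain)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        int v = src[i] + gain;
        dst[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
```

Comparing the generated assembly at `-O0` and `-O3` (for example with `cc -O3 -S`) shows whether the loop was actually vectorized on a given compiler and target.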

yuygfgg commented 1 week ago

I'm now hosting my macOS ARM plugins at https://github.com/yuygfgg/Macos_vapoursynth_plugins