AcademySoftwareFoundation / openexr

The OpenEXR project provides the specification and reference implementation of the EXR file format, the professional-grade image storage format of the motion picture industry.
http://www.openexr.com/
BSD 3-Clause "New" or "Revised" License

IlmBase SIMD optimization on Arm processors #96

Open sabotage3d opened 10 years ago

sabotage3d commented 10 years ago

Hello, I couldn't find any information on whether there is NEON-based SIMD optimization for Arm processors in IlmBase. If not, are there plans for that in the future?

Thanks in advance,

Alex

peterhillman commented 10 years ago

There's nothing ARM-specific in IlmBase or libImf that I'm aware of. Are you aware of any operations that would particularly benefit from such optimization?

The only SIMD optimization in OpenEXR is in libImf, for reading images into RGB or RGBA half-float textures using SSE2. Porting that to ARM might make sense, though I'm not clear on how many ARM-based devices have GPUs that support half-float textures.
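For illustration, here is a minimal sketch of what a NEON port of that half-to-float conversion could look like. This is not code from libImf; it assumes the Arm half-precision (FP16) extension is available (e.g. compiling with -mfpu=neon-fp16 on ARMv7), and the function name is hypothetical.

```cpp
// Hypothetical sketch: widening half-precision pixel data to float with
// NEON, analogous in spirit to libImf's SSE2 path. Requires the Arm
// half-precision (FP16) extension.
#include <arm_neon.h>
#include <cstdint>

void halfToFloat(const uint16_t* src, float* dst, int count)
{
    int i = 0;
    for (; i + 4 <= count; i += 4)
    {
        // Load four 16-bit halfs and convert them to four 32-bit floats.
        float16x4_t h = vreinterpret_f16_u16(vld1_u16(src + i));
        vst1q_f32(dst + i, vcvt_f32_f16(h));
    }
    // Scalar tail for the remaining count % 4 values.
    const __fp16* hsrc = reinterpret_cast<const __fp16*>(src);
    for (; i < count; ++i)
        dst[i] = static_cast<float>(hsrc[i]);
}
```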


sabotage3d commented 10 years ago

Thank you for your reply. According to a few articles, there can be a huge speedup for vector and matrix operations using NEON SIMD on Arm processors. I am still researching whether it is feasible to convert SSE code to NEON. It seems that Eigen already supports SSE, AltiVec, and ARM NEON.

http://eigen.tuxfamily.org/index.php?title=FAQ

http://computer-vision-talks.com/articles/2011-02-08-a-very-fast-bgra-to-grayscale-conversion-on-iphone/

Thanks,

Alex

peterhillman commented 10 years ago

The OpenEXR library itself doesn't rely heavily on libImath's vector and matrix operations, so reading and writing images on Arm processors wouldn't be accelerated much by SIMD optimizations there.

I believe that libImath types can be efficiently exchanged with libraries such as Eigen. Where performance is critical it might make sense to use such libraries to operate on libImath types.
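As a concrete sketch of that exchange (my illustration, not anything shipped with either library), Eigen's Map can wrap an Imath::M44f's storage in place, since M44f exposes its sixteen floats as a public row-major member x[4][4]:

```cpp
// Sketch: operating on an Imath::M44f through Eigen without a copy.
// Assumes Imath and Eigen are on the include path.
#include <ImathMatrix.h>
#include <Eigen/Dense>

void scaleInPlace(Imath::M44f& m, float s)
{
    // View the Imath storage as a row-major 4x4 float Eigen matrix.
    Eigen::Map<Eigen::Matrix<float, 4, 4, Eigen::RowMajor>> em(&m.x[0][0]);
    em *= s; // writes straight back into the Imath matrix
}
```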

sabotage3d commented 10 years ago

Thanks a lot for the tips. I will run some comparison tests between the libraries.

blackencino commented 10 years ago

I've been doing some testing with Eigen lately. Imath outperforms Eigen in all the cases I've tested so far, which have mostly involved 3x3 and 4x4 matrix operations (eigenvectors, SVD, etc.). Eigen's chief strengths lie in its much wider range of functionality, but definitely not in performance.

Chris



sabotage3d commented 10 years ago

Did you try any SIMD optimizations? I am more interested in seeing performance tests on Arm-based mobile devices. Can you share your test source code so I can run a quick test on an actual device? To get vectorization you need to use 4x4 matrices. Can you post your results as well?

Alex

sabotage3d commented 10 years ago

I made a quick comparison with the other libraries. I ran the test on OS X 10.8 (x64) with an Intel Core 2 Quad Q6600. These are the results:

Testing Eigen library Matrix4f class.
    Performing additions:       took   30 milliseconds.
    Performing multiplications: took   94 milliseconds.
Testing GLM library Matrix4f class.
    Performing additions:       took  133 milliseconds.
    Performing multiplications: took  616 milliseconds.
Testing CML library Matrix4f class.
    Performing additions:       took  186 milliseconds.
    Performing multiplications: took 1136 milliseconds.
Testing Imath library Matrix44 class.
    Performing additions:       took  139 milliseconds.
    Performing multiplications: took  432 milliseconds.

meshula commented 10 years ago

I think there's an interesting subtext here. The Eigen library does not have an appropriate license for many users, it is extremely extensive compared to most common needs, and it is somewhat burdensome to drag around on small projects due to its non-trivial size. The glm library is very useful where precision and correctness are a lesser concern, and its performance is being continually improved.

Imath gives reasonable performance with correctness and a fairly rich set of common operations, and it has a very friendly license. I think people come back to Imath again and again because it is concise, self-contained, reasonably performant, easy to use, and has known correctness double-checked by a good conformance suite. Peter's right that there are certain bulk operations worth porting to SIMD, but those operations do not really intersect the Imath problem space, so there is not much benefit to OpenEXR from SIMD-accelerating Imath. The subtext, I think, is that Imath has worth independent of EXR, and optimizations there would be welcomed by the larger community as long as the promises Imath makes as to ease of use, a reasonably large set of operations, and correctness are not violated.

@sabotage3d, it's difficult to make a judgement about what you are measuring there, since you haven't shown source. Operations like matrix multiplication are typically burdened by cache misses and less so by the math operations themselves. You can skew such benchmarks one way or the other by cache warming, or by contriving to keep all the operations in registers.
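To make that concrete, here is a hypothetical sketch of two loops that both "benchmark multiplication" yet measure very different things: the first streams matrices through memory and is dominated by cache behaviour, while the second keeps one accumulator hot and mostly measures the arithmetic itself.

```cpp
// Illustration of how a matrix-multiply benchmark can be skewed.
#include <ImathMatrix.h>
#include <vector>

void memoryBound(const std::vector<Imath::M44f>& a,
                 const std::vector<Imath::M44f>& b,
                 std::vector<Imath::M44f>& out)
{
    // Streams three large arrays; time is dominated by cache misses.
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = a[i] * b[i];
}

Imath::M44f registerBound(const Imath::M44f& m, int iterations)
{
    // One accumulator that stays in cache/registers; time is dominated
    // by the multiplies themselves.
    Imath::M44f acc; // identity by default
    for (int i = 0; i < iterations; ++i)
        acc = acc * m;
    return acc;
}
```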

I've had a go at SIMD-accelerating Imath with coworkers in the past, and you can achieve impressive speedups by contriving aligned loads and so on, but typically the compromise is that the code becomes somewhat more difficult to use, by virtue of what can be assigned to what and of trying to get type safety for types that end up aliased, like vec3 and vec4. So far, the attempts I've seen compromise Imath's promise of conciseness and correctness, either strongly or weakly, and would push me towards a solution like glm when I want the extra speed with reasonable dependencies.

I do feel that an accelerated Imath, or an accelerated Imath-like library, would be a welcome thing, if it were still Imath after the modification.

sopvop commented 10 years ago

I've done some tests, and it seems that adding an alignment annotation to Vec4f and Matrix4f helps gcc generate better SIMD code. Adding it to Imath_.h should be easy and would help a bit with speed. Alternatively, you can typedef something like Vec4fAligned and use it in your computation code. That will also help if you want to use SSE intrinsics, or Eigen (which has a nice Map<> class for wrapping existing data), in some places.
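A minimal sketch of that suggestion, using the GCC/Clang attribute syntax (standard C++11 alignas cannot be applied to a typedef); the typedef names are hypothetical:

```cpp
// Sidecar aligned typedefs for existing Imath types (GCC/Clang syntax).
#include <ImathVec.h>
#include <ImathMatrix.h>

typedef Imath::V4f  Vec4fAligned     __attribute__((aligned(16)));
typedef Imath::M44f Matrix44fAligned __attribute__((aligned(16)));
```

With the alignment guaranteed, loops over these types can use aligned SIMD loads and stores, or the data can be handed to SSE intrinsics or Eigen's Map without extra copying.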

As @meshula said, cache is the biggest problem. If you make your code look like for (...) { result[i] = matrixA[i] * matrixB[i]; }, the compiler should produce quite nicely optimized code. At least gcc with -ftree-vectorize -mfpmath=sse -msse4.1 does that for me.

meshula commented 10 years ago

A sidecar header of aligned typedefs would be a nice, non-intrusive addition. I imagine appropriate adornments exist for MSVC, icc, gcc, and clang; they would just need to be boiled into the right macro soup.
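A hypothetical sketch of that macro soup, covering the compilers mentioned; the macro and struct names are made up for illustration:

```cpp
// One alignment adornment per compiler family.
#if defined(_MSC_VER)
#  define IMATH_ALIGN(n) __declspec(align(n))
#elif defined(__GNUC__) || defined(__clang__) || defined(__INTEL_COMPILER)
#  define IMATH_ALIGN(n) __attribute__((aligned(n)))
#else
#  define IMATH_ALIGN(n)
#endif

#include <ImathMatrix.h>

// MSVC's adornment must appear in the declaration itself rather than on
// a typedef, so a thin derived struct carries the alignment portably.
struct IMATH_ALIGN(16) M44fAligned : public Imath::M44f
{
    M44fAligned() {}
    M44fAligned(const Imath::M44f& m) : Imath::M44f(m) {}
};
```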