BinomialLLC / basis_universal

Basis Universal GPU Texture Codec
Apache License 2.0
2.72k stars 267 forks source link

ASTC weights SIMD encoding #298

Open ronanbel opened 2 years ago

ronanbel commented 2 years ago

ssse3 (I5 6300) : 163 => 136 ms arm (A53) : 340 => 282 ms I moved the block weight transform code in a single function : pack_astc_block_weights you can enable/disable the SIMD code with a define BASISD_ASTC_SIMD All the simd code is annotated

tested x86_64 on windows, compiled with VS2019 and clang 11 tested arm & arm64 on android, compiled with latest NDK (clang11)

if needed, you can get in touch at : ronan.bel@gmail.com ronan.bel@ubisoft.com

richgel999 commented 2 years ago

Thank you - this is great. I normally shy away from merging code that I can't easily maintain, but let me see what I can do. How much does this help encoding perf.?

ronanbel commented 2 years ago

Hi Richard. I did a second submit today (there was an error in the first one) with additional optimization (uastc_unpack) on my test device (arm a53), it lasts 200ms from 340ms (base code) to transcode the uastc to astc (the same on an A57) it's also faster on PC I'm using a big texture for the tests, do a CRC32 at the end to detect if the decoding gives the same result. If you need some other optims or modifications, don't hesitate to let me know. And if you have a unit-test asset (a test.basis) which has all modes & subsets & anchor patterns, it would help me to track my modifications are valid

have a nice day

Le jeu. 12 mai 2022 à 09:00, Rich Geldreich @.***> a écrit :

Thank you - this is great. I normally shy away from merging code that I can't easily maintain, but let me see what I can do. How much does this help encoding perf.?

— Reply to this email directly, view it on GitHub https://github.com/BinomialLLC/basis_universal/pull/298#issuecomment-1124606814, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOL7CCQAODQFKKYWU5QX2LVJSUBBANCNFSM5VIMSKDQ . You are receiving this because you authored the thread.Message ID: @.***>

ronanbel commented 2 years ago

re-hi (from home after a long day at work)

define BASISD_ASTC_SIMD // 287 => 240

define BASISD_UASTC_SIMD // 240 => 222 (231 without

BASISD_USE_UNALIGNED_WORD_READS)

define BASISD_ASTC_24WRITE

here are the switches to enable most of my modifications (some like a hint for the compiler to do a fast div3/5 are not between #if) I'm using this version of the code on the project I'm working on, at least on this dataset, all is fine. I may do some other modifications (if needed, if I have some spare time because I did most of this during the week-end) I understand what you mean by "I can't easily maintain", I would do the same. All I can say is that I will continue to work in the game industry for at least 15/20 years (probably not at Ubisoft), so I think you can ask me for maintenance for the next decade :) (anyway, you may leave this between #if defined and leave the user the ability to use the simd version if he needs some speedup) (there is not so much simd code, most added lines are here to setup tables for further simd processing)

Le jeu. 12 mai 2022 à 20:06, ronan bel @.***> a écrit :

Hi Richard. I did a second submit today (there was an error in the first one) with additional optimization (uastc_unpack) on my test device (arm a53), it lasts 200ms from 340ms (base code) to transcode the uastc to astc (the same on an A57) it's also faster on PC I'm using a big texture for the tests, do a CRC32 at the end to detect if the decoding gives the same result. If you need some other optims or modifications, don't hesitate to let me know. And if you have a unit-test asset (a test.basis) which has all modes & subsets & anchor patterns, it would help me to track my modifications are valid

have a nice day

Le jeu. 12 mai 2022 à 09:00, Rich Geldreich @.***> a écrit :

Thank you - this is great. I normally shy away from merging code that I can't easily maintain, but let me see what I can do. How much does this help encoding perf.?

— Reply to this email directly, view it on GitHub https://github.com/BinomialLLC/basis_universal/pull/298#issuecomment-1124606814, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOL7CCQAODQFKKYWU5QX2LVJSUBBANCNFSM5VIMSKDQ . You are receiving this because you authored the thread.Message ID: @.***>

ronanbel commented 2 years ago

How much does this help encoding perf sorry I didn't answer your questions (12h day at work today, I'm tired) I don't think it helps the encoder it's only in the transcoder one part in the uastc_unpack one part in the transcoding to astc (the one I needed) all is for the runtime client code (parsing uastc, then converting the astc)

Le jeu. 12 mai 2022 à 20:51, ronan bel @.***> a écrit :

re-hi (from home after a long day at work)

define BASISD_ASTC_SIMD // 287 => 240

define BASISD_UASTC_SIMD // 240 => 222 (231 without

BASISD_USE_UNALIGNED_WORD_READS)

define BASISD_ASTC_24WRITE

here are the switches to enable most of my modifications (some like a hint for the compiler to do a fast div3/5 are not between #if) I'm using this version of the code on the project I'm working on, at least on this dataset, all is fine. I may do some other modifications (if needed, if I have some spare time because I did most of this during the week-end) I understand what you mean by "I can't easily maintain", I would do the same. All I can say is that I will continue to work in the game industry for at least 15/20 years (probably not at Ubisoft), so I think you can ask me for maintenance for the next decade :) (anyway, you may leave this between #if defined and leave the user the ability to use the simd version if he needs some speedup) (there is not so much simd code, most added lines are here to setup tables for further simd processing)

Le jeu. 12 mai 2022 à 20:06, ronan bel @.***> a écrit :

Hi Richard. I did a second submit today (there was an error in the first one) with additional optimization (uastc_unpack) on my test device (arm a53), it lasts 200ms from 340ms (base code) to transcode the uastc to astc (the same on an A57) it's also faster on PC I'm using a big texture for the tests, do a CRC32 at the end to detect if the decoding gives the same result. If you need some other optims or modifications, don't hesitate to let me know. And if you have a unit-test asset (a test.basis) which has all modes & subsets & anchor patterns, it would help me to track my modifications are valid

have a nice day

Le jeu. 12 mai 2022 à 09:00, Rich Geldreich @.***> a écrit :

Thank you - this is great. I normally shy away from merging code that I can't easily maintain, but let me see what I can do. How much does this help encoding perf.?

— Reply to this email directly, view it on GitHub https://github.com/BinomialLLC/basis_universal/pull/298#issuecomment-1124606814, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOL7CCQAODQFKKYWU5QX2LVJSUBBANCNFSM5VIMSKDQ . You are receiving this because you authored the thread.Message ID: @.***>

ronanbel commented 1 year ago

Hi, yes you can do a cast on the shift(always left) to satisfy GCC and you can replace the vdup_n_u8(0) by a vdup_16(0), logically it's the same (I'm only using clang & msvc) (sorry I don't have git on this PC)

Le lun. 10 avr. 2023 à 05:26, Andrei Alexeyev @.***> a écrit :

@.**** commented on this pull request.

In transcoder/basisu_transcoder.cpp https://github.com/BinomialLLC/basis_universal/pull/298#discussion_r1161405354 :

  • uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), bitNum0 ); // bitMask = (1U << n) - 1U
  • uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), bitNum1 ); // bitMask = (1U << n) - 1U

This fails on GCC without -flax-vector-conversions:

../subprojects/basis_universal/transcoder/basisu_transcoder.cpp: In function 'void basist::pack_astc_block_weights(uint8_t, const uint8_t, int, int)': ../subprojects/basis_universal/transcoder/basisu_transcoder.cpp:12010:80: note: use '-flax-vector-conversions' to permit conversions between vectors with differing element types or numbers of subparts 12010 uint8x8_t rev8 = vqmovn_u16( vcombine_u16( rev8lohi, vdup_n_u8(0) ) ); // 8bits in 4 u8 (clear lower 32) ~~^~~~~~~~ ../subprojects/basis_universal/transcoder/basisu_transcoder.cpp:12010:101: error: cannot convert 'uint8x8_t' to 'uint16x4_t' 12010 uint8x8_t rev8 = vqmovn_u16( vcombine_u16( rev8lohi, vdup_n_u8(0) ) ); // 8bits in 4 u8 (clear lower 32) ~~~^
uint8x8_t

An explicit cast fixes it: ⬇️ Suggested change

  • uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), bitNum0 ); // bitMask = (1U << n) - 1U
  • uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), bitNum1 ); // bitMask = (1U << n) - 1U
  • uint16x8_t bitMask0 = vshlq_u16( vdupq_n_u16(1), (int16x8_t)bitNum0 ); // bitMask = (1U << n) - 1U
  • uint16x8_t bitMask1 = vshlq_u16( vdupq_n_u16(1), (int16x8_t)bitNum1 ); // bitMask = (1U << n) - 1U

— Reply to this email directly, view it on GitHub https://github.com/BinomialLLC/basis_universal/pull/298#pullrequestreview-1377041364, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOL7CARDZELDN4VTHPMGJDXAN4YBANCNFSM5VIMSKDQ . You are receiving this because you authored the thread.Message ID: @.***>