Closed ashbob999 closed 3 months ago
Thank you for patches, @ashbob999! You may find LibSee useful in tracking such things. I haven't properly released it yet, but it may still be handy 🤗
Currently we target sandybridge, which gives us SSE4.1/POPCNT (no BMI); this means that serial is not actually serial. But changing it to an earlier CPU would mean that we lose POPCNT. So would these functions have to be dynamically dispatched as well?
Not sure. For 32-bit and 64-bit integers popcount SWAR variant is probably 4-6 cycles, right? If so, we can probably replace the intrinsic with SWAR on those.
Is there already an issue with tzcnt/lzcnt on processors without BMI, because they will use bsf/bsr instead and they differ slightly?
Haven't seen such issues, but I probably lack the right kind of hardware. I am afraid of using hardware emulation for tests. Maybe wiser to use some old instance kinds on AWS? How do you catch those?
Should we enable AVX/AVX512 instructions for 32-bit?
Probably no need for that.
Should ARM also default to a lesser instruction set?
This is a big one. We may want to separate a few more levels of Arm, similar to how it's done for x86. What do you think a good set should look like?
Not sure. For 32-bit and 64-bit integers popcount SWAR variant is probably 4-6 cycles, right? If so, we can probably replace the intrinsic with SWAR on those.
Could do. Thinking about the different ways it's built, we could use the SWAR versions only in the serial functions, so that the AVX versions can still use the supported native instructions. Obviously this might mean that on certain targets we would be under-utilizing them.
This would hopefully allow the dynamic libraries to run on any x86 hardware (unless there are other instructions that aren't supported).
Also, if we were using them in the serial functions only, we would have to make sure that it doesn't severely affect their performance. Do you have examples of the 32/64-bit SWAR versions of popcount?
Another thing to note is that clang/gcc already fall back to a non-native version when the arch is older than sandybridge (although I don't know how efficient these are, or what MSVC does).
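For reference, the 32-bit and 64-bit SWAR popcounts under discussion are usually written along these lines. This is a minimal sketch of the well-known bit-twiddling approach, not code taken from the library; the names popcount32_swar and popcount64_swar are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: SWAR population count for 32-bit words.
 * Sums bits in 2-bit, then 4-bit, then 8-bit lanes, and finally
 * adds the four byte lanes with a single multiply. */
static inline unsigned popcount32_swar(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit lane sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit lane sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit lane sums */
    return (unsigned)((x * 0x01010101u) >> 24);       /* total lands in the top byte */
}

/* Hypothetical helper: the same scheme widened to 64-bit words. */
static inline unsigned popcount64_swar(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;
    return (unsigned)((x * 0x0101010101010101ull) >> 56);
}
```

Each call is a handful of shifts, masks, adds, and one multiply, with no dependency on POPCNT, so something like this could back the serial functions while the AVX paths keep the native instruction. How it compares to the compiler's own fallback would still need measuring.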
Haven't seen such issues, but I probably lack the right kind of hardware. I am afraid of using hardware emulation for tests. Maybe wiser to use some old instance kinds on AWS? How do you catch those?
I agree. I guess some testing with manually replacing the tzcnt/lzcnt with bsf/bsr instructions could highlight any problems (it would also depend on where they are used, as to whether the difference between them can be seen).
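The difference shows up in two places: for a zero input, tzcnt/lzcnt return the operand width while bsf/bsr leave the destination undefined, and bsr returns the index of the highest set bit rather than the number of leading zeros. Below is a sketch of wrappers that behave the same on either kind of CPU, assuming the GCC/Clang builtins __builtin_ctzll and __builtin_clzll (MSVC would need _BitScanForward64/_BitScanReverse64 instead). The function names are hypothetical and not part of the library.

```c
#include <stdint.h>

/* Hypothetical wrapper: trailing-zero count with tzcnt semantics.
 * __builtin_ctzll is undefined for 0 (it may lower to bsf), so the
 * zero case is handled explicitly and returns the operand width. */
static inline unsigned ctz64_safe(uint64_t x) {
    if (x == 0) return 64;
    return (unsigned)__builtin_ctzll(x);
}

/* Hypothetical wrapper: leading-zero count with lzcnt semantics.
 * __builtin_clzll is undefined for 0 (it may lower to bsr plus a fixup). */
static inline unsigned clz64_safe(uint64_t x) {
    if (x == 0) return 64;
    return (unsigned)__builtin_clzll(x);
}

/* What bsr reports for a non-zero input: the index of the highest set bit,
 * which is 63 minus the leading-zero count. */
static inline unsigned highest_bit_index64(uint64_t x) {
    return 63u - clz64_safe(x); /* caller must ensure x != 0 */
}
```

If every call site already guarantees a non-zero argument, the tzcnt/bsf results agree and only the lzcnt/bsr index-versus-count difference matters.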
Should we enable AVX/AVX512 instructions for 32-bit?
Probably no need for that.
Okay, so should this be reflected in the CMake by adding checks for 32/64-bit?
This is a big one. We may want to separate a few more levels of Arm, similar to how it's done for x86. What do you think a good set should look like?
I don't know much about the different ARM targets and what each one supports. When I run the ARM neon_serial test on my phone, it still reports neon support, so armv8-a might still enable SIMD.
Hey @ashbob999! Have you noticed that the Cross Compilation (arm64, aarch64-linux-gnu) fails, as well as the Build Python 310 for windows-latest for 64-bit Arm, but for a different reason? I was planning to work on the library on Saturday. Any chance you have a patch we can test and merge?
Epic! I think it's time to merge!
This PR fixes many build issues related to the shared libraries.
CMake Fixes/Changes:
- POSITION_INDEPENDENT_CODE wasn't being checked on a per target basis.
- […] define_shared).
- […] setup.py method). This makes the dynamic dispatch work correctly when building shared libs.
- […] stringzillite still having libc dependencies (MSVC, Clang).

Code changes:
- […] sz_assert, because it uses libc function calls, and breaks when using -fPIC.
- […] memcpy, memset, memmove, memchr).
- […] sz_fill_serial, due to it being optimised to memset, which causes a recursive call when overriding libc.

TODO:
- […] AVX2/AVX512 on MSVC.

Questions:
- Currently we target sandybridge, which gives us SSE4.1/POPCNT (no BMI); this means that serial is not actually serial. But changing it to an earlier CPU would mean that we lose POPCNT. So would these functions have to be dynamically dispatched as well?
- Is there already an issue with tzcnt/lzcnt on processors without BMI, because they will use bsf/bsr instead and they differ slightly?
- Should we enable AVX/AVX512 instructions for 32-bit?
- Should ARM also default to a lesser instruction set?