ShiftMediaProject / gmp

Unofficial GMP with added custom native Visual Studio project build tools. GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers.
https://gmplib.org/
GNU Lesser General Public License v3.0

Request for clarification: What target is actually built? #11

GitMensch opened this issue 2 years ago

GitMensch commented 2 years ago

It would be good to know this, but it isn't part of the documentation:

* Is the generation identical to --enable-fat, so it contains code for all CPUs (bigger lib)?
* Is the generation identical to the generic target (so much smaller than the above, but with the worst performance, also working on all CPUs)?
* Is it possible to explicitly generate for a specific CPU?

Sibras commented 2 years ago
* Is the generation identical to --enable-fat, so it contains code for all CPUs (bigger lib)?

No, it only contains a single implementation.

* Is the generation identical to the generic target (so much smaller than the above, but with the worst performance, also working on all CPUs)?

The 32-bit version targets any generic x86 processor (so works on anything). The 64-bit version is built against an SSE3-or-newer baseline: all 64-bit processors must support SSE2 as a minimum anyway, and apart from some very early AMD CPUs, all 64-bit processors have SSE3 or better. The assembly code for the specific optimised functions is generated from the following optimised variants (for x86_64):

x86_64/coreisbr/addmul_2.asm
x86_64/core2/aorrlsh_n.asm
x86_64/core2/aorrlsh1_n.asm
x86_64/core2/aorrlsh2_n.asm
x86_64/core2/aors_n.asm
x86_64/core2/aorsmul_1.asm
x86_64/bdiv_dbm1c.asm
x86_64/bdiv_q_1.asm
x86_64/cnd_aors_n.asm
x86_64/fastsse/com-palignr.asm
x86_64/core2/copyd.asm
x86_64/core2/copyi.asm
x86_64/div_qr_1n_pi1.asm
x86_64/div_qr_2n_pi1.asm
x86_64/dive_1.asm
x86_64/core2/divrem_1.asm
x86_64/divrem_2.asm
x86_64/core2/gcd_11.asm
x86_64/core2/hamdist.asm
x86_64/invert_limb.asm
x86_64/invert_limb_table.asm
x86_64/core2/logops_n.asm
x86_64/core2/lshift.asm
x86_64/core2/lshiftc.asm
x86_64/lshsub_n.asm
x86_64/mod_1_1.asm
x86_64/mod_1_2.asm
x86_64/mod_1_4.asm
x86_64/pentium4/mod_34lsub1.asm
x86_64/mode1o.asm
x86_64/mul_1.asm
x86_64/mul_2.asm
x86_64/core2/mul_basecase.asm
x86_64/core2/mullo_basecase.asm
x86_64/core2/popcount.asm
x86_64/core2/redc_1.asm
x86_64/core2/rsh1aors_n.asm
x86_64/core2/rshift.asm
x86_64/core2/sec_tabselect.asm
x86_64/core2/sqr_basecase.asm
x86_64/sqr_diag_addlsh1.asm
x86_64/core2/sublsh1_n.asm
x86_64/core2/sublsh2_n.asm

with default values from: x86_64/coreisbr/gmp-mparam.h

Is it possible to explicitly generate for a specific CPU?

Not easily; you would have to replace the assembly files with the ones written for the different architecture. However, apart from one or two functions, the 64-bit lib is using routines that are recommended for Sandy Bridge and newer architectures (still compatible with older hardware as well), so there's not that much more performance you can get out of it.

GitMensch commented 2 years ago

Thanks for the list - can you please include hints about that (possibly referencing this issue) in the documentation?

In any case it would be good to either directly add additional projects that are more CPU-specific, or to document how best to adjust the current one (possibly a PowerShell/batch script that adjusts the project configuration?) - this could likely be handled in a different issue. Please also see the GMP mailing list, where a user reported unexpected performance and use of functions and, after switching to the "correct" CPU (in this case by using msys2 to compile everything):

got a 4% improvement from doing this. I see the divide code now using bsr to do lzcnt so that makes more sense to me.

Otherwise... What do you think of switching to the "fat" configuration (compilation for all CPUs with runtime detection of features)?

Sibras commented 2 years ago

Please also see the GMP mailing list, where a user reported unexpected performance and use of functions and, after switching to the "correct" CPU (in this case by using msys2 to compile everything):

got a 4% improvement from doing this. I see the divide code now using bsr to do lzcnt so that makes more sense to me.

Ahh, that's actually a little bit different, as it is due to gmp using inline assembly that is not supported by the msvc compiler. So, as I mentioned previously, the external assembly in 64-bit builds is already using modern optimised variants. The difference you're referring to here is due to 'inline' assembly optimisations that are compiler specific and just do not work with Visual Studio's compiler. New msvc-specific code would have to be written and added for that (see longlong.h for where it is missing).
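
For illustration, here is a minimal sketch of the kind of msvc-specific replacement that would be needed (a hedged example for this write-up, not code from this repo or from the patch discussed below): longlong.h defines count_leading_zeros via GCC inline asm on x86_64, and msvc offers the _BitScanReverse64 intrinsic in <intrin.h> for the same job:

/* Sketch only: an msvc intrinsic version of GMP's count_leading_zeros
 * macro from longlong.h. Assumes a 64-bit limb and x != 0, as the real
 * macro requires on most CPUs. */
#include <intrin.h>

#define count_leading_zeros(count, x)        \
  do {                                       \
    unsigned long __idx;                     \
    _BitScanReverse64(&__idx, (x));          \
    (count) = 63 - (int)__idx;               \
  } while (0)

_BitScanReverse64 compiles to bsr and runs on any x64 CPU; the __lzcnt64 intrinsic would map to lzcnt directly, but it silently returns wrong results on CPUs without LZCNT support, so a conservative bsr-based version is presumably the safer default.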

Otherwise... What do you think of switching to the "fat" configuration (compilation for all CPUs with runtime detection of features)?

Using fat requires a fair bit of work, and last time I checked it had issues with msvc compilation (it was a long time ago though). And fat only enables switching for functions written in 'external' asm, so it doesn't actually provide much advantage over the existing implementation as it still won't support the inline assembly.
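
(As a purely illustrative aside - this is not GMP's actual fat mechanism - the runtime-detection idea amounts to selecting a function pointer once via CPUID, for example with the msvc __cpuid intrinsic:)

/* Illustration only of the runtime dispatch that --enable-fat performs for
 * 'external' asm routines; GMP's real implementation differs, and the names
 * here are hypothetical. */
#include <intrin.h>

typedef void (*mpn_fn)(void);          /* hypothetical routine type */

static void impl_generic(void) { /* portable fallback */ }
static void impl_sse3(void)    { /* SSE3-tuned variant */ }

static mpn_fn select_impl(void)
{
    int info[4];
    __cpuid(info, 1);                   /* CPUID leaf 1: feature bits */
    return (info[2] & 1) ? impl_sse3    /* ECX bit 0 = SSE3 */
                         : impl_generic;
}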

So there will always be some slight performance differences between compilers, but I'm guessing most of the missing performance is due to missing inline asm (at least the lzcnt issue mentioned is inline asm). That requires writing custom msvc code, which for these sorts of functions is something I've done before, but I won't have the time to actually test whether any of the changes work.

GitMensch commented 2 years ago

That's good to know. So builds may have less performance than builds done from "upstream" with MSYS2 because of the missing inline assembly, which the msvc compiler does not support.

I suggest wrapping all that information up in the docs (README?) and then closing this issue as solved.

... and maybe create a follow-up issue for the last point, which requires writing and adding custom msvc code (to be done later; if you could add one example with a CPU feature you can test, then you could reference it in that issue, allowing others to work on this).

Sibras commented 2 years ago

... and maybe create a follow-up issue for the last point, which requires writing and adding custom msvc code (to be done later; if you could add one example with a CPU feature you can test, then you could reference it in that issue, allowing others to work on this).

Actually I just created a patch to add msvc intrinsics for some of the missing inline asm. I just need someone to check that the results are still valid (i.e. test that calculations are still correct). If I add the patch here, are you capable of double-checking it?

GitMensch commented 2 years ago

I just need someone to check that the results are still valid (i.e. test that calculations are still correct)

Hm, what about copying the Makefiles and scripts in from the matching release (only copying files that don't already exist) [or running .bootstrap in WSL/MSYS2/Cygwin to generate those], then running the shipped tests in MSYS/MSYS2/Cygwin against the compiled library?

I'll have a look at how this should be done if you aren't familiar with that.

Sibras commented 2 years ago

Hm, what about copying the Makefiles and scripts in from the matching release (only copying files that don't already exist) [or running .bootstrap in WSL/MSYS2/Cygwin to generate those], then running the shipped tests in MSYS/MSYS2/Cygwin against the compiled library?

Not a bad idea; I gave it a go and was able to get some of the tests to run. Unfortunately, due to the way the config script changes the build, I couldn't get everything working because of symbol differences between the msvc-compiled lib and the mingw-compiled tests. That, and the fact that the changes are to macro defines and not to functions directly, so some of the changes aren't directly picked up in the mingw case.

I had, however, already tested the changes in isolation by comparing the output to the pre-existing gcc assembly, and everything tested correctly, which is why I have already added the changes to the repo. If someone has a benchmark they can use, I would be interested to see what difference the changes made.
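
For anyone wanting a quick check without the shipped test suite, a minimal stand-alone smoke test along these lines (an illustrative sketch using only the public mpz API, not the benchmark asked about above) can confirm the built library links and computes correctly, and gives a crude number for comparing builds:

/* Minimal stand-alone sketch (not the shipped GMP test suite): checks one
 * big-integer identity and crudely times mpz_mul so two builds of the
 * library can be compared. Link against the built gmp lib. */
#include <stdio.h>
#include <time.h>
#include <gmp.h>

int main(void)
{
    mpz_t a, b, r, check;
    mpz_inits(a, b, r, check, NULL);

    /* Correctness: (2^4096)^2 must equal 2^8192. */
    mpz_ui_pow_ui(a, 2, 4096);
    mpz_mul(r, a, a);
    mpz_ui_pow_ui(check, 2, 8192);
    printf("mul check: %s\n", mpz_cmp(r, check) == 0 ? "ok" : "BROKEN");

    /* Crude timing loop for build-to-build comparison. */
    mpz_ui_pow_ui(b, 3, 2048);
    clock_t t0 = clock();
    for (int i = 0; i < 100000; i++)
        mpz_mul(r, a, b);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("100000 x mpz_mul: %.3f s\n", secs);

    mpz_clears(a, b, r, check, NULL);
    return 0;
}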

GitMensch commented 2 years ago

If someone has a benchmark they can use, I would be interested to see what difference the changes made.

There was someone on the GMP mailing list with a benchmark who could re-run it if the vcpkg package were updated. Could you have a look at providing a new release and then update the reference in https://github.com/microsoft/vcpkg/blob/master/ports/gmp/portfile.cmake#L8? (Maybe first check the patches vcpkg uses to get this repo building, and consider applying them here directly before the release? The patches are: one missing define, __gmp_vasprintf - I guess there's no issue in adding that - and two adjustments to the VS files; I haven't checked whether those are appropriate.)

As soon as a new version is available @ vcpkg I can ask "the benchmarker" to do the checks.

xq114 commented 2 years ago

The assembly code for the specific optimised functions is generated from the following optimised variants (for x86_64): [list quoted above]

Is it possible to explicitly generate for a specific CPU?

Not easily; you would have to replace the assembly files with the ones written for the different architecture. [...]

Could you please explain how you generate those assembly files from the original asm sources? If I want to replace the files with ones written for another architecture, should I use some kind of generator or just rewrite them from scratch?

Sibras commented 2 years ago

The tuning values from 'gmp-mparam.h' can be changed by just copying across a different 'gmp-mparam.h' from one of the suitable folders within the 'mpn' directory and placing it into the SMP\x86 or SMP\x86-64 folder accordingly.

To change one of the assembly files to one written for a different architecture is a bit more demanding. You'll need a bash shell and the following code for x86:

cat > config.m4 << 'EOF'
dnl config.m4.  Generated automatically by configure.
changequote(<,>)
ifdef(<__CONFIG_M4_INCLUDED__>,,<
define(<CONFIG_TOP_SRCDIR>,<`../.'>)
define(<WANT_ASSERT>,0)
define(<WANT_PROFILING>,<`no'>)
define(<M4WRAP_SPURIOUS>,<no>)
define(<TEXT>, <.text>)
define(<DATA>, <.data>)
define(<LABEL_SUFFIX>, <:>)
define(<GLOBL>, <.globl>)
define(<GLOBL_ATTR>, <>)
define(<GSYM_PREFIX>, <_>)
define(<RODATA>, <  .section .rdata,"dr">)
define(<TYPE>, <>)
define(<SIZE>, <>)
define(<LSYM_PREFIX>, <L>)
define(<W32>, <.long>)
define(<ALIGN_LOGARITHMIC>,<no>)
define(<ALIGN_FILL_0x90>,<yes>)
define(<HAVE_COFF_TYPE>, <yes>)
define(<GOT_GSYM_PREFIX>, <>)
define(<WANT_SHLDL_CL>, <1>)
define(<SQR_TOOM2_THRESHOLD>,<30>)
define(<BMOD_1_TO_MOD_1_THRESHOLD>,<21>)
define(<SIZEOF_UNSIGNED>,<4>)
define(<GMP_LIMB_BITS>,32)
define(<GMP_NAIL_BITS>,0)
define(<GMP_NUMB_BITS>,eval(GMP_LIMB_BITS-GMP_NAIL_BITS))
>)
changequote(`,')
ifdef(`__CONFIG_M4_INCLUDED__',,`
include(CONFIG_TOP_SRCDIR`/mpn/asm-defs.m4')
include_mpn(`x86/x86-defs.m4')
define_not_for_expansion(`HAVE_HOST_CPU_i686')
define_not_for_expansion(`HAVE_ABI_32')
define_not_for_expansion(`HAVE_LIMB_LITTLE_ENDIAN')
define_not_for_expansion(`HAVE_DOUBLE_IEEE_LITTLE_ENDIAN')
')
define(`__CONFIG_M4_INCLUDED__')
EOF

Then for each desired assembly file (changing the filename and operation accordingly):

cp -f ./x86/aors_n.asm ./
m4 -DHAVE_CONFIG_H -D__GMP_WITHIN_GMP -DOPERATION_add_n aors_n.asm > ../SMP/mpn/x86/add_n.s

For x86_64:

cat > config.m4 << 'EOF'
dnl config.m4.
changequote(<,>)
ifdef(<__CONFIG_M4_INCLUDED__>,,<
define(<CONFIG_TOP_SRCDIR>,<`../.'>)
define(<WANT_ASSERT>,0)
define(<WANT_PROFILING>,<`no'>)
define(<M4WRAP_SPURIOUS>,<no>)
define(<TEXT>, <.text>)
define(<DATA>, <.data>)
define(<LABEL_SUFFIX>, <:>)
define(<GLOBL>, <.globl>)
define(<GLOBL_ATTR>, <>)
define(<GSYM_PREFIX>, <>)
define(<RODATA>, <  .section .rdata,"dr">)
define(<TYPE>, <>)
define(<SIZE>, <>)
define(<LSYM_PREFIX>, <L>)
define(<W32>, <.long>)
define(<ALIGN_LOGARITHMIC>,<no>)
define(<ALIGN_FILL_0x90>,<yes>)
define(<HAVE_COFF_TYPE>, <yes>)
define(<SQR_TOOM2_THRESHOLD>,<34>)
define(<BMOD_1_TO_MOD_1_THRESHOLD>,<16>)
define(<SIZEOF_UNSIGNED>,<4>)
define(<GMP_LIMB_BITS>,64)
define(<GMP_NAIL_BITS>,0)
define(<GMP_NUMB_BITS>,eval(GMP_LIMB_BITS-GMP_NAIL_BITS))
>)
changequote(`,')
ifdef(`__CONFIG_M4_INCLUDED__',,`
include(CONFIG_TOP_SRCDIR`/mpn/asm-defs.m4')
include_mpn(`x86_64/x86_64-defs.m4')
include_mpn(`x86_64/dos64.m4')
define_not_for_expansion(`HAVE_HOST_CPU_x86_64')
define_not_for_expansion(`HAVE_ABI_64')
define_not_for_expansion(`HAVE_LIMB_LITTLE_ENDIAN')
define_not_for_expansion(`HAVE_DOUBLE_IEEE_LITTLE_ENDIAN')
')
define(`__CONFIG_M4_INCLUDED__')
EOF

Then for each assembly file (again changing the filename and operation accordingly):

cp -f ./x86_64/coreisbr/addmul_2.asm ./
m4 -DHAVE_CONFIG_H -D__GMP_WITHIN_GMP -DOPERATION_addmul_2 addmul_2.asm > ../SMP/mpn/x86_64/addmul_2.s