port arch-specific code to msvc

shuffle2 commented 2 years ago

when first finding this project and skimming the readme, i saw that msvc is recommended against. After looking at the code a bit I'm assuming this is mainly because cpu_features for arm + x86 is not implemented in a way that's compatible with msvc.

I'm opening this issue to see if that's the case (or, is there some other factor involved?), and maybe to remind myself to come back and implement it (not sure when I'll have time, tho).

ebiggers commented 2 years ago

I just hardly ever use MSVC at all, so I haven't gotten the architecture-specific code working with it --- not just the runtime CPU feature detection, but all other architecture-specific code too. Also whenever I measure performance, I'm using gcc or clang, so those are the compilers I'm implicitly optimizing for.

I don't have any plans to start using MSVC more.

shuffle2 commented 2 years ago

I've made a start here: https://github.com/shuffle2/libdeflate/commits/msvc So far arm64 should be fully ported, although I don't have hardware supporting sha3 so I haven't actually tested that path. The other paths at least appear to be correct (have not done comparison benchmarks yet). I also have not gone back and ensured I didn't mess anything up wrt other compilers / arm targets pre-armv8 (it seems msvc no longer ships toolchain targeting armv7, so I'd only be able to easily check on other compilers, anyways).
edit: oops, the arm(32bit) msvc toolchain is still available, I just did not have it installed. I can install and test with that (however the target is essentially pointless). edit2: on arm64 I've used _ARM64_DISTINCT_NEON_TYPES which changes how msvc headers define primitive vector types, allowing initialization syntax currently used in libdeflate. There's no equivalent for ARM (I opened a bug here), and the arm32 target is low-priority enough that I will just leave it unimplemented until that is fixed.

Do you have any comments about the changes so far?

I think most of the remaining effort will be rewriting some x86-specific code to avoid gcc vector extensions.

To be clear, I'm hoping the code can be massaged such that msvc compat "just works", and you don't need to invest effort into it. But it will require change to some coding style (dropping gcc vector extension usage, maybe some tweaking to semantics of NATIVE,TARGET,INTRIN) which would be understandable if you're not enthusiastic about.

shuffle2 commented 2 years ago

On the ARM topic: one of the related comments on msvc bug tracker does make a good point - the ARM documentation specifies:

The vector data types and arrays of the vector data types cannot be initialized by direct literal assignment. You must initialize them using one of the load intrinsics.

Maybe I will come back to this and make libdeflate conform to that. I haven't decided if it's worth it, yet.

shuffle2 commented 2 years ago

OK, at this point x86,x64,arm64 are working on both msvc and gcc. for gcc, i've been checking that arm and arm64 still compile, and testing x64 performance vs. msvc.

Currently for x64 (amd zen2 - 3990x w/SMT enabled), it seems msvc and gcc are about equal for crc32, msvc is faster for adler32, and gcc is faster for actual compress/decompress. I haven't investigated what causes the perf differences at this point.

for checksum benchmarking, I'm using a 250MiB file. msvc: checksum -t -A reports ~61500 MB/s gcc: checksum -t -A reports ~38000 MB/s (~45000 MB/s before my changes, so there's something weird here)

for the benchmark executable, i'm currently seeing (using the same 250MiB .tgz file): msvc: 100 MB/s compress / 18000 MB/s decompress gcc: 112 MB/s compress / 20000 MB/s decompress (unlike adler32, there doesn't seem to be noticeable change before/after my changes, here)

msvc is 19.33.31629 with /arch:AVX2 gcc is gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0 with -march=native.

As far as msvc functionality goes, the only remaining thing would be to properly support function multi-targeting, which should just be shuffling some code around.

ebiggers commented 2 years ago

That's a lot of changes... Some of them look fine. If you open pull requests with some of the individual changes, I can consider them. I can't merge this all at once in its current form.

It's really unfortunate that MSVC doesn't support regular C operations on vectors like gcc and clang do. Do you know if Microsoft plans to fix that?

When adding any new restrictions to the codebase to make MSVC happy, when possible please find a way to make gcc and clang disallow the same thing too. For example, when removing the shadowed variables, -Wshadow should be added to CFLAGS.

Why can the __attribute__((target)) specifiers be removed? Don't they need to be replaced with something else? How does MSVC know which CPU features to target?

Please don't do things like #define __BMI2__. Just don't check for __BMI2__ if MSVC; check for whatever it needs.

What's wrong with const void *?

What's wrong with close(), read(), write(), etc.?

ebiggers commented 2 years ago

I'm just fixing a few of the small things myself: https://github.com/ebiggers/libdeflate/pull/222 and https://github.com/ebiggers/libdeflate/pull/221.

One question: shouldn't __STDC_VERSION__ < 201112L check for defined(__STDC_VERSION__) first?

shuffle2 commented 2 years ago

That's a lot of changes... Some of them look fine. If you open pull requests with some of the individual changes, I can consider them. I can't merge this all at once in its current form.

Yup, makes sense.

It's really unfortunate that MSVC doesn't support regular C operations on vectors like gcc and clang do. Do you know if Microsoft plans to fix that?

I opened a feature request about it. Voting on the bugs/requests requires logging in to the site but it does help items at least get attention.

When adding any new restrictions to the codebase to make MSVC happy, when possible please find a way to make gcc and clang disallow the same thing too. For example, when removing the shadowed variables, -Wshadow should be added to CFLAGS.

OK.

Why can the __attribute__((target)) specifiers be removed? Don't they need to be replaced with something else? How does MSVC know which CPU features to target?

There is no equivalent to __attribute__((target)). The only way to change which instructions the compiler backend may emit from (non-intrinsic) code is to use the /arch flag on the command line, which operates on a translation-unit level. MSVC always supports all intrinsics (for a given base architecture. It does not accept ARM intrinsics when compiling for x64 for example), no matter what /arch is set to. Furthermore, this flag is very coarse on x64, and even more so on other architectures.

This means the construct used in the x86 crc32 for example, cannot be directly used with msvc to generate different versions of the function. Instead, these instantiations of the function would need to be split into their own compilation units. Similarly, the BMI2 implementation of the decompress routine would need to be able to be processed as its own compilation unit which is compiled with /arch:AVX2.

I also opened a feature request about this.

Please don't do things like #define __BMI2__. Just don't check for __BMI2__ if MSVC; check for whatever it needs.

Do you mean it should be this instead:

#ifdef __BMI2__ || (defined(_MSC_VER) && defined(__AVX2__))
#  define HAVE_BMI2_NATIVE  1
...

? I was trying to provide __BMI2__ in case other parts of the code check for that instead of HAVE_BMI2_NATIVE. That isn't currently the case, however matchfinder_impl.h does check for __AVX2__ and __SSE2__ directly. Probably those should be changed to use HAVE_<X>_NATIVE? Note msvc does not define __SSE2__, either. There is a similar check in the arm version of the file which I changed to use HAVE_NEON_NATIVE.

What's wrong with const void *?

The intrinsic is technically defined with __m128i const * argument type. I was just being pedantic. There is nothing wrong with const void * per se. I guess I could change it back.

What's wrong with close(), read(), write(), etc.?

These are technically POSIX standard functions, and in some cases the version provided by msvc's CRT does not match the POSIX standard (e.g. return type of read/write is int). In general, Microsoft made the name change to prefer adhering to ISO C instead of POSIX. I could just silence the warning I guess.

shuffle2 commented 2 years ago

One question: shouldn't __STDC_VERSION__ < 201112L check for defined(__STDC_VERSION__) first?

Yes, you're right (it is not defined if using the default, legacy mode).

ebiggers commented 2 years ago

Why can the __attribute__((target)) specifiers be removed? Don't they need to be replaced with something else? How does MSVC know which CPU features to target?

There is no equivalent to __attribute__((target)). The only way to change which instructions the compiler backend may emit from (non-intrinsic) code is to use the /arch flag on the command line, which operates on a translation-unit level. MSVC always supports all intrinsics (for a given base architecture. It does not accept ARM intrinsics when compiling for x64 for example), no matter what /arch is set to. Furthermore, this flag is very coarse on x64, and even more so on other architectures.

This means the construct used in the x86 crc32 for example, cannot be directly used with msvc to generate different versions of the function. Instead, these instantiations of the function would need to be split into their own compilation units. Similarly, the BMI2 implementation of the decompress routine would need to be able to be processed as its own compilation unit which is compiled with /arch:AVX2.

I also opened a feature request about this.

I'm not very excited about splitting out code into separate compilation units to work around this. Especially because the library is designed to be compiled without any special compiler flags. Requiring per-file flags would ruin that.

Please don't do things like #define __BMI2__. Just don't check for __BMI2__ if MSVC; check for whatever it needs.

Do you mean it should be this instead:
#ifdef __BMI2__ || (defined(_MSC_VER) && defined(__AVX2__))
#  define HAVE_BMI2_NATIVE    1
...
? I was trying to provide __BMI2__ in case other parts of the code check for that instead of HAVE_BMI2_NATIVE. That isn't currently the case, however matchfinder_impl.h does check for __AVX2__ and __SSE2__ directly. Probably those should be changed to use HAVE_<X>_NATIVE? Note msvc does not define __SSE2__, either. There is a similar check in the arm version of the file which I changed to use HAVE_NEON_NATIVE.

Yes, the HAVE_*_NATIVE macros are already an abstraction layer over the native macros, so they should be used that way. Those direct uses of the native macros are older ones that I forgot to update.

What's wrong with const void *?

The intrinsic is technically defined with __m128i const * argument type. I was just being pedantic. There is nothing wrong with const void * per se. I guess I could change it back.

The pointer isn't actually properly aligned for __m128i *, so I'd rather just use void to avoid confusion.

What's wrong with close(), read(), write(), etc.?

These are technically POSIX standard functions, and in some cases the version provided by msvc's CRT does not match the POSIX standard (e.g. return type of read/write is int). In general, Microsoft made the name change to prefer adhering to ISO C instead of POSIX. I could just silence the warning I guess.

Could you define close to _close (et al.), or does that cause a problem too? If it still causes a problem, I'd prefer if you just suppressed the warning, given that there is no actual problem with using the non-underscore versions, if I understand correctly?

ebiggers commented 2 years ago

I'm really not sure that I want to go all the way to limit all the vector code to conform with MSVC's limitations. It makes things a lot messier to develop on other platforms.

Also none of this gets around the fact that I don't do performance testing with MSVC.

What is your use case for MSVC, anyway? If you're incorporating libdeflate into an application, have you considered just building a static library with gcc or clang, and using that? Or even just using the prebuilt static library that comes with every release?

shuffle2 commented 2 years ago

The intrinsic is technically defined with __m128i const * argument type. I was just being pedantic. There is nothing wrong with const void * per se. I guess I could change it back.

The pointer isn't actually properly aligned for __m128i *, so I'd rather just use void to avoid confusion.

OK. As an aside, I tried replacing all _mm_loadu_si128 in that file with _mm_lddqu_si128 just as a quick test, and didn't notice any perf difference, but was only looking at overall benchmark results, not fine grained analysis.

What's wrong with close(), read(), write(), etc.?

These are technically POSIX standard functions, and in some cases the version provided by msvc's CRT does not match the POSIX standard (e.g. return type of read/write is int). In general, Microsoft made the name change to prefer adhering to ISO C instead of POSIX. I could just silence the warning I guess.

Could you define close to _close (et al.), or does that cause a problem too? If it still causes a problem, I'd prefer if you just suppressed the warning, given that there is no actual problem with using the non-underscore versions, if I understand correctly?

_CRT_NONSTDC_NO_WARNINGS removes the deprecation attribute on the functions and will work fine. For the most part using the MS names is just a standards conformance thing.

I'm really not sure that I want to go all the way to limit all the vector code to conform with MSVC's limitations. It makes things a lot messier to develop on other platforms.

I definitely agree the source code looks more verbose, however the generated code should be literally identical (I'm aware "should" is doing a lot of work there).

What is your use case for MSVC, anyway? If you're incorporating libdeflate into an application, have you considered just building a static library with gcc or clang, and using that? Or even just using the prebuilt static library that comes with every release?

I just wanted to be able to get some idea of the performance of libdeflate. I recently moved the https://github.com/dolphin-emu/dolphin project to zlib-ng, and then realized libdeflate might be even better, so wanted to give it a shot. I also thought it might be a good stress test of sorts for msvc and be interesting to compare against the other compilers.

In any case, I was mostly just curious to get it to work, and if you can take some improvements from the branch it would be nice - I'm not really petitioning that you change libdeflate in ways you don't want. At least it's know what is required, now.

pps83 commented 1 year ago

@ebiggers I switched to libdeflate from zlib/zlib-ng because it was better/faster for my use case. I'm shocked to find out that it doesn't support msvc, or, better to say, doesn't give full perf with ms compiler.

What is your use case for MSVC, anyway?

Many projects/companies use MS compiler exclusively. Kind of not a big secret :)

ebiggers commented 1 year ago

It is very annoying writing intrinsics code with MSVC. See all the limitations of MSVC that have been discussed above. Additionally, MSVC is closed source and only available on one, closed source operating system. So IMO, it should not be "shocking" when people decide not to bother with it much...

The main blocker is probably MSVC's limitation where it requires that code compiled for different instruction sets be located in different .c files. That would require reorganizing a lot of code.

Have you considered just building a DLL with MinGW-w64, or even just using the precompiled one that I provide at https://github.com/ebiggers/libdeflate/releases, and using that from your MSVC project?

pps83 commented 1 year ago

I'll try your release (or build myself) just to see perf difference.

In general, I haven't seen issues with msvc building for multiple archs from single file. Even if you target some vanilla CPU, afaik you may still use advanced intrinsics for newer CPUs.

pps83 commented 1 year ago

Just tried to compile with clang-cl (clang that comes with VS2022). I also tried v1.14 static build from your releases.

For compression, difference wasn't big, roughly 5% improvement with your release vs build with MS compiler (not clang-cl). clang-cl game me 15% improvement in decompression vs regular ms compiler. Your release gave 20% improvement in decompression vs ms compiler. I use local fork from april 2022 when building with ms compiler, so, it might be missing improvements if there were any.

pps83 commented 1 year ago

with latest libdeflate git msvc and clang-cl are roughly identical in decompression in my test. msvc build is 2% faster in compression vs clang-cl. Mingw64 build from v.1.14 release is 5% faster in compression and almost 10% faster in decompression.

shuffle2 commented 1 year ago

That sounds…not roughly identical. Are you using my changes for msvc (you need to port my msbuild stuff to whatever you’re using)? In any case this seems more like a personal project; maybe if you (r company) wants msvc libdeflate you should maintain it yourself. Personally I’m still just using Zlib-ng

ebiggers commented 1 year ago

Visual Studio's clang-cl is better than MSVC, but when clang is operating in MSVC compatibility mode it still inherits some of the limitations of MSVC. Therefore the code that uses intrinsics is still disabled.

Note that most of the intrinsics code in libdeflate is for the checksums, which are just part of compression and decompression, and are not relevant at all if raw DEFLATE is used. So how much loss you get from MSVC (or VS clang-cl) depends on your use case(s).

Anyway, it sounds like the MinGW-w64 built static library and DLL both work with Visual Studio, so I'd encourage you to just use one of those for now.

ebiggers commented 1 year ago

Are you using my changes for msvc (you need to port my msbuild stuff to whatever you’re using)

Note that the master branch is using CMake now, so it is possible to generate a Visual Studio project using CMake (and this is already being tested by some of the CI tests).

shuffle2 commented 1 year ago

Note that the master branch is using CMake now, so it is possible to generate a Visual Studio project using CMake (and this is already being tested by some of the CI tests).

I was just pointing out that using my branch requires flags + etc that are defined in the msbuild files and not the existing libdeflate build system. It probably doesn't make sense to merge those settings into the current cmake scripts.

pps83 commented 1 year ago

That sounds…not roughly identical. Are you using my changes for msvc (you need to port my msbuild stuff to whatever you’re using)?

@shuffle2 No, I'm using latest libdeflate git. Your PR is a bit too messy, lots of commits. I think some of them were already integrated into libdeflate. Shouldn't be too hard to split into relevant separate pieces.

Note that the master branch is using CMake now, so it is possible to generate a Visual Studio project using CMake (and this is already being tested by some of the CI tests).

@ebiggers For some reason I dislike cmake. For my project I use VS exclusively. I know I can generate VS projects, but libdeflate is such a simple project that it doesn't really need a separate project and can be integrated directly into other projects. That's my build system that I use for libdeflate: https://github.com/pps83/libdeflate/commit/2af5b45041ccd8aa9f4f22c590a4a8360706a71b I simply add libdeflate.c to my utility project.

shuffle2 commented 1 year ago

lol. unsubscribe

ebiggers commented 1 year ago

I'd like to keep this issue open so that people don't open duplicate issues. I also need to investigate @pps83's observation "In general, I haven't seen issues with msvc building for multiple archs from single file. Even if you target some vanilla CPU, afaik you may still use advanced intrinsics for newer CPUs."

pps83 commented 1 year ago

I'd like to keep this issue open so that people don't open duplicate issues. I also need to investigate @pps83's observation "In general, I haven't seen issues with msvc building for multiple archs from single file. Even if you target some vanilla CPU, afaik you may still use advanced intrinsics for newer CPUs."

I made a sample project. It targets "No Enhanced Instructions (/arch:IA32)" and uses BMI1 intrinsics at the same time:

https://github.com/pps83/lzcnt-test/blob/master/lzcnt-test.cpp

So, unlike gcc, looks like ms compiler allows users to use higher level CPU instructions even if they aren't available to the target instruction set. In other words, flags like /arch:SSE, /arch:AVX2 control what compiler can generate from regular code, but doesn't limit what instrinsics can emit.

pps83 commented 1 year ago

lol. unsubscribe

Not sure why @shuffle2 got offended. Perhaps it's this my message:

Your PR is a bit too messy, lots of commits

I think that PR hardly can be accepted. @ebiggers, perhaps, doesn't want to spend a day making sure it's OK or accept it blindly. Especially, given that he thinks that msvc support isn't needed.

More over, the PR needs to be rebased with latest master. Some changes are already part of the master branch. It has unrelated changes, that shouldn't be there (like warning fixes). Some good stuff should be pushed as separate commits (stuff like ARCH_X64 for example).

pps83 commented 1 year ago

I've made a start here: https://github.com/shuffle2/libdeflate/commits/msvc

...

rebased msvc port with irrelevant changes removed and unrelated changes merged to master:

https://github.com/pps83/libdeflate/tree/master-msvc

still needs some minor work, but major issues are removed imo

ebiggers commented 1 year ago

This is basically done now, with a few exceptions. The biggest one, at least in theory, is that BMI2 code generation is not yet enabled in the decompressor. With gcc and clang, BMI2 speeds up decompression on x86_64 by 5-10%.

Unfortunately, MSVC's code generation is just bad. I tried /arch:AVX2 (the way to enable BMI2 with MSVC, since there is no dedicated BMI2 switch), but the decompression performance actually drops.

The other benefit of /arch:AVX2 should be crc32_x86_pclmul_avx(). However, MSVC doesn't actually generate the 3-operand pclmul instructions, which was most of the point.

I think it makes sense to just consider this issue resolved, and any further MSVC optimizations can be considered separately.

I still recommend gcc or clang, of course, as they have better codegen and I've done way more performance testing with them...

Thanks to @shuffle2 and @pps83 for the contributions!

ebiggers / libdeflate

port arch-specific code to msvc #219