Improve generated x86 code for AVX targets

Lastique commented 1 year ago

Prefer vmovdqu to vlddqu on CPUs supporting AVX. vlddqu has one extra cycle latency on Skylake and later Intel CPUs and is not merged to the following instructions as a memory operand, which makes the code slightly larger. Legacy SSE3 lddqu is still preferred because it is faster on Prescott and the same as movdqu on AMD CPUs. It also doesn't affect code size because movdqu cannot be converted to a memory operand as memory operands are required to be aligned in SSE.

Closes https://github.com/boostorg/uuid/issues/137.

Also, re-format the test code for MSVC bug 981648, no functional changes.

Lastique commented 1 year ago

@pdimov Peter, MSVC-14.0 fails with ICE in this CI run in is_contiguous_range implementation. Could you take a look please?

pdimov commented 1 year ago

I'll look into it.

codecov[bot] commented 1 year ago

Codecov Report

Merging #138 (86b1ae5) into develop (9df4da9) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 86b1ae5 differs from pull request most recent head c7e0a70. Consider uploading reports for the commit c7e0a70 to get more accurate results

Additional details and impacted files

[![Impacted file tree graph](https://codecov.io/gh/boostorg/uuid/pull/138/graphs/tree.svg?width=650&height=150&src=pr&token=6falIr5npV&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg)](https://codecov.io/gh/boostorg/uuid/pull/138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg) ```diff @@ Coverage Diff @@ ## develop #138 +/- ## ======================================== Coverage 83.92% 83.92% ======================================== Files 15 15 Lines 678 678 Branches 156 156 ======================================== Hits 569 569 Misses 18 18 Partials 91 91 ``` | [Impacted Files](https://codecov.io/gh/boostorg/uuid/pull/138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg) | Coverage Δ | | |---|---|---| | [include/boost/uuid/detail/uuid\_x86.ipp](https://codecov.io/gh/boostorg/uuid/pull/138?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg#diff-aW5jbHVkZS9ib29zdC91dWlkL2RldGFpbC91dWlkX3g4Ni5pcHA=) | `100.00% <ø> (ø)` | | ------ [Continue to review full report in Codecov by Sentry](https://codecov.io/gh/boostorg/uuid/pull/138?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://codecov.io/gh/boostorg/uuid/pull/138?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg). Last update [9df4da9...c7e0a70](https://codecov.io/gh/boostorg/uuid/pull/138?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=boostorg).

Lastique commented 1 year ago

Ping @jeking3.

Lastique commented 1 year ago

Rebased and updated the code to prefer movdqu starting with SSE4.1. It doesn't matter on CPUs not supporting AVX, but it is possible that SSE4.1 code will run on a modern CPU that does prefer movdqu to lddqu.

pdimov commented 1 year ago

Why not just use movdqu everywhere?

Lastique commented 1 year ago

Because lddqu is better on NetBurst CPUs. And there's also a workaround for MSVC codegen bug.

pdimov commented 1 year ago

NetBurst CPUs

Really? Have you seen one recently (as in, in the last decade)? :-)

I had a Thinkpad with a P4D I gave away, its battery lasted about half an hour.

Lastique commented 1 year ago

Well, I'm fine with dropping support for NetBurst CPUs, but as I said, there's also MSVC bug, so the code wouldn't get much simpler anyway.

pdimov commented 1 year ago

It's still a simplification even if we still pretend to care about VS 2008. Not many parts of Boost still work with it, because it's not tested. (I still test msvc-9.0 on Appveyor in old libraries as a matter of habit but that's more of an exception and is not going to last.)

Lastique commented 1 year ago

The bug is only fixed in VS2015; VS2013 and before are affected.

Anyway, I've pushed a commit to use movdqu universally, except for the MSVC workaround.

pdimov commented 1 year ago

Yes but without the VS2008 path, it's a single ifdef over _ReadWriteBarrier.

Lastique commented 1 year ago

Ok, done. I don't want to remove the VS2008 workaround yet.

pdimov commented 1 year ago

Here's some discussion on lddqu vs movdqu, for reference: https://community.intel.com/t5/Intel-ISA-Extensions/LDDQU-vs-MOVDQU-guidelines/m-p/1178965

Lastique commented 1 year ago

Yeah, I forgot about that discussion, thanks for digging it up. Although it didn't result in a definitive answer - Intel reps didn't comment. In the end I was left with the opinion I had when I started it - use lddqu up until AVX, use vmovdqu with AVX and later, as before AVX lddqu is not worse than movdqu and is sometimes better.

Lastique commented 1 year ago

@pdimov Since apparently Boost.UUID is no longer actively maintained again, maybe you could merge this? The Codecov failure does not seem to be caused by this.

pdimov commented 1 year ago

I was going to wait until after the release, but I can merge it now if you insist.

Lastique commented 1 year ago

No, after the release is fine. Thanks.

Lastique commented 1 year ago

A gentle reminder about this PR.

boostorg / uuid

Improve generated x86 code for AVX targets #138

Codecov Report