Quuxplusone / LLVMBugzillaTest

LLVM not using cmpb for unsigned char comparison #22472

Closed Quuxplusone closed 5 years ago

Quuxplusone commented 9 years ago
Bugzilla Link PR22473
Status RESOLVED FIXED
Importance P normal
Reported by Daniel Jasper (djasper@google.com)
Reported on 2015-02-05 03:10:29 -0800
Last modified on 2019-03-08 11:18:15 -0800
Version trunk
Hardware PC All
CC andrea.dibiagio@gmail.com, benny.kra@gmail.com, chandlerc@gmail.com, grosbach@apple.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, quentin.colombet@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also PR22532, PR23155
Small reproduction:
  bool f(unsigned char* a, unsigned char b) { return a[0] == b; }

LLVM translates this to

        movzbl  (%rdi), %eax
        cmpl    %esi, %eax

GCC instead emits

        cmpb    %sil, (%rdi)

which, according to IACA, is better.
Quuxplusone commented 9 years ago

I think this is intentional fallout from r195496.

Quuxplusone commented 9 years ago
Yep. Here is what IACA says specifically:

Haswell:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - b.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 0.50 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.2    0.0  | 0.2  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.2  | 0.2  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | movzx eax, byte ptr [rdi]
|   1    | 0.2       | 0.2 |           |           |     | 0.2 | 0.2 |     |    | cmp eax, esi
Total Num Of Uops: 2

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - a.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 0.50 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.2    0.0  | 0.2  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.2  | 0.2  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   2^   | 0.2       | 0.2 | 0.5   0.5 | 0.5   0.5 |     | 0.2 | 0.2 |     | CP | cmp byte ptr [rdi], sil
Total Num Of Uops: 2

So: the same number of micro-ops, but one fewer instruction, and we're no longer frontend-stalled.

Ivy Bridge and Sandy Bridge show the same.

The interesting thing is that while these have the same throughput, the two-instruction form puts more pressure on the frontend, and we are seeing lots of frontend stalls in our benchmarking.

Back when Jim and I looked into this, we were trying to kill mystery stalls
around cmpb. I don't think we ever adequately explained what we were seeing in
bzip2, and I think Quentin made a later change to CodeGenPrep that more likely
addressed the underlying issue.

So I think we should probably just back out r195496. While there was *some* problem with bzip2, this seemed even then to be an overly heavy hammer, and now we have more concrete evidence that it's hurting other cases. I'll point the review thread here, loop everyone in, and see if there is agreement.
Quuxplusone commented 9 years ago

Works for me. As long as we can do so in a way that doesn't regress bzip2 (which no one is AFAIK proposing), I have no issues solving this differently.

Quuxplusone commented 9 years ago
I see that there was good discussion about the trade-offs in the commit thread
of r195496:
http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/167221

Summary:
Based on Agner / Intel / AMD, we'd expect that (some large subset of?) P6-based
Intel cores will be hurt by using byte comparisons. On the other hand, most
recent AMD cores should be helped slightly by the reduction in instructions and
code size. This is because Intel tracks subregisters for renaming while AMD
tracks whole registers.
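A rough way to see why subregister tracking matters (a toy model of mine, not from the thread): a byte-register write merges into the old register value, so its result depends on what was there before, while a zero-extending load does not.

```c
#include <stdint.h>

/* Toy model: a 32-bit register as a uint32_t value. */

/* movb into %al: the result keeps the old high 24 bits, so the
 * instruction has a data dependence on the previous %eax value.
 * A core that renames subregisters (Intel P6 family) can break that
 * dependence; a core that tracks whole registers (most AMD cores)
 * cannot, which is why byte ops help AMD but can hurt P6-based Intel. */
static uint32_t write_al(uint32_t old_eax, uint8_t b) {
    return (old_eax & 0xFFFFFF00u) | b;
}

/* movzbl into %eax: the old value is irrelevant; no dependence. */
static uint32_t write_movzbl(uint32_t old_eax, uint8_t b) {
    (void)old_eax;
    return (uint32_t)b;
}
```

For example, write_al(0xAABBCCDD, 0x11) yields 0xAABBCC11, carrying the old high bytes along, while write_movzbl yields 0x00000011 regardless of the old value.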

I wish I had access to SPEC, but I don't. I wish the LLVM test-suite were actually usable for perf testing, but it's not (despite some recent efforts toward that goal). So ad hoc perf testing ensued... for reproducibility's sake:

1. I grabbed the bzip2 1.0.6 source from:
http://www.bzip.org/downloads.html

2. I happened to have the Mac gcc 4.9 tarball around from:
http://hpc.sourceforge.net/

3. I gunzip'd the gcc tarball to use as my test file for compression.

4. I built bzip2 using its default settings (-O2 as of this writing) with
relatively recent clang trunk built from -r227025.

5. I removed the ext'ifying code that was added with r195496 and compared bzip2
perf changes on Intel Sandy Bridge and AMD Jaguar (btver2).

This is the relative perf for byte-size comparisons with today's (ext'ing)
codegen as the baseline using 8 runs for each case:

Intel SNB:  -4.09% (0.44% relative std dev)
AMD Jaguar: -0.26% (but with 1.21% relative std dev...so in the noise)
Quuxplusone commented 9 years ago

Based on profiles, it looks like bzip2 is just about the perfect macro test case for testing this codegen change. It has byte comparisons all over the hotspots in BZ2_compressBlock, mainGTU, and BZ2_blocksort.

FWIW, gcc 4.9 is generating the byte comparisons by default for this code.

Quuxplusone commented 9 years ago

Forgot one testing tidbit for the record: I didn't alter bzip2 to avoid any I/O (as I assume the SPEC test harness would) or in any other way, and I didn't add any fancy high-res timers. Just used this:

$ time ./bzip2 < gcc-4.9-bin.tar > /dev/null

Quuxplusone commented 9 years ago

That's very much in line with the results I've seen. On my Ivy Bridge iMac, it was about 5% or so. I ran a test very similar to yours (I used the x86_64 shared cache file as test input).

The SPEC harness results, both 2k and 2k6, were always also in line with what I observed looking at the vanilla 1.0.6 package, so I don't think you're missing any interesting datapoints on that front.

Quuxplusone commented 9 years ago

Oh, and FWIW, when I was doing the original testing, a hand-modified assembly version just for mainGTU() got almost all of the gains.

Quuxplusone commented 9 years ago

As the discussion got somewhat fragmented between this bug and the review thread for Jim's patch, I just wanted to relay what I think the correct path forward is.

We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.
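One way to state the soundness condition of that rewrite ("the high bytes aren't used") is with a toy model of registers as integers (a sketch of mine, not the actual pass): movb and movzbl agree on the low byte always, but on the whole register only when the old high bytes are dead.

```c
#include <stdint.h>

/* Toy model: a 32-bit register as a uint32_t. */

/* movb r8: writes the low byte, preserving the high 24 bits. */
static uint32_t movb_r8(uint32_t dst, uint8_t src) {
    return (dst & 0xFFFFFF00u) | src;
}

/* movzbl r32: writes the low byte and zeroes the high 24 bits. */
static uint32_t movzbl_r32(uint32_t dst, uint8_t src) {
    (void)dst;  /* old value is irrelevant */
    return (uint32_t)src;
}
```

The two always agree on bits 0-7 and differ above that whenever the old high bits were live and nonzero, which is exactly why the proposed pass must prove the high bytes unused before substituting. (If memory serves, the pass that eventually implemented this idea is X86FixupBWInsts, in the 3.9 timeframe; treat that attribution as my recollection, not something stated in this thread.)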

Sadly, I don't have time to look at this right now, but I wanted the steps that I at least had in mind documented somewhere.

-Chandler

Quuxplusone commented 9 years ago

Thanks, Jim and Chandler. Good to know that I can repro what you're seeing without resorting to SPEC. :)

I filed bug 22532 before I saw the last comment here. I think that's a separate but related issue to the one discussed here; I didn't dig into how the movz in that case got generated.

Quuxplusone commented 5 years ago

This has been fixed since 3.9:

https://godbolt.org/z/mG2uo2

I've added a test case at rL355712