Closed Quuxplusone closed 5 years ago
Bugzilla Link | PR22473 |
Status | RESOLVED FIXED |
Importance | P normal |
Reported by | Daniel Jasper (djasper@google.com) |
Reported on | 2015-02-05 03:10:29 -0800 |
Last modified on | 2019-03-08 11:18:15 -0800 |
Version | trunk |
Hardware | PC All |
CC | andrea.dibiagio@gmail.com, benny.kra@gmail.com, chandlerc@gmail.com, grosbach@apple.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, mkuper@google.com, quentin.colombet@gmail.com, spatel+llvm@rotateright.com |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also | PR22532, PR23155 |
I think this is intentional fallout from r195496.
Yep. Here is what IACA says specifically:
Haswell:
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - b.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 0.50 Cycles       Throughput Bottleneck: FrontEnd, PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------------
|  Port  |  0  -  DV |  1   |  2  -  D  |  3  -  D  |  4   |  5   |  6   |  7   |
-------------------------------------------------------------------------------
| Cycles | 0.2   0.0 | 0.2  | 0.5   0.5 | 0.5   0.5 | 0.0  | 0.2  | 0.2  | 0.0  |
-------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |          Ports pressure in cycles           |    |
|  Uops  | 0 - DV | 1   | 2 - D     | 3 - D     | 4 | 5   | 6   | 7 |    |
-------------------------------------------------------------------------------
|   1    |        |     | 0.5   0.5 | 0.5   0.5 |   |     |     |   | CP | movzx eax, byte ptr [rdi]
|   1    | 0.2    | 0.2 |           |           |   | 0.2 | 0.2 |   |    | cmp eax, esi
Total Num Of Uops: 2
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - a.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 0.50 Cycles       Throughput Bottleneck: PORT2_AGU, Port2_DATA, PORT3_AGU, Port3_DATA

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------------
|  Port  |  0  -  DV |  1   |  2  -  D  |  3  -  D  |  4   |  5   |  6   |  7   |
-------------------------------------------------------------------------------
| Cycles | 0.2   0.0 | 0.2  | 0.5   0.5 | 0.5   0.5 | 0.0  | 0.2  | 0.2  | 0.0  |
-------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |          Ports pressure in cycles           |    |
|  Uops  | 0 - DV | 1   | 2 - D     | 3 - D     | 4 | 5   | 6   | 7 |    |
-------------------------------------------------------------------------------
|   2^   | 0.2    | 0.2 | 0.5   0.5 | 0.5   0.5 |   | 0.2 | 0.2 |   | CP | cmp byte ptr [rdi], sil
Total Num Of Uops: 2
So, same number of micro-ops, but one fewer instruction, and we're no longer frontend-stalled.
Ivy Bridge and Sandy Bridge show the same.
The interesting thing is that while these have the same throughput, the two-instruction form puts more pressure on the frontend, and we are seeing lots of frontend stalls in our benchmarking.
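For reference, both sequences come from an ordinary byte-sized comparison in the source. A minimal sketch of the shape of that code (my reconstruction; the exact test case behind a.o/b.o isn't attached to this bug):

/* A minimal sketch, assuming the register operand (c) already holds a
 * zero-extended byte, as it would in bzip2's hot comparison loops where both
 * values originate from byte loads. */
static int byte_eq(const unsigned char *p, unsigned char c) {
    /* The two lowerings compared above:
     *   movzx eax, byte ptr [rdi]  ;  cmp eax, esi     (r195496-style widening)
     *   cmp   byte ptr [rdi], sil                      (single byte-sized cmp)
     */
    return *p == c;
}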
Back when Jim and I looked into this, we were trying to kill mystery stalls
around cmpb. I don't think we ever adequately explained what we were seeing in
bzip2, and I think Quentin made a later change to CodeGenPrep that more likely
addressed the underlying issue.
So I think we should probably just back out r195496. While there was *some*
problem with bzip2, this seemed even then to be an overly heavy hammer, and now
we have some more concrete evidence that it's hurting other cases. I'll point
the review thread at this bug, loop everyone in, and see if there is agreement.
Works for me. As long as we can do so in a way that doesn't regress bzip2 (which no one is AFAIK proposing), I have no issues solving this differently.
I see that there was good discussion about the trade-offs in the commit thread
of r195496:
http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/167221
Summary:
Based on Agner / Intel / AMD, we'd expect that (some large subset of?) P6-based
Intel cores will be hurt by using byte comparisons. On the other hand, most
recent AMD cores should be helped slightly by the reduction in instructions and
code size. This is because Intel tracks subregisters for renaming while AMD
tracks whole registers.
I wish I had access to SPEC, but I don't. I wish the LLVM test-suite were actually
usable for perf testing, but it's not (despite some recent efforts towards that
goal). So ad hoc perf testing ensued... for reproducibility's sake:
1. I grabbed the bzip2 1.06 source from:
http://www.bzip.org/downloads.html
2. I happened to have the Mac gcc 4.9 tarball around from:
http://hpc.sourceforge.net/
3. I gunzip'd the gcc tarball to use as my test file for compression.
4. I built bzip2 using its default settings (-O2 as of this writing) with a
relatively recent clang trunk built at r227025.
5. I removed the ext'ifying code that was added with r195496 and compared bzip2
perf changes on Intel Sandy Bridge and AMD Jaguar (btver2).
This is the relative perf change for byte-size comparisons, with today's (ext'ing)
codegen as the baseline, using 8 runs for each case:
Intel SNB: -4.09% (0.44% relative std dev)
AMD Jaguar: -0.26% (but with 1.21% relative std dev...so in the noise)
Based on profiles, it looks like bzip2 is just about the perfect macro test case for testing this codegen change. It has byte comparisons all over the hotspots in BZ2_compressBlock, mainGTU, and BZ2_blocksort.
FWIW, gcc 4.9 is generating the byte comparisons by default for this code.
Forgot one testing tidbit for the record: I didn't alter bzip2 to avoid any I/O (as I assume the SPEC test harness would) or in any other way, and I didn't add any fancy high-res timers. Just used this:
$ time ./bzip2 < gcc-4.9-bin.tar > /dev/null
That's very much in line with the results I've seen. On my Ivy Bridge iMac, it was about 5% or so. I ran a test very similar to yours (I used the x86_64 shared cache file as test input).
The SPEC harness results, both 2k and 2k6, were also always in line with what I observed looking at the vanilla 1.0.6 package, so I don't think you're missing any interesting data points on that front.
Oh, and FWIW, when I was doing the original testing, a hand-modified assembly version just for mainGTU() got almost all of the gains.
As the discussion got somewhat fragmented between this bug and the review thread for Jim's patch, I just wanted to relay what I think the correct path forward is.
We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably do some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.
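To make that first step concrete, here is a rough sketch (an illustration only, not code from any patch under review) of a byte load whose high bits are never observed:

/* Only the low 8 bits of the loaded value are ever used, so the load can be
 * emitted either as
 *   movb   (%rdi), %al     ; partial write: %eax's stale upper bits survive
 * or as
 *   movzbl (%rdi), %eax    ; full-width write: no tie to the old %eax value
 * The proposed pass would prefer the movzbl form whenever the high bytes are
 * provably dead, while keeping byte-sized compares/arithmetic elsewhere. */
static unsigned char load_and_add(const unsigned char *p, unsigned char x) {
    unsigned char b = *p;             /* byte load: movb vs. movzbl */
    return (unsigned char)(b + x);    /* only the low byte of the sum matters */
}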
Sadly, I don't have time to look at this right now, but I wanted the steps that I at least had in mind documented somewhere.
-Chandler
Thanks, Jim and Chandler. Good to know that I can repro what you're seeing without resorting to SPEC. :)
I filed bug 22532 before I saw the last comment here. I think that's a separate but related issue to what we have here? I didn't dig to see how the movz in that case got generated.