Shifting for mulli probably will require an extsw at the end to extend the sign.
Also, this is handy:
http://cellperformance.beyond3d.com/articles/2006/04/avoiding-microcoded-instructions-on-the-ppu.html
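For a power-of-two immediate, the strength-reduced form would presumably be a single shift (register choices here are purely illustrative):
// mulli r3, r4, 8 becomes
slwi r3, r4, 3 // i.e. rlwinm r3, r4, 3, 0, 28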
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 5:32
In fact, look at the conds in general in branch[MathOP]32s. If they are
overflow only, do NOT use the _rc variant as it will be generally slower on G5
(many are microcoded).
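For an Overflow-only condition, that means dropping the record form, e.g. (illustrative registers):
addo. r3, r4, r5 // _rc variant: sets CR0 as well as XER[OV] -- the slower case described above
addo r3, r4, r5 // sets XER[OV] only, which is all the overflow test needs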
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 5:33
Thinking about it, branchMul32 needs to have the move() for mullwo if the cond
is Overflow, so it can't use x_sr_mulli. :(
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 5:36
- We can actually preload the constants for float conversion/truncation into
callee-saved FPRs (of which we have even more going unused), and do it in the
Trampoline rather than in each stack frame.
-- but there may be one or two extra-common "immediates" that we could cache in
GPRs? (hard to say what they would be, though...)
- Can we use the FI (fraction inexact) bit (FPSCR[38]) to avoid the
song-and-dance in the double-to-int conversion?
- I think we can move 4 of the 5 instructions in the mcrxr emulation to
_before_ the lis/ori/mtctr in m_branch, so that other branches don't all need 4
nops on G5? It'll make the code uglier, but not too much so.
- I wonder how often BaseIndex operands have a Scale of TimesOne. We should
probably elide the shift if it's common enough to make up for the compilation
time hit (see the sketch after this list).
- How much do unconditional branches cost? Is it worth optimizing branches by
trying to convert the lis/ori/mtctr to a bc() (then possibly the last
instruction of the mcrxr emulation) and then a b() over the rest of the branch
block?
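Regarding the TimesOne scale above, the elision would just drop the scale shift from the indexed load, e.g. (tempRegister and the operand names are only placeholders):
// Scale = TimesFour:
slwi tempRegister, rIndex, 2
lwzx rDest, rBase, tempRegister
// Scale = TimesOne -- the shift can simply be skipped:
lwzx rDest, rBase, rIndex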
Original comment by magef...@gmail.com
on 19 Dec 2011 at 11:42
Figuring out common immediates requires a degree of profiling. That said, zero
is an obvious one; there is no hardcoded register 0 in the PPC ISA like there
is in SPARC or MIPS. So that might work.
Testing bits in FPSCR is absolutely insane. You have to somehow get the bit you
want into a CR for branching and it's not easy. That was my first instinct and
the code I wrote was such a mess that I went with the current approach. If you
can think of an easy way to do it, though, I'm all ears.
Jumps must be fixed length or patching will fail, even unconditional ones. Any
change you make to any type of jump must be made for all, conditional or
otherwise. This is the simplest way I came up with, but write out what you
suggest as a replacement. I'm not sure I quite understand your last
suggestion, though.
BaseIndex is almost never TimesOne.
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 11:54
- Zero (and 1, and -1) are all rather uninteresting: it's unlikely that a
register-to-register move is any faster than an addi rN,r0,#immediate.
- What's wrong with using mcrfs
(http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.aixassem%2Fdoc%2Falangref%2Fmffs.htm)
to take the FI bit from the FPSCR and drop it into a CR? (Rough sketch at the
end of this comment.)
- Here's my proposal for G5 jumps:
// the 1st four instructions of the emulated mcrxr:
x_mfxer
rlwinm
mtcrf
rlwinm
// jumps that don't need mcrxr _start_ here:
lis
ori
mtctr
x_mtxer // or x_nop if no mcrxr or the 1st branch in m_fbranch or the mcrxr on <=G4
bcctr
JmpSrc:
So now PPC_JUMP_BACKWIND is only 20 even on G5. Four of the mcrxr emulation
instructions no longer need to be thought of as part of the jump.
- My last proposal was
lis
ori
mtctr
x_nop or mcrxr or branch
bcctr
gets patched if possible to
bc (or mcrxr if needed)
j 1f (or bc if we had a mcrxr or need two conditional branches as in m_fbranch)
j 1f (or could leave as mtctr if previous instruction is j 1f)
x_nop or mcrxr or branch
bcctr
1f:
Although
bc (or mcrxr)
x_nop (or bc)
x_nop
x_nop
x_nop
might be faster, depending on how much unconditional jumps cost. Basically this is
the same type of optimization that MIPS does.
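(Going back to the mcrfs bullet above, a rough sketch -- with the caveat that the exact FPSCR field index needs double-checking against the architecture book: in the 32-bit numbering FR/FI are bits 13-14, i.e. field 3, which would put FI in the EQ slot of the target CR field.)
mcrfs cr7, 3 // copy FPSCR field 3 into cr7
beq cr7, conversion_was_inexact // taken when FI was set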
Original comment by magef...@gmail.com
on 20 Dec 2011 at 12:48
Your jumps idea doesn't account for the case that jumps exist in an unpatched
"skip state" first, unless I'm missing something. It is entirely possible, and
does happen, that a jump will be emitted that remains unpatched for later. We
do know the condition being tested for, so we can decide to emit the mcrxr
equivalents or not, but we're still executing it.
Also, submit patches :P but not yet because I need to get type inference
working before we can do anything else.
I seem to recall I hit some glitch testing FI that way, but I don't have the
code around anymore, so I'll have to look at that again.
Original comment by classi...@floodgap.com
on 20 Dec 2011 at 1:16
(also, I couldn't tell: where is the second branch in fbranch coming from?)
Original comment by classi...@floodgap.com
on 20 Dec 2011 at 1:18
I'll certainly try to submit my ideas as patches, though since I only have a G4
I can't do much G5-specific performance testing... ^_-
I don't see why the skip state would be a problem: any code following the jump
has to tolerate either the mcrxr being executed or not, so partially emulating
it (which has no side effects that fully executing it doesn't) shouldn't be a
problem. The lis/ori can still start as a nop/x_skip_this_branch. (Though it's
probably faster if the x_skip_this_branch is first so that it even skips the
nop.)
The first branch in fbranch is the ordered/unordered test, which occupies the
instruction slot before the bcctr in the normal branch stanza; then there is
the branch for the actual numerical relation. Is that what you were asking?
Original comment by magef...@gmail.com
on 20 Dec 2011 at 1:49
One possible use for a zero register is that it makes testing XER[CA] easy, via
adde. We could strength-reduce overflow-checking addition and subtraction to
carrying math:
addo
mcrxr
bc[overflow]
would become
addc
adde tempRegister,zeroRegister,zeroRegister
bc[nonzero]
which strength-reduces the mcrxr to an adde: with the zero register in both
operands, adde computes 0 + 0 + CA, leaving the carry bit itself in
tempRegister for the branch to test. Unfortunately, multiplication still
requires checking XER[OV].
Original comment by magef...@gmail.com
on 20 Dec 2011 at 3:02
An easy one: functions like add32(TrustedImm32, [Absolute]Address) currently do
the address generation twice, once in load32() and once in store32(). It should
be trivial to simplify that to only calculate the memory address once.
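A sketch of the combined sequence for the AbsoluteAddress case -- the register choices are purely illustrative, and the immediate is assumed to fit in a signed halfword:
lis r12, hiword(address)
ori r12, r12, loword(address)
lwz r11, 0(r12) // load
addi r11, r11, imm // add the immediate
stw r11, 0(r12) // store back through the same, already-computed address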
Original comment by magef...@gmail.com
on 25 Dec 2011 at 11:14
So now that type inference looks like a win, let's break down the ideas.
a) Ideas that don't require modifying the VMFrame or significantly refactoring
code (easy wins)
b) Ideas that require refactoring code, such as branches, etc. (not so easy)
c) Ideas that require modifying the VMFrame (scary: modifying the frame layout
can have unexpected side effects and the trampoline is fragile if offsets
change)
d) b+c (scary and not so easy)
a: simplify *(Imm32, Address) as suggested by Ben in comment 11;
strength-reduced x_sr_mulli (modulo overflow which must use mullwo); don't use
*_rc if the condition is Overflow; inline-able sqrt equivalent for G3/G4
(assuming we can do this with temp FP registers and don't need to spill them to
the VMFrame -- we could spill at most one to the temporary FP space at r1+16 in
the linkage area). We are already using truncate FP instructions in JM+TI, so
that is already done. Others in this category?
b: move the mcrxr_equiv instructions out of the branch (this still needs to keep
branch sizes the same); look again at FPSCR for inexact or failed floating point
conversion (trying to find my notes on what went wrong with this before)
c: a zero register and other constant registers -- we have to save the zero
register somewhere, and we are already using all the GPRs that don't need to be
saved, so we would have to widen the stack frame. We might be able to have a
zero FPU register instead (a zero FP reg is just 64 zero bits), but we don't
seem to use the FPU much.
d: use zero register for branch testing
Others?
The target speed is to get the quad G5 below 1000 ms (currently with JM+TI,
1050) and my reference 1GHz 7450 below 2500ms (currently 2750). I think these
goals are definitely attainable. Once 9.0.1pre is banged on and found to be
solid, then it is time to optimize the shet out of the MacroAssembler.
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 9:21
WRT spilling FPRs: since we're inlining the code and therefore by definition do
not call any functions within it, I think we can actually spill to the red zone
(http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20)
-- which gives us plenty of space if we need it.
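A spill to the red zone would just use negative offsets from r1, e.g. (the offset is arbitrary, chosen for illustration):
stfd f1, -8(r1) // park the FPR below the stack pointer
... // inlined code that clobbers f1
lfd f1, -8(r1) // restore it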
Original comment by magef...@gmail.com
on 27 Dec 2011 at 11:11
I'll buy that (although we should mark such sections of code #ifdef
JS_CPU_PPC_OSX because Linux PPC doesn't have a red zone).
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 11:14
Should we use emergencyTempRegister in add/sub32(Imm32, Address) rather than
special-casing -- and pessimizing -- tempRegister in add/sub32(Imm32,
RegisterID)? (We can leave an ASSERT(dest != tempRegister) for safety.)
Original comment by magef...@gmail.com
on 27 Dec 2011 at 11:26
I guess we could. It's unconventional but it does eliminate some of the
irritating limitations of r0 and it doesn't have to be saved.
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 11:29
(and it does look like it's okay with Linux PPC, so it's probably ABI-safe in
general)
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 11:30
Starting point for a)
Richards: 892
DeltaBlue: 1304
Crypto: 1897
RayTrace: 482
EarleyBoyer: 857
RegExp: 198
Splay: 1153
----
Score (version 6): 799
on the quad in a debug, non-opt build with processor performance set to Highest
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 5:06
Can't simplify *o_rc instructions because the assembler sometimes tries to test
multiple conditions (not using _rc or o_ depending on condition leads to
lockups in v8).
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 8:02
Using x_sr_mulli moves score to 805-810.
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 8:46
I tried implementing comment 11. However, it's only BaseIndex that has the
nasty address computation; ImplicitAddress is reg + offset. Most of the time
this boils down to a single lwz or stw, and there is no add32/sub32 against a
BaseIndex. Or did I misunderstand which one you meant?
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 9:19
Implemented Ben's branch suggestion from comment 6. V8 score moves to 825-830. Now at:
Richards: 961
DeltaBlue: 1382
Crypto: 1909
RayTrace: 518
EarleyBoyer: 847
RegExp: 211
Splay: 1128
----
Score (version 6): 827
Some of these scores seem to have real noise in them, so I don't know how
statistically valid these are. Still, this is a ~4% improvement over base 10, and
I know there are improvements over 9.
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 10:09
Ah, I was a bit off in comment 11: it's the add/sub32(TrustedImm32,
AbsoluteAddress) functions that have the issue. (Right now, the load and store
duplicate the li32() of the pointer value.)
Here's a patch with two optimizations we didn't talk about; together they seem
to be worth ~4% on v8/regexp here. I _think_ I got the rldimi (for G5) correct,
but I can't test it. The optimization of or32 to ori+oris (rather than li32+or)
is probably also applicable to and32 and xor32; with some care regarding
signedness of the immediates it may also be applicable to add32.
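For reference, the ori+oris form amounts to this (dest and imm are placeholders), versus three instructions for li32 followed by the or itself:
ori dest, dest, loword(imm)
oris dest, dest, hiword(imm) // presumably skippable when hiword(imm) == 0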
Original comment by magef...@gmail.com
on 6 Jan 2012 at 4:49
OIC on add/sub.
I'm antsy about mixing in 64-bit instructions in what is essentially a 32-bit
environment. I think just having the G5 eat the cracked instruction for rlwimi
is safer and not a lot slower.
Otherwise I like it. Does it pass all the tests?
Original comment by classi...@floodgap.com
on 6 Jan 2012 at 5:07
I finished the G5 build and we still can't crack that 1050ms, even with all
those changes. So I'm looking at your mini optimizations.
Maybe we're just hitting a wall on an aging platform. :-/
Original comment by classi...@floodgap.com
on 6 Jan 2012 at 5:55
The patch passes at least the jit-tests; I haven't run the full js-tests suite
against it yet.
I don't think we've hit a wall yet, but the biggest barrier to performance
right now is probably branching: since ATM _all_ of our branches are
indirect, we probably blow out even the G5's branch target cache (which is
still only 32 entries, AFAICT). Fortunately, doing the MIPS-style branch
optimization ought to help the branch predictor a _lot_, since the branch
targets will then be visible in the instruction stream.
Another thing we could do is try to be more clever about static hinting.
(Should Overflow always be unlikely?)
Original comment by magef...@gmail.com
on 7 Jan 2012 at 3:04
Unfortunately I had also come to the same conclusion. That's a bigger
refactoring than I want to do right now, especially since it looks like
tracejit took a 15% hit. We can't ship that and I can't fix it, so JM+TI gets
to start in the big game for 10 ESR. But I imagine we'll just have to bite the
bullet.
I'll take your microops modulo rldimi->rlwimi in 10, though.
Original comment by classi...@floodgap.com
on 7 Jan 2012 at 4:25
Couldn't do it for and (or any logical bit test that needs Rc) because it has
to test all the bits at once. But I did add an andis_rc for !(imm.m_value &
0xffff), reworked add32/sub32 and added analogous xori/xoris for xor32. Also
took your unaligned words patch. I think that's all we want to take in 10 -- I
really need to get the beta out by next weekend.
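In other words, the single-instruction and32 cases are limited to immediates confined to one halfword, roughly (operands illustrative):
andi. dest, src, loword(imm) // usable when hiword(imm) == 0 (andi. always records to CR0)
andis. dest, src, hiword(imm) // the new andis_rc case: usable when !(imm.m_value & 0xffff)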
So what's left is reexamining testing FPSCR for FP conversions, and figuring
out a way to handle branching without an insane amount of bookwork. It's harder
than the tracejit because we don't know our destination at the time the branch
is emitted (it gets patched later).
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 2:52
(btw, it all passes Dromaeo, SunSpider and gmake check jstests, so high
confidence of functionality)
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 2:53
One idea that isn't absolutely crazy is to let only relinkJump (and maybe
linkJump) promote branches to b/bc. Then we don't have to mess with relocateJumps,
because the location of the code has already been finalized in memory.
Promoting calls doesn't seem worth it; there aren't a lot of them and I bet the
majority are to locations out of range (stubs, etc.).
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 3:02
Blah, never mind, the fact that everything can change if an IC gets hot means
we probably can't cheat on this either. This is a pain.
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 3:05
So in terms of branching, I was figuring we'd do more-or-less the same thing
that MIPS does. Here's one possible proposal:
1. when the branch is created, insert
1: b end
2: nop
3: nop
4: bcctr condition (or bcctr ordered-test, for a floating-point branch)
5: nop (or bcctr condition, for a floating-point branch)
end:
Any mcrxr or emulation is before this 5-instruction stanza.
2. Whenever a branch is linked or relinked, look at the last two instructions
to determine the branch condition(s): the second-to-last instruction is
_always_ a bcctr, and the last instruction is either a bcctr or a nop.
3. If the branch offset fits in a bc (see the worked example after step 5), patch the first three instructions to
bc condition-from-4, target
bc condition-from-5, target (if 5 is a bcctr) -OR- b end
b end
4. Else, if the branch offset fits in a b, patch the first three instructions to
bc !(condition-from-4), end
bc !(condition-from-5), end (if 5 is a bcctr) -OR- b target
b target
5. Else, patch the first three instructions to
lis r0, hiword(target)
ori r0, loword(target)
mtctr r0
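As a worked instance of step 3: a plain Equal branch whose target is within conditional-branch displacement range (±32 KB) would end up as
beq target // was: b end
b end // was: nop
b end // was: nop
beqctr // unchanged, but now unreachable
nop
end: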
As for calls, if we set the -image_base of libxul to be just above the firefox
executable, it should fit entirely within the low 64M of the address space --
which means that we can try to use absolute branches for stub and native calls,
rather than indirect calls. (And stub calls, at least, are not uncommon.)
As far as relocateJumps goes, all we would need to do is three things:
1. look at the 3rd instruction of the stanza to see what kind of stanza it is,
and then read out the branch target offset from the appropriate place
2. if the branch target for the stanza is within the now-being-relocated code
segment, ignore this entire stanza: it doesn't need relocating
3. figure out the new offset, and repatch with it
I'm going to be really busy for the next two weeks or so, but I can try to get
this working after that, if you like...
Original comment by magef...@gmail.com
on 8 Jan 2012 at 6:11
That would be great, because you seem to have a good handle on what needs to be
done for it (the MIPS code made me dizzy, glad one of us understands it :).
There's no hurry since this won't be in 10.0.0. I'd like to get it into a
10.0.1pre but that's at least three to four weeks away. I'll have changesets up
early next week and I probably won't make any other changes before release
unless another bug is discovered.
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 6:26
Ben's branching rework has landed. V8 on the G5 in reduced power improves by
45%. This is outstanding work and passes all tests in --jittests=amnp,amp (I
will be trying to get the browser to stand up next).
The last outstanding issue with performance other than G3/G4 "fsqrt" (in issue
96) is fixing FP conversion to use FPSCR if possible.
Original comment by classi...@floodgap.com
on 3 Feb 2012 at 5:45
I've got faux-sqrt working in basic form and simplified it somewhat.
I did work on branchConvertDouble etc., and now I remember: even though we
could test the bits, fctiwz. writes to CR1 and failureCases.append() wants a
standard branch, which uses CR0. Rather than do a lot of refactoring or create a
whole separate branching instruction for little gain on a relatively uncommon
operation, I'm just going to leave it. SPARC does essentially the same thing;
they just have a dedicated instruction. If profiling tells you otherwise,
please advise, but this works, at least.
Original comment by classi...@floodgap.com
on 4 Feb 2012 at 4:15
All tests pass with software square root. Been using the browser with the
branch rework for the last couple of hours. No crashes, super stable, noticeably
smoother. This is great.
Original comment by classi...@floodgap.com
on 4 Feb 2012 at 11:23
Snapshot against mozilla-release
Original comment by classi...@floodgap.com
on 6 Feb 2012 at 12:52
Mozilla may chemspill on 10 due to an issue that likely affects us as well
(M724284 which is sec-locked). I can't find any serious issues with this or
with the follow on patches, so I am considering shipping them with the
chemspill. I do want to get the G5 back up by "merging" the inline and constant
pool trampoline branching versions, but this can be shipped for the G5 (it's
still better than 10.0.0, just not as good as we know it can get). If there is
objection, speak now; these are atomic patches. Everything passes on the G5 and
G4 test rigs.
Original comment by classi...@floodgap.com
on 7 Feb 2012 at 8:43
I say go for it: I don't expect any difficulty making the trampoline code an
#ifdef option, but I'm not likely to get it done before the weekend -- and I,
for one, will be very happy to get a full browser with the performance
improvements!
Original comment by magef...@gmail.com
on 7 Feb 2012 at 10:12
I think I don't quite get the point.
Since the ESR isn't supposed to receive new features, its changeset has to be
maintained separately from its first release onward.
Now, TFF 10 hasn't been officially declared the ESR (TFF 10 is just said to be
equal to FF 10) -- so you could still release another beta (i.e. TFF 10 ESR
beta 1) and, if that proves stable for users, declare it the official TFF 10
ESR. After all, there's more time now for things like that.
Original comment by Tobias.N...@gmail.com
on 8 Feb 2012 at 6:58
After thinking it over, I think we're going to hold the changes for a beta.
Unlike 9, where we shipped methodjit in the RC but had it turned off, I can't
disable the changes here (they're part of methodjit). We have high confidence
that they will work based on testing, but it's a big change, and this is a
high-priority security release that we want people to adopt. I will ship the
issue 122 and issue 129 fixes, though, because those are very low risk and
Mozilla has already accepted them upstream.
So I'm going to build the chemspill (it has already gone to build) without
soft-sqrt or branch rework, then look at getting the merged version done (if
Ben hasn't done it already) and after that issue a beta for this.
Original comment by classi...@floodgap.com
on 8 Feb 2012 at 2:19
Mozilla is chemspilling for 10.0.2. I'm pretty confident in this now that
people have banged on it some and it passes amp,amnp,amdp,amndp in jit-test, so
we're going to ship it there. I wanted to get a fix in for issue 132 at the
same time but the Mozilla working patch actually really drags down paints, so
we're going to skip that.
Original comment by classi...@floodgap.com
on 17 Feb 2012 at 3:02
Shipped.
Ben and Tobias, anything further to add to this? I think it's time to push this
upstream to Mozilla.
Original comment by classi...@floodgap.com
on 17 Feb 2012 at 6:21
Here's one more patch, which is worth ~8% on sunspider/regexp-dna on my 7450.
It modifies YarrJIT.cpp to add a YARR_AVOID_CHARACTER_INDEX option and adds
support for it to MacroAssemblerPPC.h. This option uses one additional register
to directly track the byte position of the current character, and so enables
the use of Address-based rather than BaseIndex-based addressing in the
generated code. (Other big-RISC chips like MIPS and SPARC may also be
interested in this optimization...)
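The difference in the emitted character loads is roughly the following -- assuming 16-bit characters, and with register names that are only placeholders for whatever the option actually allocates:
// BaseIndex: scale the character index on every access
slwi tempRegister, characterIndex, 1
lhzx character, inputStart, tempRegister
// Address: with the byte position kept in a register, a plain displacement load suffices
lhz character, 0(currentPosition)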
Looking at the generated code with Shark shows that there are still some
pipeline stalls from the 3-cycle latency of loads on the G4, but the generated
code is at least substantially more compact. (It's still branchy as all hell,
but that's a _big_ project to try to change even if using CR logic is the
PPC-idiomatic way.)
As far as this bug goes, though, I think we're done. Congrats all around, it
turned out really well.
Original comment by magef...@gmail.com
on 20 Feb 2012 at 5:47
Just in case Mozilla doesn't like/want the YarrJIT portion, can that patch
stand on its own without it (modulo the performance improvement)? It looks like
it can. But I like the concept.
Original comment by classi...@floodgap.com
on 20 Feb 2012 at 5:51
I wouldn't even try to put the YarrJIT part in the initial submission to
Mozilla: it should be a separate bz# of its own. But if you're asking whether
things still compile and work without the changes to YarrJIT.cpp, the answer is
that they do.
Original comment by magef...@gmail.com
on 21 Feb 2012 at 2:42
Good. I'll roll it into 10.0.3pre in the meantime.
The server switchover went so smoothly (and required much less time) that I'm
thinking of doing a formal 11 unstable beta, too. #masochist
Original comment by classi...@floodgap.com
on 21 Feb 2012 at 2:44
Apple just finished implementing a totally new macro assembler for
JavaScriptCore:
http://trac.webkit.org/changeset/108444
If that really proves to boost performance, Mozilla might be interested in it,
too...
Original comment by Tobias.N...@gmail.com
on 22 Feb 2012 at 7:06
Eh, they've got IonMonkey coming along. It's still pretty embryonic but it's
promising. I hope IonMonkey lands after 17 though so that we don't have to
worry about it ourselves.
Anyway, the patch drops regex from 34 to 30, so that's in line with your 8%
estimate. I'll take any wins we can get. :) In for 10.0.3pre.
Original comment by classi...@floodgap.com
on 23 Feb 2012 at 3:28
While prepping the code to ship to Mozilla, I noticed I didn't finish the G5
work on all of the FPR<->GPR routines. This is just a matter of splitting
dispatch groups with nops, so I'll do this for .3RC.
Original comment by classi...@floodgap.com
on 27 Feb 2012 at 4:42
Original issue reported on code.google.com by
classi...@floodgap.com
on 19 Dec 2011 at 5:29