Shifting for mulli probably will require an extsw at the end to extend the sign.
Also, this is handy:
http://cellperformance.beyond3d.com/articles/2006/04/avoiding-microcoded-instructions-on-the-ppu.html
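For a power-of-two immediate, the strength-reduced form would presumably be a single shift (register choices here are purely illustrative):
// mulli r3, r4, 8 becomes
slwi r3, r4, 3 // i.e. rlwinm r3, r4, 3, 0, 28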
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 5:32
In fact, look at the conds in general in branch[MathOP]32s. If they are
overflow only, do NOT use the _rc variant as it will be generally slower on G5
(many are microcoded).
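For an Overflow-only condition, that means dropping the record form, e.g. (illustrative registers):
addo. r3, r4, r5 // _rc variant: sets CR0 as well as XER[OV] -- the slower case described above
addo r3, r4, r5 // sets XER[OV] only, which is all the overflow test needs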
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 5:33
Thinking about it, branchMul32 needs to have the move() for mullwo if the cond
is Overflow, so it can't use x_sr_mulli. :(
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 5:36
- We can actually preload the constants for float conversion/truncation into
callee-saved FPRs (of which we have even more going unused), and do it in the
Trampoline rather than in each stack frame.
-- but there may be one or two extra-common "immediates" that we could cache in
GPRs? (hard to say what they would be, though...)
- Can we use the FI (fraction inexact) bit (FPSCR[38]) to avoid the
song-and-dance in the double-to-int conversion?
- I think we can move 4 of the 5 instructions in the mcrxr emulation to
_before_ the lis/ori/mtctr in m_branch, so that other branches don't all need 4
nops on G5? It'll make the code uglier, but not too much so.
- I wonder how often BaseIndex operands have a Scale of TimesOne. We should
probably elide the shift if it's common enough to make up for the compilation
time hit (see the sketch after this list).
- How much do unconditional branches cost? Is it worth optimizing branches by
trying to convert the lis/ori/mtctr to a bc() (then possibly the last
instruction of the mcrxr emulation) and then a b() over the rest of the branch
block?
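Regarding the TimesOne scale above, the elision would just drop the scale shift from the indexed load, e.g. (tempRegister and the operand names are only placeholders):
// Scale = TimesFour:
slwi tempRegister, rIndex, 2
lwzx rDest, rBase, tempRegister
// Scale = TimesOne -- the shift can simply be skipped:
lwzx rDest, rBase, rIndex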
Original comment by magef...@gmail.com
on 19 Dec 2011 at 11:42
Figuring out common immediates requires a degree of profiling. That said, zero
is an obvious one; there is no hardcoded register 0 in the PPC ISA like there
is in SPARC or MIPS. So that might work.
Testing bits in FPSCR is absolutely insane. You have to somehow get the bit you
want into a CR for branching and it's not easy. That was my first instinct and
the code I wrote was such a mess that I went with the current approach. If you
can think of an easy way to do it, though, I'm all ears.
Jumps must be fixed length or patching will fail, even unconditional ones. Any
change you make to any type of jump must be made for all, conditional or
otherwise. This is the simplest way I came up with, but write out what you
suggest as a replacement. I'm not sure I quite understand your last
suggestion, though.
BaseIndex is almost never TimesOne.
Original comment by classi...@floodgap.com
on 19 Dec 2011 at 11:54
- Zero (and 1, and -1) are all rather uninteresting: it's unlikely that a
register-to-register move is any faster than an addi rN,r0,#immediate.
- What's wrong with using mcrfs
(http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.aixassem%2Fdoc%2Falangref%2Fmffs.htm)
to take the FI bit from the FPSCR and drop it into a CR? (Rough sketch at the
end of this comment.)
- Here's my proposal for G5 jumps:
// the 1st four instructions of the emulated mcrxr:
x_mfxer
rlwinm
mtcrf
rlwinm
// jumps that don't need mcrxr _start_ here:
lis
ori
mtctr
x_mtxer // or x_nop if no mcrxr or the 1st branch in m_fbranch or the mcrxr on <=G4
bcctr
JmpSrc:
So now PPC_JUMP_BACKWIND is only 20 even on G5. Four of the mcrxr emulation
instructions no longer need to be thought of as part of the jump.
- My last proposal was
lis
ori
mtctr
x_nop or mcrxr or branch
bcctr
gets patched if possible to
bc (or mcrxr if needed)
j 1f (or bc if we had a mcrxr or need two conditional branches as in m_fbranch)
j 1f (or could leave as mtctr if previous instruction is j 1f)
x_nop or mcrxr or branch
bcctr
1f:
Although
bc (or mcrxr)
x_nop (or bc)
x_nop
x_nop
x_nop
might be faster, depending on how much unconditional jumps cost. Basically this is
the same type of optimization that MIPS does.
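(Going back to the mcrfs bullet above, a rough sketch -- with the caveat that the exact FPSCR field index needs double-checking against the architecture book: in the 32-bit numbering FR/FI are bits 13-14, i.e. field 3, which would put FI in the EQ slot of the target CR field.)
mcrfs cr7, 3 // copy FPSCR field 3 into cr7
beq cr7, conversion_was_inexact // taken when FI was set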
Original comment by magef...@gmail.com
on 20 Dec 2011 at 12:48
Your jumps idea doesn't account for the case that jumps exist in an unpatched
"skip state" first, unless I'm missing something. It is entirely possible, and
does happen, that a jump will be emitted that remains unpatched for later. We
do know the condition being tested for, so we can decide to emit the mcrxr
equivalents or not, but we're still executing it.
Also, submit patches :P but not yet because I need to get type inference
working before we can do anything else.
I seem to recall I hit some glitch testing FI that way, but I don't have the
code around anymore, so I'll have to look at that again.
Original comment by classi...@floodgap.com
on 20 Dec 2011 at 1:16
(also, I couldn't tell: where is the second branch in fbranch coming from?)
Original comment by classi...@floodgap.com
on 20 Dec 2011 at 1:18
I'll certainly try to submit my ideas as patches, though since I only have a G4
I can't do much G5-specific performance testing... ^_-
I don't see why the skip state would be a problem: any code following the jump
has to tolerate either the mcrxr being executed or not, so partially emulating
it (which has no side effects that fully executing it doesn't) shouldn't be a
problem. The lis/ori can still start as a nop/x_skip_this_branch. (Though it's
probably faster if the x_skip_this_branch is first so that it even skips the
nop.)
The first branch in fbranch is the ordered/unordered test, which occupies the
instruction slot before the bcctr in the normal branch stanza; then there is
the branch for the actual numerical relation. Is that what you were asking?
Original comment by magef...@gmail.com
on 20 Dec 2011 at 1:49
One possible use for a zero register is that it makes testing XER[CA] easy, via
adde. We could strength-reduce overflow-checking addition and subtraction to
carrying math:
addo
mcrxr
bc[overflow]
would become
addc
adde tempRegister,zeroRegister,zeroRegister
bc[nonzero]
which strength-reduces the mcrxr to an adde: with the zero register in both
operands, adde computes 0 + 0 + CA, leaving the carry bit itself in
tempRegister for the branch to test. Unfortunately, multiplication still
requires checking XER[OV].
Original comment by magef...@gmail.com
on 20 Dec 2011 at 3:02
An easy one: functions like add32(TrustedImm32, [Absolute]Address) currently do
the address generation twice, once in load32() and once in store32(). It should
be trivial to simplify that to only calculate the memory address once.
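A sketch of the combined sequence for the AbsoluteAddress case -- the register choices are purely illustrative, and the immediate is assumed to fit in a signed halfword:
lis r12, hiword(address)
ori r12, r12, loword(address)
lwz r11, 0(r12) // load
addi r11, r11, imm // add the immediate
stw r11, 0(r12) // store back through the same, already-computed address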
Original comment by magef...@gmail.com
on 25 Dec 2011 at 11:14
So now that type inference looks like a win, let's break down the ideas.
a) Ideas that don't require modifying the VMFrame or significantly refactoring
code (easy wins)
b) Ideas that require refactoring code, such as branches, etc. (not so easy)
c) Ideas that require modifying the VMFrame (scary: modifying the frame layout
can have unexpected side effects and the trampoline is fragile if offsets
change)
d) b+c (scary and not so easy)
a: simplify *(Imm32, Address) as suggested by Ben in comment 11;
strength-reduced x_sr_mulli (modulo overflow which must use mullwo); don't use
*_rc if the condition is Overflow; inline-able sqrt equivalent for G3/G4
(assuming we can do this with temp FP registers and don't need to spill them to
the VMFrame -- we could spill at most one to the temporary FP space at r1+16 in
the linkage area). We are already using truncate FP instructions in JM+TI, so
that is already done. Others in this category?
b: move the mcrxr_equiv instructions out of the branch (this still needs to keep
branch sizes the same); look again at FPSCR for inexact or failed floating point
conversion (trying to find my notes on what went wrong with this before)
c: a zero register and other constant registers -- we have to save the zero
register somewhere, and we are already using all the GPRs that don't need to be
saved, so we would have to widen the stack frame. We might be able to have a
zero FPU register instead (a zero FP reg is just 64 zero bits), but we don't
seem to use the FPU much.
d: use zero register for branch testing
Others?
The target speed is to get the quad G5 below 1000 ms (currently with JM+TI,
1050) and my reference 1GHz 7450 below 2500ms (currently 2750). I think these
goals are definitely attainable. Once 9.0.1pre is banged on and found to be
solid, then it is time to optimize the shet out of the MacroAssembler.
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 9:21
WRT spilling FPRs: since we're inlining the code and therefore by definition do
not call any functions within it, I think we can actually spill to the red zone
(http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20)
-- which gives us plenty of space if we need it.
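A spill to the red zone would just use negative offsets from r1, e.g. (the offset is arbitrary, chosen for illustration):
stfd f1, -8(r1) // park the FPR below the stack pointer
... // inlined code that clobbers f1
lfd f1, -8(r1) // restore it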
Original comment by magef...@gmail.com
on 27 Dec 2011 at 11:11
I'll buy that (although we should mark such sections of code #ifdef
JS_CPU_PPC_OSX because Linux PPC doesn't have a red zone).
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 11:14
Should we use emergencyTempRegister in add/sub32(Imm32, Address) rather than
special-casing -- and pessimizing -- tempRegister in add/sub32(Imm32,
RegisterID)? (We can leave an ASSERT(dest != tempRegister) for safety.)
Original comment by magef...@gmail.com
on 27 Dec 2011 at 11:26
I guess we could. It's unconventional but it does eliminate some of the
irritating limitations of r0 and it doesn't have to be saved.
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 11:29
(and it does look like it's okay with Linux PPC, so it's probably ABI-safe in
general)
Original comment by classi...@floodgap.com
on 27 Dec 2011 at 11:30
Starting point for a)
Richards: 892
DeltaBlue: 1304
Crypto: 1897
RayTrace: 482
EarleyBoyer: 857
RegExp: 198
Splay: 1153
----
Score (version 6): 799
on the quad in a debug, non-opt build with processor performance set to Highest
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 5:06
Can't simplify *o_rc instructions because the assembler sometimes tries to test
multiple conditions (not using _rc or o_ depending on condition leads to
lockups in v8).
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 8:02
Using x_sr_mulli moves score to 805-810.
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 8:46
I tried implementing comment 11. However, it's only BaseIndex that has the
nasty address computation; ImplicitAddress is reg + offset. Most of the time
this boils down to a single lwz or stw, and there is no add32/sub32 against a
BaseIndex. Or did I misunderstand which one you meant?
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 9:19
Implemented Ben's branch suggestion from comment 6. V8 score moves to 825-830. Now at:
Richards: 961
DeltaBlue: 1382
Crypto: 1909
RayTrace: 518
EarleyBoyer: 847
RegExp: 211
Splay: 1128
----
Score (version 6): 827
Some of these scores seem to have real noise in them, so I don't know how
statistically valid these are. Still, this is a ~4% improvement over base 10, and
I know there are improvements over 9.
Original comment by classi...@floodgap.com
on 5 Jan 2012 at 10:09
Ah, I was a bit off in comment 11: it's the add/sub32(TrustedImm32,
AbsoluteAddress) functions that have the issue. (Right now, the load and store
duplicate the li32() of the pointer value.)
Here's a patch with two optimizations we didn't talk about; together they seem
to be worth ~4% on v8/regexp here. I _think_ I got the rldimi (for G5) correct,
but I can't test it. The optimization of or32 to ori+oris (rather than li32+or)
is probably also applicable to and32 and xor32; with some care regarding
signedness of the immediates it may also be applicable to add32.
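For reference, the ori+oris form amounts to this (dest and imm are placeholders), versus three instructions for li32 followed by the or itself:
ori dest, dest, loword(imm)
oris dest, dest, hiword(imm) // presumably skippable when hiword(imm) == 0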
Original comment by magef...@gmail.com
on 6 Jan 2012 at 4:49
OIC on add/sub.
I'm antsy about mixing in 64-bit instructions in what is essentially a 32-bit
environment. I think just having the G5 eat the cracked instruction for rlwimi
is safer and not a lot slower.
Otherwise I like it. Does it pass all the tests?
Original comment by classi...@floodgap.com
on 6 Jan 2012 at 5:07
I finished the G5 build and we still can't crack that 1050ms, even with all
those changes. So I'm looking at your mini optimizations.
Maybe we're just hitting a wall on an aging platform. :-/
Original comment by classi...@floodgap.com
on 6 Jan 2012 at 5:55
The patch passes at least the jit-tests; I haven't run the full js-tests suite
against it yet.
I don't think we've hit a wall yet, but the biggest barrier to performance
right now is probably branching: since ATM _all_ of our branches are
indirect, we probably blow out even the G5's branch target cache (which is
still only 32 entries, AFAICT). Fortunately, doing the MIPS-style branch
optimization ought to help the branch predictor a _lot_, since the branch
targets will then be visible in the instruction stream.
Another thing we could do is try to be more clever about static hinting.
(Should Overflow always be unlikely?)
Original comment by magef...@gmail.com
on 7 Jan 2012 at 3:04
Unfortunately I had also come to the same conclusion. That's a bigger
refactoring than I want to do right now, especially since it looks like
tracejit took a 15% hit. We can't ship that and I can't fix it, so JM+TI gets
to start in the big game for 10 ESR. But I imagine we'll just have to bite the
bullet.
I'll take your microops modulo rldimi->rlwimi in 10, though.
Original comment by classi...@floodgap.com
on 7 Jan 2012 at 4:25
Couldn't do it for and (or any logical bit test that needs Rc) because it has
to test all the bits at once. But I did add an andis_rc for !(imm.m_value &
0xffff), reworked add32/sub32 and added analogous xori/xoris for xor32. Also
took your unaligned words patch. I think that's all we want to take in 10 -- I
really need to get the beta out by next weekend.
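In other words, the single-instruction and32 cases are limited to immediates confined to one halfword, roughly (operands illustrative):
andi. dest, src, loword(imm) // usable when hiword(imm) == 0 (andi. always records to CR0)
andis. dest, src, hiword(imm) // the new andis_rc case: usable when !(imm.m_value & 0xffff)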
So what's left is reexamining testing FPSCR for FP conversions, and figuring
out a way to handle branching without an insane amount of bookwork. It's harder
than the tracejit because we don't know our destination at the time the branch
is emitted (it gets patched later).
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 2:52
(btw, it all passes Dromaeo, SunSpider and gmake check jstests, so high
confidence of functionality)
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 2:53
One idea that isn't absolutely crazy is to let only relinkJump (and maybe
linkJump) promote branches to b/bc. Then we don't have to mess with relocateJumps,
because the location of the code has already been finalized in memory.
Promoting calls doesn't seem worth it; there aren't a lot of them and I bet the
majority are to locations out of range (stubs, etc.).
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 3:02
Blah, never mind, the fact that everything can change if an IC gets hot means
we probably can't cheat on this either. This is a pain.
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 3:05
So in terms of branching, I was figuring we'd do more-or-less the same thing
that MIPS does. Here's one possible proposal:
1. when the branch is created, insert
1: b end
2: nop
3: nop
4: bcctr condition (or bcctr ordered-test, for a floating-point branch)
5: nop (or bcctr condition, for a floating-point branch)
end:
Any mcrxr or emulation is before this 5-instruction stanza.
2. Whenever a branch is linked or relinked, look at the last two instructions
to determine the branch condition(s): the second-to-last instruction is
_always_ a bcctr, and the last instruction is either a bcctr or a nop.
3. If the branch offset fits in a bc (see the worked example after step 5), patch the first three instructions to
bc condition-from-4, target
bc condition-from-5, target (if 5 is a bcctr) -OR- b end
b end
4. Else, if the branch offset fits in a b, patch the first three instructions to
bc !(condition-from-4), end
bc !(condition-from-5), end (if 5 is a bcctr) -OR- b target
b target
5. Else, patch the first three instructions to
lis r0, hiword(target)
ori r0, loword(target)
mtctr r0
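As a worked instance of step 3: a plain Equal branch whose target is within conditional-branch displacement range (±32 KB) would end up as
beq target // was: b end
b end // was: nop
b end // was: nop
beqctr // unchanged, but now unreachable
nop
end: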
As for calls, if we set the -image_base of libxul to be just above the firefox
executable, it should fit entirely within the low 64M of the address space --
which means that we can try to use absolute branches for stub and native calls,
rather than indirect calls. (And stub calls, at least, are not uncommon.)
As far as relocateJumps goes, all we would need to do is three things:
1. look at the 3rd instruction of the stanza to see what kind of stanza it is,
and then read out the branch target offset from the appropriate place
2. if the branch target for the stanza is within the now-being-relocated code
segment, ignore this entire stanza: it doesn't need relocating
3. figure out the new offset, and repatch with it
I'm going to be really busy for the next two weeks or so, but I can try to get
this working after that, if you like...
Original comment by magef...@gmail.com
on 8 Jan 2012 at 6:11
That would be great, because you seem to have a good handle on what needs to be
done for it (the MIPS code made me dizzy, glad one of us understands it :).
There's no hurry since this won't be in 10.0.0. I'd like to get it into a
10.0.1pre but that's at least three to four weeks away. I'll have changesets up
early next week and I probably won't make any other changes before release
unless another bug is discovered.
Original comment by classi...@floodgap.com
on 8 Jan 2012 at 6:26
Ben's branching rework has landed. V8 on the G5 in reduced power improves by
45%. This is outstanding work and passes all tests in --jittests=amnp,amp (I
will be trying to get the browser to stand up next).
The last outstanding issue with performance other than G3/G4 "fsqrt" (in issue
96) is fixing FP conversion to use FPSCR if possible.
Original comment by classi...@floodgap.com
on 3 Feb 2012 at 5:45
I've got faux-sqrt working in basic form and simplified it somewhat.
I did work on branchConvertDouble etc., and now I remember: even though we
could test the bits, fctiwz. writes to CR1 and failureCases.append() wants a
standard branch, which uses CR0. Rather than do a lot of refactoring or create a
whole separate branching instruction for little gain on a relatively uncommon
operation, I'm just going to leave it. SPARC does essentially the same thing;
they just have a dedicated instruction. If profiling tells you otherwise,
please advise, but this works, at least.
Original comment by classi...@floodgap.com
on 4 Feb 2012 at 4:15
All tests pass with software square root. Been using the browser with the
branch rework for the last couple of hours. No crashes, super stable, noticeably
smoother. This is great.
Original comment by classi...@floodgap.com
on 4 Feb 2012 at 11:23
Snapshot against mozilla-release
Original comment by classi...@floodgap.com
on 6 Feb 2012 at 12:52
Mozilla may chemspill on 10 due to an issue that likely affects us as well
(M724284 which is sec-locked). I can't find any serious issues with this or
with the follow on patches, so I am considering shipping them with the
chemspill. I do want to get the G5 back up by "merging" the inline and constant
pool trampoline branching versions, but this can be shipped for the G5 (it's
still better than 10.0.0, just not as good as we know it can get). If there is
objection, speak now; these are atomic patches. Everything passes on the G5 and
G4 test rigs.
Original comment by classi...@floodgap.com
on 7 Feb 2012 at 8:43
I say go for it: I don't expect any difficulty making the trampoline code an
#ifdef option, but I'm not likely to get it done before the weekend -- and I,
for one, will be very happy to get a full browser with the performance
improvements!
Original comment by magef...@gmail.com
on 7 Feb 2012 at 10:12
I think I don't quite get the point.
Since the ESR isn't supposed to receive new features, its changeset has to be
maintained separately from its first release onward.
Now, TFF 10 hasn't been officially declared the ESR (TFF 10 is just said to be
equal to FF 10) -- so you could still release another beta (i.e. TFF 10 ESR
beta 1) and, if that proves stable for users, declare it the official TFF 10
ESR. After all, there's more time now for things like that.
Original comment by Tobias.N...@gmail.com
on 8 Feb 2012 at 6:58
After thinking it over, I think we're going to hold the changes for a beta.
Unlike 9, where we shipped methodjit in the RC but had it turned off, I can't
disable the changes here (they're part of methodjit). We have high confidence
that they will work based on testing, but it's a big change, and this is a
high-priority security release that we want people to adopt. I will ship the
issue 122 and issue 129 fixes, though, because those are very low risk and
Mozilla has already accepted them upstream.
So I'm going to build the chemspill (it has already gone to build) without
soft-sqrt or branch rework, then look at getting the merged version done (if
Ben hasn't done it already) and after that issue a beta for this.
Original comment by classi...@floodgap.com
on 8 Feb 2012 at 2:19
Mozilla is chemspilling for 10.0.2. I'm pretty confident in this now that
people have banged on it some and it passes amp,amnp,amdp,amndp in jit-test, so
we're going to ship it there. I wanted to get a fix in for issue 132 at the
same time but the Mozilla working patch actually really drags down paints, so
we're going to skip that.
Original comment by classi...@floodgap.com
on 17 Feb 2012 at 3:02
Shipped.
Ben and Tobias, anything further to add to this? I think it's time to push this
upstream to Mozilla.
Original comment by classi...@floodgap.com
on 17 Feb 2012 at 6:21
Here's one more patch, which is worth ~8% on sunspider/regexp-dna on my 7450.
It modifies YarrJIT.cpp to add a YARR_AVOID_CHARACTER_INDEX option and adds
support for it to MacroAssemblerPPC.h. This option uses one additional register
to directly track the byte position of the current character, and so enables
the use of Address-based rather than BaseIndex-based addressing in the
generated code. (Other big-RISC chips like MIPS and SPARC may also be
interested in this optimization...)
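The difference in the emitted character loads is roughly the following -- assuming 16-bit characters, and with register names that are only placeholders for whatever the option actually allocates:
// BaseIndex: scale the character index on every access
slwi tempRegister, characterIndex, 1
lhzx character, inputStart, tempRegister
// Address: with the byte position kept in a register, a plain displacement load suffices
lhz character, 0(currentPosition)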
Looking at the generated code with Shark shows that there are still some
pipeline stalls from the 3-cycle latency of loads on the G4, but the generated
code is at least substantially more compact. (It's still branchy as all hell,
but that's a _big_ project to try to change even if using CR logic is the
PPC-idiomatic way.)
As far as this bug goes, though, I think we're done. Congrats all around, it
turned out really well.
Original comment by magef...@gmail.com
on 20 Feb 2012 at 5:47
Just in case Mozilla doesn't like/want the YarrJIT portion, can that patch
stand on its own without it (modulo the performance improvement)? It looks like
it can. But I like the concept.
Original comment by classi...@floodgap.com
on 20 Feb 2012 at 5:51
I wouldn't even try to put the YarrJIT part in the initial submission to
Mozilla: it should be a separate bz# of its own. But if you're asking whether
things still compile and work without the changes to YarrJIT.cpp, the answer is
that they do.
Original comment by magef...@gmail.com
on 21 Feb 2012 at 2:42
Good. I'll roll it into 10.0.3pre in the meantime.
The server switchover went so smoothly (and required much less time) that I'm
thinking of doing a formal 11 unstable beta, too. #masochist
Original comment by classi...@floodgap.com
on 21 Feb 2012 at 2:44
Apple just finished implementing a totally new macro assembler for
JavaScriptCore:
http://trac.webkit.org/changeset/108444
If that really proves to boost performance, Mozilla might be interested in it,
too...
Original comment by Tobias.N...@gmail.com
on 22 Feb 2012 at 7:06
Eh, they've got IonMonkey coming along. It's still pretty embryonic but it's
promising. I hope IonMonkey lands after 17 though so that we don't have to
worry about it ourselves.
Anyway, the patch drops regex from 34 to 30, so that's in line with your 8%
estimate. I'll take any wins we can get. :) In for 10.0.3pre.
Original comment by classi...@floodgap.com
on 23 Feb 2012 at 3:28
While prepping the code to ship to Mozilla, I noticed I didn't finish the G5
work on all of the FPR<->GPR routines. This is just a matter of splitting
dispatch groups with nops, so I'll do this for .3RC.
Original comment by classi...@floodgap.com
on 27 Feb 2012 at 4:42
Original issue reported on code.google.com by
classi...@floodgap.com
on 19 Dec 2011 at 5:29