Closed GoogleCodeExporter closed 9 years ago
SunSpider profile, with marks on the tests whose traces perform the same or
better (unmarked tests perform worse). Next: see whether the bad traces have
any LIR in common.
============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total: 5242.2ms +/- 0.3%
--------------------------------------------
  3d: 990.2ms +/- 0.5%
    cube: 174.5ms +/- 1.1%
    morph: 634.0ms +/- 0.7%
    raytrace: 181.7ms +/- 0.9%
  access: 1121.7ms +/- 1.0%
    binary-trees: 53.8ms +/- 1.9% < same
    fannkuch: 702.0ms +/- 1.4%
    nbody: 62.2ms +/- 1.8% < better
    nsieve: 303.7ms +/- 1.4%
  bitops: 812.2ms +/- 1.1%
    3bit-bits-in-byte: 56.9ms +/- 1.6% < better
    bits-in-byte: 216.8ms +/- 2.0%
    bitwise-and: 259.4ms +/- 1.0%
    nsieve-bits: 279.1ms +/- 2.1%
  controlflow: 64.1ms +/- 0.8%
    recursive: 64.1ms +/- 0.8% < same
  crypto: 264.3ms +/- 1.1%
    aes: 180.4ms +/- 1.4%
    md5: 35.4ms +/- 1.7% < better
    sha1: 48.5ms +/- 1.0% < better
  date: 210.4ms +/- 1.6%
    format-tofte: 166.5ms +/- 2.0%
    format-xparb: 43.9ms +/- 2.2% < better
  math: 732.5ms +/- 0.7%
    cordic: 387.9ms +/- 1.1%
    partial-sums: 117.1ms +/- 0.8% < same
    spectral-norm: 227.5ms +/- 0.5%
  regexp: 568.2ms +/- 0.3%
    dna: 568.2ms +/- 0.3% < same
  string: 478.6ms +/- 0.5%
    base64: 92.6ms +/- 1.3%
    fasta: 111.3ms +/- 0.9%
    tagcloud: 114.7ms +/- 1.2% < same
    unpack-code: 100.8ms +/- 0.7% < same
    validate-input: 59.2ms +/- 1.1% < same
Original comment by classi...@floodgap.com
on 24 Jan 2011 at 1:36
For comparison,
============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total: 3377.9ms +/- 0.4%
--------------------------------------------
  3d: 425.0ms +/- 0.7%
    cube: 157.1ms +/- 0.5%
    morph: 144.7ms +/- 1.3%
    raytrace: 123.2ms +/- 1.0%
  access: 592.4ms +/- 0.3%
    binary-trees: 53.3ms +/- 1.7%
    fannkuch: 311.1ms +/- 0.3%
    nbody: 136.5ms +/- 0.8%
    nsieve: 91.5ms +/- 0.7%
  bitops: 493.0ms +/- 0.7%
    3bit-bits-in-byte: 102.7ms +/- 0.9%
    bits-in-byte: 130.3ms +/- 0.7%
    bitwise-and: 87.6ms +/- 1.3%
    nsieve-bits: 172.4ms +/- 1.5%
  controlflow: 64.2ms +/- 0.9%
    recursive: 64.2ms +/- 0.9%
  crypto: 215.8ms +/- 0.4%
    aes: 90.7ms +/- 0.6%
    md5: 60.0ms +/- 0.6%
    sha1: 65.1ms +/- 1.0%
  date: 147.1ms +/- 1.5%
    format-tofte: 87.6ms +/- 0.6%
    format-xparb: 59.5ms +/- 3.5%
  math: 441.3ms +/- 2.5%
    cordic: 211.7ms +/- 1.1%
    partial-sums: 121.9ms +/- 8.4%
    spectral-norm: 107.7ms +/- 0.7%
  regexp: 567.1ms +/- 0.2%
    dna: 567.1ms +/- 0.2%
  string: 432.0ms +/- 0.6%
    base64: 63.4ms +/- 1.8%
    fasta: 101.6ms +/- 0.8%
    tagcloud: 112.1ms +/- 1.7%
    unpack-code: 97.8ms +/- 1.0%
    validate-input: 57.1ms +/- 1.1%
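The same/better/worse marks can be reproduced mechanically from the two reports. The following is only a rough sketch: it assumes the first report is the tracing run and the second is the comparison run, and the 5% tolerance for "same" is a guess that happens to fit the marks, not something stated in this bug. The helper names (`parse`, `classify`) are made up for illustration, and only a handful of lines from each report are pasted in.

```python
import re

# Matches SunSpider result lines such as "nbody: 62.2ms +/- 1.8% < better".
LINE = re.compile(r"^\s*([\w-]+):\s*([\d.]+)ms \+/- ([\d.]+)%")

def parse(report):
    """Return {test name: mean time in ms} for every result line in a report."""
    results = {}
    for line in report.splitlines():
        match = LINE.match(line)
        if match:
            results[match.group(1)] = float(match.group(2))
    return results

def classify(traced_ms, baseline_ms, tolerance=0.05):
    """Label a test by the ratio of traced time to baseline time.

    The tolerance is an assumption, not a value taken from the bug.
    """
    ratio = traced_ms / baseline_ms
    if ratio < 1 - tolerance:
        return "better"
    if ratio <= 1 + tolerance:
        return "same"
    return "worse"

# A few lines copied from each report above.
TRACED = """\
cube: 174.5ms +/- 1.1%
binary-trees: 53.8ms +/- 1.9% < same
nbody: 62.2ms +/- 1.8% < better
partial-sums: 117.1ms +/- 0.8% < same
base64: 92.6ms +/- 1.3%
"""
BASELINE = """\
cube: 157.1ms +/- 0.5%
binary-trees: 53.3ms +/- 1.7%
nbody: 136.5ms +/- 0.8%
partial-sums: 121.9ms +/- 8.4%
base64: 63.4ms +/- 1.8%
"""

traced = parse(TRACED)
baseline = parse(BASELINE)
verdicts = {
    name: classify(traced[name], baseline[name])
    for name in traced
    if name in baseline
}
for name, verdict in verdicts.items():
    print(f"{name}: {traced[name] / baseline[name]:.2f}x -> {verdict}")
```

Pasting the full reports into `TRACED` and `BASELINE` should work the same way, since the pattern ignores leading whitespace and the trailing marks.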
Original comment by classi...@floodgap.com
on 24 Jan 2011 at 1:37
Analysis of JSOPs that were not used in the same or better tests:
used: ursh
used:
>>> not used: ne
used: ifeq
used: moreiter
used: le
used: not
used: dup2
used: string
used: double
used: trace
used: bindgname
used: bitxor
used: setprop
>>> not used: lineno
>>> not used: uint24
used: eq
used: neg
used: bitor
used: ifne
used: setarg
>>> not used: top
used: one
used: getelem
used: callarg
used: and
used: ge
used: int8
>>> not used: lambda
used: callgname
>>> not used: gnameinc
used: true
>>> not used: getfcslot
used: rop
used: callglobal
used: forlocal
used: bitnot
used: zero
used: enditer
used: getglobal
used: notrace
>>> not used: localdec
>>> not used: prop
used: ng
used: length
used: regexp
used: getthisprop
used: gt
used: initelem
used: pop
>>> not used: deflocalfun
used: mod
used: getlocal
used: bitand
used: false
used: newarray
used: imtop
used: or
used: incgname
used: setlocal
used: getgname
used: new
>>> not used: calllocal
used: this
used: iter
used: getarg
used: lsh
used: null
used: localinc
used: lt
used: push
used: nullblockchain
used: uint16
used: div
used: rsh
used: callprop
>>> not used: nop
used: add
used: callname
used: getlocalprop
used: mul
used: call
used: goto
>>> not used: eval
used: setgname
used: stop
used: getprop
used: setelem
used: return
used: sub
used: endinit
used: inclocal
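The list above is essentially a set difference: every op seen in any test, minus the ops seen in the same-or-better tests. A minimal sketch of that bookkeeping follows. The per-test op sets here are placeholders made up for illustration, NOT measured data from this bug; only the op names themselves come from the list.

```python
# Placeholder data: which JSOPs each test exercised. These sets are invented
# for illustration only; real data would come from an instrumented build.
ops_by_test = {
    "nbody": {"getlocal", "mul", "add", "setprop"},
    "md5": {"getlocal", "bitxor", "ursh", "add"},
    "cube": {"getlocal", "calllocal", "lambda", "add"},
    "base64": {"getlocal", "uint24", "gnameinc", "add"},
}

# Tests marked "same" or "better" (again, a placeholder subset).
good_tests = {"nbody", "md5"}

# Ops used by at least one test of any kind.
all_ops = set().union(*ops_by_test.values())

# Ops used by at least one same-or-better test.
good_ops = set().union(*(ops_by_test[name] for name in good_tests))

# Ops that never appear in a same-or-better test are the suspects.
suspects = sorted(all_ops - good_ops)

for op in sorted(all_ops):
    prefix = ">>> not used:" if op in suspects else "used:"
    print(prefix, op)
```

An op landing in `suspects` only says it never showed up in a good test; it does not prove the op is slow, which is why each one still has to be tested individually.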
Original comment by classi...@floodgap.com
on 26 Jan 2011 at 4:33
A sample build with JSOP_GETFCSLOT, JSOP_LAMBDA, JSOP_DEFLOCALFUN, and
JSOP_CALLLOCAL reduced to ARECORD_ABORTED in jstracer.cpp showed dramatically
faster JS across the board. Time to figure out which of the four is the actual
offender -- or it could be all of them. However, we now have TraceMonkey
benching better than the interpreter for the first time on G5!!! Let's do this
for beta 11!
Original comment by classi...@floodgap.com
on 26 Jan 2011 at 4:43
Unfortunately the speedup only showed up in debug mode. Actual browser
performance did improve, but only from about 5200ms to around 4700ms. To get a
significant win, we need to be under 3000ms.
JSOPs audit: the slow ones appear to be JSOP_LINENO (???), _UINT24, _CALLLOCAL
and _GNAMEDEC/INC (LINENO is uncertain because I don't have good test
coverage for it). The other ops made little difference whether on or off, and
some got worse.
The next steps are:
1) Look at only the ops used by the faster tests, and abort tracing for
every other op. This may not be possible.
2) The slow ops seem to have stack issues. Perhaps the stack is the problem, but
I'm not sure yet.
Original comment by classi...@floodgap.com
on 26 Jan 2011 at 9:11
Current set of blacklisted JSOPs: NEG, anything calling setElem, CALLNAME,
LINENO, UINT24, CALLLOCAL, GNAMEDEC, GNAMEINC. This gets us to 3700ms in
SunSpider and wins on both Dromaeo and V8, so this is good enough to ship.
Original comment by classi...@floodgap.com
on 30 Jan 2011 at 4:41
changing flags
Original comment by classi...@floodgap.com
on 31 Jan 2011 at 1:40
On our internal pull, RealClearPolitics has trouble with clicking on links.
This does work in b9 with the nanojit on. Not sure if it's our blacklist or the
interpreter, so making a note to recheck this after our next pull.
Original comment by classi...@floodgap.com
on 31 Jan 2011 at 2:43
Fixed by pull, so conclude Mozilla bug.
Original comment by classi...@floodgap.com
on 3 Feb 2011 at 5:25
Dropping priority as we appear to have reached a maximum for G5.
Original comment by classi...@floodgap.com
on 22 Mar 2011 at 1:09
Here is something interesting, from glibc:
/* long int[r3] __lrint (double x[fp1]) */
ENTRY (__lrint)
stwu r1,-16(r1)
fctiw fp13,fp1
stfd fp13,8(r1)
nop /* Insure the following load is in a different dispatch group */
nop /* to avoid pipe stall on POWER4&5. */
nop
lwz r3,12(r1)
addi r1,r1,16
blr
END (__lrint)
This might be useful for ::asm_d2i -- we could insert some nop()s there.
Original comment by classi...@floodgap.com
on 5 Apr 2011 at 4:51
Other interesting optimizations:
http://sourceware.org/ml/libc-ports/2005-12/msg00004.html
mtctr rTMP /* Power4 wants mtctr 1st in dispatch group */
And they do use the same trick for fctid:
+ENTRY (__llrintf)
+ CALL_MCOUNT
+ fctid fp13,fp1
+ stfd fp13,-8(r1)
+ nop /* Insure the following load is in a different dispatch group */
+ nop /* to avoid pipe stall on POWER4&5. */
+ nop
+ lwz r3,-8(r1)
+ lwz r4,-4(r1)
+ blr
+ END (__llrintf)
Original comment by classi...@floodgap.com
on 5 Apr 2011 at 5:14
And we also need to get MCRXR out of the nanojit. It is NOT native on G5! Argh!
No wonder the G4 runs rings around it! We should replace it with the equivalent
mfxer and mtxer (i.e., mfspr rT,1 and mtspr 1,rT) for G5. Something like
+ mfxer Rx
+ mtcrf 0, Rx
and, if we need the XER cleared (we probably should),
+ rlwinm Rx,Rx,0,0,28
+ mtxer Rx
should work ...
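To sanity-check a replacement sequence before touching the nanojit, the bit-level behaviour can be modelled outside the assembler. This is only a sketch under stated assumptions: it models a 32-bit XER and CR, targets cr0, and uses a field mask of 0x80 and an rlwinm range of 4..31. Those operands were picked so the model matches what mcrxr is documented to do (copy XER[0:3] into the CR field, then clear XER[0:3]); they are not taken from the snippet in this comment. The function names are made up for the model.

```python
MASK32 = 0xFFFFFFFF

def ppc_mask(mb, me):
    """32-bit mask with bits mb..me set, PowerPC numbering (bit 0 is the MSB)."""
    mask = 0
    bit = mb
    while True:
        mask |= 1 << (31 - bit)
        if bit == me:
            return mask
        bit = (bit + 1) % 32

def rlwinm(rs, sh, mb, me):
    """Rotate left by sh, then AND with the mb..me mask."""
    rotated = ((rs << sh) | (rs >> (32 - sh))) & MASK32
    return rotated & ppc_mask(mb, me)

def mtcrf(cr, fxm, rs):
    """Copy the 4-bit fields of rs selected by the 8-bit mask fxm into cr."""
    mask = 0
    for field in range(8):
        if fxm & (0x80 >> field):
            mask |= 0xF << (28 - 4 * field)
    return (rs & mask) | (cr & ~mask & MASK32)

def mcrxr(xer, cr, crf):
    """Reference model: CR field crf <- XER[0:3], then XER[0:3] <- 0."""
    shift = 28 - 4 * crf
    nibble = (xer >> 28) & 0xF
    cr = (cr & ~(0xF << shift) & MASK32) | (nibble << shift)
    return xer & 0x0FFFFFFF, cr

def emulated_mcrxr_cr0(xer, cr):
    """Assumed replacement sequence for 'mcrxr cr0'."""
    rx = xer                    # mfxer  Rx
    cr = mtcrf(cr, 0x80, rx)    # mtcrf  0x80,Rx     (cr0 only: assumption)
    rx = rlwinm(rx, 0, 4, 31)   # rlwinm Rx,Rx,0,4,31 (clear bits 0..3: assumption)
    return rx, cr               # mtxer  Rx

# Compare the two models on a few XER values with an arbitrary starting CR.
for xer in (0xE0000012, 0x20000000, 0x0000007F, 0x00000000):
    assert mcrxr(xer, 0x12345678, 0) == emulated_mcrxr_cr0(xer, 0x12345678)
```

The model says nothing about speed on the G5; it only checks that the register contents end up the same, so timing still has to come from real benchmarks.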
http://www.macintouch.com/tiger20.html and from Common Lisp,
(in-package "CCL")
(defppclapfunction do-mcrxr ((n arg_z))
loop
(cmpwi :cr1 arg_z '1)
(mcrxr 0)
(subi arg_z arg_z '1)
(bge :cr1 loop)
(blr))
(defppclapfunction do-mtxer ((n arg_z))
loop
(cmpwi :cr1 arg_z '1)
(mtxer rzero)
(subi arg_z arg_z '1)
(bge :cr1 loop)
(blr))
;;; (time (do-mcrxr 100000000))
;;; (time (do-mtxer 100000000))
Original comment by classi...@floodgap.com
on 5 Apr 2011 at 5:45
Apple code implies that mtcrf is okay for individual CR fields, provided each
mtcrf moves only one field. From
http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/bcopy.s
shortcopy:
cmplw r12,r5 ; must move reverse if (dest-source)<length
mtcrf 2,r5 ; move length to cr6 and cr7 one at a time...
mtcrf 1,r5 ; ...which is faster on G4 and G5
bge++ backend ; handle forward moves (most common case)
add r6,r6,r5 ; point one past end of operands in reverse moves
add r4,r4,r5
b bbackend ; handle reverse moves
Although the modified code should work for G4/G3, we will keep mcrxr on those
systems to reduce icache pressure.
Original comment by classi...@floodgap.com
on 5 Apr 2011 at 6:04
BASE
Richards: 144
DeltaBlue: 210
Crypto: 107
RayTrace: 407
EarleyBoyer: 521
----
Score: 233
SunSpider now 1760
MTCRF (swapon)
Richards: 1551
DeltaBlue: 479
Crypto: 915
RayTrace: 351
EarleyBoyer: 478
----
Score: 648
MTCRF (swapoff)
Richards: 1560
DeltaBlue: 480
Crypto: 910
RayTrace: 352
EarleyBoyer: 479
----
Score: 649
MCRXR (swapon)
Richards: 679
DeltaBlue: 318
Crypto: 22.3
RayTrace: 344
EarleyBoyer: 461
----
Score: 238
We keep the swap. We lose the mcrxr for G5. Everybody wins.
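The test names above are those of the V8 benchmark suite, whose overall score is the geometric mean of the individual test scores. As a quick cross-check, here is a small sketch that recomputes each Score line from the numbers in this comment. The function name `suite_score` is made up for illustration.

```python
import math

def suite_score(scores):
    """Geometric mean of the individual benchmark scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Individual scores copied from the runs above, in the order
# Richards, DeltaBlue, Crypto, RayTrace, EarleyBoyer.
RUNS = {
    "BASE": [144, 210, 107, 407, 521],
    "MTCRF (swapon)": [1551, 479, 915, 351, 478],
    "MTCRF (swapoff)": [1560, 480, 910, 352, 479],
    "MCRXR (swapon)": [679, 318, 22.3, 344, 461],
}

for name, scores in RUNS.items():
    print(f"{name}: {suite_score(scores):.1f}")
```

Because the mean is geometric, a single collapsed test such as Crypto at 22.3 in the MCRXR run drags the whole score down much harder than it would in an arithmetic average.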
VERIFIED
Original comment by classi...@floodgap.com
on 8 Apr 2011 at 10:00
Original issue reported on code.google.com by
classi...@floodgap.com
on 18 Jan 2011 at 7:11