G5-specific nanojit profiling

GoogleCodeExporter commented 9 years ago

Find those operations that _are_ faster on G5 with the nanojit. Dromaeo sans 
SunSpider is a win even for G5, so we know they exist. Spun off from issue 20.

Original issue reported on code.google.com by classi...@floodgap.com on 18 Jan 2011 at 7:11

GoogleCodeExporter commented 9 years ago

SunSpider profile, marking the tests whose traces perform the same or better 
(unmarked tests perform worse). Next: see whether the bad traces share a LIR 
commonality.

============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total:                  5242.2ms +/- 0.3%
--------------------------------------------

  3d:                    990.2ms +/- 0.5%
    cube:                174.5ms +/- 1.1%
    morph:               634.0ms +/- 0.7%
    raytrace:            181.7ms +/- 0.9%

  access:               1121.7ms +/- 1.0%
    binary-trees:         53.8ms +/- 1.9% < same
    fannkuch:            702.0ms +/- 1.4%
    nbody:                62.2ms +/- 1.8% < better
    nsieve:              303.7ms +/- 1.4%

  bitops:                812.2ms +/- 1.1%
    3bit-bits-in-byte:    56.9ms +/- 1.6% < better
    bits-in-byte:        216.8ms +/- 2.0%
    bitwise-and:         259.4ms +/- 1.0%
    nsieve-bits:         279.1ms +/- 2.1%

  controlflow:            64.1ms +/- 0.8%
    recursive:            64.1ms +/- 0.8% < same

  crypto:                264.3ms +/- 1.1%
    aes:                 180.4ms +/- 1.4%
    md5:                  35.4ms +/- 1.7% < better
    sha1:                 48.5ms +/- 1.0% < better

  date:                  210.4ms +/- 1.6%
    format-tofte:        166.5ms +/- 2.0%
    format-xparb:         43.9ms +/- 2.2% < better

  math:                  732.5ms +/- 0.7%
    cordic:              387.9ms +/- 1.1%
    partial-sums:        117.1ms +/- 0.8% < same
    spectral-norm:       227.5ms +/- 0.5%

  regexp:                568.2ms +/- 0.3%
    dna:                 568.2ms +/- 0.3% < same

  string:                478.6ms +/- 0.5%
    base64:               92.6ms +/- 1.3%
    fasta:               111.3ms +/- 0.9%
    tagcloud:            114.7ms +/- 1.2% < same
    unpack-code:         100.8ms +/- 0.7% < same
    validate-input:       59.2ms +/- 1.1% < same

Original comment by classi...@floodgap.com on 24 Jan 2011 at 1:36

GoogleCodeExporter commented 9 years ago

For comparison:

============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total:                 3377.9ms +/- 0.4%
--------------------------------------------

  3d:                   425.0ms +/- 0.7%
    cube:               157.1ms +/- 0.5%
    morph:              144.7ms +/- 1.3%
    raytrace:           123.2ms +/- 1.0%

  access:               592.4ms +/- 0.3%
    binary-trees:        53.3ms +/- 1.7%
    fannkuch:           311.1ms +/- 0.3%
    nbody:              136.5ms +/- 0.8%
    nsieve:              91.5ms +/- 0.7%

  bitops:               493.0ms +/- 0.7%
    3bit-bits-in-byte:  102.7ms +/- 0.9%
    bits-in-byte:       130.3ms +/- 0.7%
    bitwise-and:         87.6ms +/- 1.3%
    nsieve-bits:        172.4ms +/- 1.5%

  controlflow:           64.2ms +/- 0.9%
    recursive:           64.2ms +/- 0.9%

  crypto:               215.8ms +/- 0.4%
    aes:                 90.7ms +/- 0.6%
    md5:                 60.0ms +/- 0.6%
    sha1:                65.1ms +/- 1.0%

  date:                 147.1ms +/- 1.5%
    format-tofte:        87.6ms +/- 0.6%
    format-xparb:        59.5ms +/- 3.5%

  math:                 441.3ms +/- 2.5%
    cordic:             211.7ms +/- 1.1%
    partial-sums:       121.9ms +/- 8.4%
    spectral-norm:      107.7ms +/- 0.7%

  regexp:               567.1ms +/- 0.2%
    dna:                567.1ms +/- 0.2%

  string:               432.0ms +/- 0.6%
    base64:              63.4ms +/- 1.8%
    fasta:              101.6ms +/- 0.8%
    tagcloud:           112.1ms +/- 1.7%
    unpack-code:         97.8ms +/- 1.0%
    validate-input:      57.1ms +/- 1.1%
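
The "< same" and "< better" marks can be reproduced mechanically. A minimal sketch in Python, assuming the first table is the run being judged and the second is the baseline. The 5% band for "same" is an assumption that happens to match the marks above, not a rule recorded at the time, and only a handful of tests are copied in:

```python
# Times in ms copied from the two SunSpider runs above (subset only).
first_run = {
    "cube": 174.5, "binary-trees": 53.8, "nbody": 62.2,
    "md5": 35.4, "fasta": 111.3, "unpack-code": 100.8,
}
comparison_run = {
    "cube": 157.1, "binary-trees": 53.3, "nbody": 136.5,
    "md5": 60.0, "fasta": 101.6, "unpack-code": 97.8,
}

TOLERANCE = 0.05  # assumption: within 5% either way counts as "same"


def classify(first_ms, comparison_ms, tolerance=TOLERANCE):
    """Label the first run relative to the comparison run."""
    ratio = first_ms / comparison_ms
    if ratio < 1 - tolerance:
        return "better"
    if ratio > 1 + tolerance:
        return "worse"
    return "same"


marks = {name: classify(first_run[name], comparison_run[name]) for name in first_run}

for name, mark in marks.items():
    print(f"{name:15s} {first_run[name]:7.1f}ms vs {comparison_run[name]:7.1f}ms  {mark}")
```

A stricter rule based on non-overlapping confidence intervals would mark unpack-code as worse, so the band width matters for the borderline string tests.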

Original comment by classi...@floodgap.com on 24 Jan 2011 at 1:37

GoogleCodeExporter commented 9 years ago

Analysis of which JSOPs were not used in the same-or-better tests:

used: ursh
used: 
>>> not used: ne
used: ifeq
used: moreiter
used: le
used: not
used: dup2
used: string
used: double
used: trace
used: bindgname
used: bitxor
used: setprop
>>> not used: lineno
>>> not used: uint24
used: eq
used: neg
used: bitor
used: ifne
used: setarg
>>> not used: top
used: one
used: getelem
used: callarg
used: and
used: ge
used: int8
>>> not used: lambda
used: callgname
>>> not used: gnameinc
used: true
>>> not used: getfcslot
used: rop
used: callglobal
used: forlocal
used: bitnot
used: zero
used: enditer
used: getglobal
used: notrace
>>> not used: localdec
>>> not used: prop
used: ng
used: length
used: regexp
used: getthisprop
used: gt
used: initelem
used: pop
>>> not used: deflocalfun
used: mod
used: getlocal
used: bitand
used: false
used: newarray
used: imtop
used: or
used: incgname
used: setlocal
used: getgname
used: new
>>> not used: calllocal
used: this
used: iter
used: getarg
used: lsh
used: null
used: localinc
used: lt
used: push
used: nullblockchain
used: uint16
used: div
used: rsh
used: callprop
>>> not used: nop
used: add
used: callname
used: getlocalprop
used: mul
used: call
used: goto
>>> not used: eval
used: setgname
used: stop
used: getprop
used: setelem
used: return
used: sub
used: endinit
used: inclocal
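
The list above is a set difference: every opcode seen anywhere, minus the opcodes seen in the same-or-better tests. A minimal sketch of that computation follows. The per-test opcode sets here are invented for illustration (the real ones came from profiling), and `ops_by_test` and `good_tests` are hypothetical names:

```python
# Hypothetical per-test opcode sets, NOT the real profiling data.
ops_by_test = {
    "nbody": {"getlocal", "mul", "add", "setprop", "getelem", "setelem"},
    "md5": {"bitand", "bitor", "ursh", "getarg"},
    "morph": {"getlocal", "mul", "uint24", "calllocal"},
    "fannkuch": {"getelem", "setelem", "localdec"},
}
good_tests = {"nbody", "md5"}  # tests marked same or better

# Opcodes that appear in at least one good test.
used_in_good = set().union(*(ops_by_test[t] for t in good_tests))
# Opcodes that appear in any test at all.
all_ops = set().union(*ops_by_test.values())

for op in sorted(all_ops):
    prefix = "used: " if op in used_in_good else ">>> not used: "
    print(prefix + op)

# Candidates for blacklisting: seen only in tests that got slower.
suspects = all_ops - used_in_good
```

An op showing up only in slow tests is a suspect, not proof; it still has to be confirmed by turning it off and re-benchmarking.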

Original comment by classi...@floodgap.com on 26 Jan 2011 at 4:33

GoogleCodeExporter commented 9 years ago

A sample build with JSOP_GETFCSLOT, JSOP_LAMBDA, JSOP_DEFLOCALFUN, and 
JSOP_CALLLOCAL reduced to ARECORD_ABORTED in jstracer.cpp showed dramatically 
faster JS across the board. Time to figure out the actual offender of the four 
-- or it could be all of them. However, we now have TraceMonkey benching better 
than interpreter for the first time on G5!!! Let's do this for beta 11!

Original comment by classi...@floodgap.com on 26 Jan 2011 at 4:43

Changed state: Started
Added labels: Milestone-NextBeta, Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Unfortunately, the speedup was only in debug mode. Actual browser performance 
did improve, but only from 5200 to around 4700. To get a significant win, we 
need to be under 3000.

JSOPs audit: the slow ones appear to be JSOP_LINENO (???), _UINT24, _CALLLOCAL 
and _GNAMEDEC/INC (LINENO is uncertain because I don't have good test 
coverage for it). The other ops made little difference whether on or off, and 
some got worse.

The next steps are:
1) Look at the instructions used by the faster ones only, and abort tracing for 
the other ops. This may not be possible.
2) The slow ops seem to have stack issues. Perhaps the stack is the problem, 
but I'm not sure yet.

Original comment by classi...@floodgap.com on 26 Jan 2011 at 9:11

GoogleCodeExporter commented 9 years ago

Current set of blacklisted JSOPs: NEG, anything calling setElem, CALLNAME, 
LINENO, UINT24, CALLLOCAL, GNAMEDEC, GNAMEINC. This gets us to 3700ms in 
SunSpider and wins on both Dromaeo and V8, so this is good enough to ship.
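
As a toy model of what the blacklist does: when the recorder meets a blacklisted op it gives up on the trace, and that code stays in the interpreter. The sketch below is Python, not the jstracer.cpp change itself. `record` and the "RECORDED" status are made up for illustration, and "anything calling setElem" is left out because it is not a single opcode:

```python
# Opcodes taken from the blacklist above.
BLACKLIST = {"NEG", "CALLNAME", "LINENO", "UINT24", "CALLLOCAL", "GNAMEDEC", "GNAMEINC"}


def record(ops, blacklist=BLACKLIST):
    """Toy recorder: return (status, ops recorded before stopping)."""
    recorded = []
    for op in ops:
        if op in blacklist:
            # Abort the trace; the interpreter keeps running this code.
            return "ARECORD_ABORTED", recorded
        recorded.append(op)
    # "RECORDED" is a placeholder status for this sketch only.
    return "RECORDED", recorded


print(record(["GETLOCAL", "ADD", "SETLOCAL"]))
print(record(["GETLOCAL", "UINT24", "ADD"]))
```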

Original comment by classi...@floodgap.com on 30 Jan 2011 at 4:41

GoogleCodeExporter commented 9 years ago

changing flags

Original comment by classi...@floodgap.com on 31 Jan 2011 at 1:40

Removed labels: Milestone-NextBeta

GoogleCodeExporter commented 9 years ago

On our internal pull, RealClearPolitics has trouble with clicking on links. 
This does work in b9 with the nanojit on. Not sure if it's our blacklist or the 
interpreter, so making a note to recheck this after our next pull.

Original comment by classi...@floodgap.com on 31 Jan 2011 at 2:43

GoogleCodeExporter commented 9 years ago

Fixed by the pull, so we conclude it was a Mozilla bug.

Original comment by classi...@floodgap.com on 3 Feb 2011 at 5:25

GoogleCodeExporter commented 9 years ago

Dropping priority as we appear to have reached a maximum for G5.

Original comment by classi...@floodgap.com on 22 Mar 2011 at 1:09

Added labels: Priority-Medium
Removed labels: Priority-High

GoogleCodeExporter commented 9 years ago

Here is something interesting, from glibc:

/* long int[r3] __lrint (double x[fp1])  */
ENTRY (__lrint)
        stwu    r1,-16(r1)
        fctiw   fp13,fp1
        stfd    fp13,8(r1)
        nop     /* Insure the following load is in a different dispatch group */
        nop     /* to avoid pipe stall on POWER4&5.  */
        nop
        lwz     r3,12(r1)
        addi    r1,r1,16
        blr
        END (__lrint)

This might be useful for ::asm_d2i -- we could insert some nop()s there.
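
A rough sketch of the idea, as a Python function that builds the instruction list rather than real nanojit code: the store/load pair is the same as in __lrint, with three nops between them when padding is requested. `emit_d2i` and `pad_for_g5` are hypothetical names, and whether three nops is the right count for the code nanojit actually emits is untested:

```python
def emit_d2i(pad_for_g5):
    """Return a d2i-style instruction list modelled on glibc's __lrint."""
    seq = [
        "stwu r1,-16(r1)",
        "fctiw fp13,fp1",
        "stfd fp13,8(r1)",
    ]
    if pad_for_g5:
        # Push the load into a later dispatch group than the store,
        # following the glibc comment about POWER4/5.
        seq += ["nop"] * 3
    seq += ["lwz r3,12(r1)", "addi r1,r1,16"]
    return seq


for line in emit_d2i(pad_for_g5=True):
    print(line)
```

Keeping the padding behind a flag means G3/G4 builds would not pay for the extra instructions.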

Original comment by classi...@floodgap.com on 5 Apr 2011 at 4:51

GoogleCodeExporter commented 9 years ago

Other interesting optimizations:

http://sourceware.org/ml/libc-ports/2005-12/msg00004.html
mtctr   rTMP    /* Power4 wants mtctr 1st in dispatch group */

And they do use the same trick for fctid:
+ENTRY (__llrintf)  
+   CALL_MCOUNT
+   fctid   fp13,fp1
+   stfd    fp13,-8(r1)
+   nop /* Insure the following load is in a different dispatch group */
+   nop /* to avoid pipe stall on POWER4&5.  */
+   nop
+   lwz r3,-8(r1)
+   lwz r4,-4(r1)   
+   blr
+   END (__llrintf)

Original comment by classi...@floodgap.com on 5 Apr 2011 at 5:14

GoogleCodeExporter commented 9 years ago

And we also need to get MCRXR out of the nanojit: it is NOT native on G5! Argh! 
No wonder the G4 runs rings around it! We should replace it with the equivalent 
mfxer and mtxer (i.e., mfspr rT,1 and mtspr 1,rT) for G5. Something like

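+    mfxer  Rx 
+    mtcrf  128, Rx        ; CRM mask 0x80 selects cr0 (a mask of 0 selects no field)
and, if we need the XER cleared (we probably should), 
+    rlwinm Rx,Rx,0,4,31   ; clear XER[0:3], which is what mcrxr does
+    mtxer  Rx 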
should work ...
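
To sanity-check the replacement, here is a small Python simulation comparing mcrxr into cr0 against the mfxer / mtcrf / rlwinm / mtxer sequence. It assumes the CRM mask for cr0 is 0x80 and that the bits to clear are XER[0:3] (the top four bits in IBM numbering, hence an rlwinm mask of bits 4 to 31). Those operands are my reading of the instruction definitions, not something tested on hardware:

```python
MASK32 = 0xFFFFFFFF


def mcrxr(xer, cr):
    """Reference semantics for cr0: CR[0:3] = XER[0:3], then XER[0:3] = 0."""
    top = xer & 0xF0000000
    return xer & 0x0FFFFFFF, (cr & 0x0FFFFFFF) | top


def rlwinm(value, sh, mb, me):
    """Rotate left by sh, then AND with ones from bit mb to bit me (mb <= me only)."""
    rotated = ((value << sh) | (value >> (32 - sh))) & MASK32
    mask = (MASK32 >> mb) & (MASK32 << (31 - me)) & MASK32
    return rotated & mask


def mtcrf(crm, rs, cr):
    """Copy the 4-bit fields of rs selected by the 8-bit CRM mask into cr."""
    mask = 0
    for field in range(8):
        if crm & (0x80 >> field):
            mask |= 0xF << (28 - 4 * field)
    return (cr & ~mask & MASK32) | (rs & mask)


def emulated_mcrxr(xer, cr):
    rx = xer                    # mfxer  Rx
    cr = mtcrf(0x80, rx, cr)    # mtcrf  128,Rx
    rx = rlwinm(rx, 0, 4, 31)   # rlwinm Rx,Rx,0,4,31
    return rx, cr               # mtxer  Rx


xer, cr = 0xA000007F, 0x12345678
print([hex(v) for v in mcrxr(xer, cr)])
print([hex(v) for v in emulated_mcrxr(xer, cr)])
```

Note that in this model a CRM of 0 leaves the CR untouched, so the mask value matters.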

See http://www.macintouch.com/tiger20.html and, from Common Lisp, a pair of loops to time mcrxr against mtxer:
(in-package "CCL")

(defppclapfunction do-mcrxr ((n arg_z))
  loop
  (cmpwi :cr1 arg_z '1)
  (mcrxr 0)
  (subi arg_z arg_z '1)
  (bge :cr1 loop)
  (blr))

(defppclapfunction do-mtxer ((n arg_z))
  loop
  (cmpwi :cr1 arg_z '1)
  (mtxer rzero)
  (subi arg_z arg_z '1)
  (bge :cr1 loop)
  (blr))

;;; (time (do-mcrxr 100000000))
;;; (time (do-mtxer 100000000))

Original comment by classi...@floodgap.com on 5 Apr 2011 at 5:45

GoogleCodeExporter commented 9 years ago

Apple code implies that mtcrf is okay for individual CR fields, as long as each 
instruction moves only one field. From 
http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/bcopy.s

shortcopy:
            cmplw   r12,r5                      ; must move reverse if (dest-source)<length
            mtcrf   2,r5                        ; move length to cr6 and cr7 one at a time...
            mtcrf   1,r5                        ; ...which is faster on G4 and G5
            bge++   backend                     ; handle forward moves (most common case)
            add     r6,r6,r5                    ; point one past end of operands in reverse moves
            add     r4,r4,r5
            b       bbackend                    ; handle reverse moves

Although the modified code should work for G4/G3, we will keep mcrxr on those 
systems to reduce icache pressure.
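
For reference, the mtcrf operand is an 8-bit mask with one bit per CR field, so "one field at a time" means exactly one bit set. A small Python sketch that decodes the mask (the helper names are made up) agrees with the Apple comment that masks 2 and 1 are cr6 and cr7:

```python
def crm_fields(crm):
    """List the CR fields selected by an 8-bit mtcrf mask (0x80 = cr0, 0x01 = cr7)."""
    return [field for field in range(8) if crm & (0x80 >> field)]


def is_single_field(crm):
    """True when the mask moves exactly one CR field."""
    return len(crm_fields(crm)) == 1


for crm in (2, 1, 3, 128):
    print(crm, crm_fields(crm), is_single_field(crm))
```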

Original comment by classi...@floodgap.com on 5 Apr 2011 at 6:04

GoogleCodeExporter commented 9 years ago

BASE
Richards: 144
DeltaBlue: 210
Crypto: 107
RayTrace: 407
EarleyBoyer: 521
----
Score: 233

SunSpider now 1760

MTCRF (swapon)
Richards: 1551
DeltaBlue: 479
Crypto: 915
RayTrace: 351
EarleyBoyer: 478
----
Score: 648

MTCRF (swapoff)
Richards: 1560
DeltaBlue: 480
Crypto: 910
RayTrace: 352
EarleyBoyer: 479
----
Score: 649

MCRXR (swapon)
Richards: 679
DeltaBlue: 318
Crypto: 22.3
RayTrace: 344
EarleyBoyer: 461
----
Score: 238

We keep the swap. We lose the mcrxr for G5. Everybody wins.
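
The Score lines above are consistent with a geometric mean of the five sub-scores, which is why the single 22.3 Crypto result drags the MCRXR run down to 238 even with Richards at 679. A quick Python check, with `v8_score` as a made-up helper name:

```python
import math


def v8_score(subscores):
    """Geometric mean of the sub-scores."""
    logs = [math.log(s) for s in subscores]
    return math.exp(sum(logs) / len(logs))


# Sub-scores in the order Richards, DeltaBlue, Crypto, RayTrace, EarleyBoyer.
runs = {
    "BASE": [144, 210, 107, 407, 521],
    "MTCRF (swapon)": [1551, 479, 915, 351, 478],
    "MTCRF (swapoff)": [1560, 480, 910, 352, 479],
    "MCRXR (swapon)": [679, 318, 22.3, 344, 461],
}

scores = {name: round(v8_score(values)) for name, values in runs.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```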

VERIFIED

Original comment by classi...@floodgap.com on 8 Apr 2011 at 10:00

Changed state: Verified

classilla / tenfourfox

G5-specific nanojit profiling #23