classilla / tenfourfox

Mozilla for Power Macintosh.
http://www.tenfourfox.com/

nsTextFragmentVMX.cpp (and optimizations) #75

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Created in content/base/src, modeled on the SSE2 version. On the G5 (in reduced
mode), 6.0 before the fix yielded the following with the benchmark in
https://bugzilla.mozilla.org/show_bug.cgi?id=585978:

1382ms
612ms
614ms
674ms
658ms
562ms
575ms
564ms
587ms
585ms
560ms
584ms
572ms
587ms
589ms
579ms
590ms
562ms
615ms
592ms
570ms
578ms
561ms
616ms
561ms
565ms
559ms
571ms
585ms
569ms

Now,

1240ms
504ms
521ms
493ms
501ms
483ms
525ms
499ms
489ms
497ms
510ms
537ms
493ms
518ms
518ms
507ms
501ms
505ms
526ms
508ms
488ms
505ms
490ms
512ms
488ms
502ms
489ms
505ms
480ms
499ms

This is a solid improvement of 15%.

Original issue reported on code.google.com by classi...@floodgap.com on 10 Jul 2011 at 11:30

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 10 Jul 2011 at 11:30

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 11 Jul 2011 at 9:23

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 11 Jul 2011 at 9:24

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 13 Aug 2011 at 11:10

GoogleCodeExporter commented 9 years ago
Instead of loading an entire vector from memory, couldn't we do the following?

register const vector unsigned short gtcompare = vec_unpackl( vec_splat_s8( -1 ) );

This splats a signed char of -1 over a vector of signed chars, whose low bytes 
(0xFF) are then unpacked into a vector of signed shorts. Interpreted as a vector 
of unsigned shorts, this is a vector with elements of 0x00FF (= 255) each.

That should be even faster than the following:

const unsigned short __attribute__ ((aligned(16))) twofivefive = 255;
register const vector unsigned short gtcompare = vec_splat( vec_lde( 0, &twofivefive ), 0 );

Original comment by Tobias.N...@gmail.com on 12 Sep 2011 at 12:05

GoogleCodeExporter commented 9 years ago
That sounds like it should work, but I'll test it after 7 comes out.

Original comment by classi...@floodgap.com on 12 Sep 2011 at 6:56

GoogleCodeExporter commented 9 years ago
(reopened temporarily)

Original comment by classi...@floodgap.com on 12 Sep 2011 at 6:57

GoogleCodeExporter commented 9 years ago
It seems we have to use the following instead, since the above doesn't work:

register const vector unsigned short gtcompare = vec_mergel( vec_splat_s8( 0 ), vec_splat_s8( -1 ) );

Original comment by Tobias.N...@gmail.com on 19 Sep 2011 at 8:50

GoogleCodeExporter commented 9 years ago
On my system the benchmark doesn't show any difference between the three 
versions. It might be that my system is too slow...

Original comment by Tobias.N...@gmail.com on 21 Sep 2011 at 7:17

GoogleCodeExporter commented 9 years ago
Yeah, I don't see much difference either. Let's leave it, since it does work, 
at least. The compiler may be optimizing this for us, or at least optimizing it 
enough to make an imperceptible difference.

Original comment by classi...@floodgap.com on 27 Sep 2011 at 4:21

GoogleCodeExporter commented 9 years ago
Reopened until issue 84 is fixed.

Original comment by Tobias.N...@gmail.com on 10 Oct 2011 at 5:33

GoogleCodeExporter commented 9 years ago
Examined the code and didn't find a bug in the implementation, provided that 
vec_any_gt() and the code to align to a 16-byte boundary really work in every 
case.

Original comment by Tobias.N...@gmail.com on 11 Oct 2011 at 4:45

GoogleCodeExporter commented 9 years ago
Didn't find a bug, but instead found a little tweak that improves benchmark 
scores by 16% on a 1.5 GHz G4!

The improvement comes from passing the offset to the (constant) base buffer 
address directly to vec_ld instead of adding it to the buffer pointer itself 
on each loop iteration.

Here's the code:

  for(; i < vectWalkEnd; i += 16) {
    vect = vec_ld(i, (unsigned short *)(str)); // not ldl -- keep in L1
    if (vec_any_gt(vect, gtcompare))
      return PR_FALSE;
  }

Original comment by Tobias.N...@gmail.com on 11 Oct 2011 at 9:12

GoogleCodeExporter commented 9 years ago
I'll take the enhancement. It doesn't change the tables though?

I'm going to rebuild the G5 version tonight and dig into this more. Rewriting 
84 to indicate blocked on this instead of vice versa.

Original comment by classi...@floodgap.com on 12 Oct 2011 at 1:36

GoogleCodeExporter commented 9 years ago

Original comment by classi...@floodgap.com on 12 Oct 2011 at 1:37

GoogleCodeExporter commented 9 years ago
So, now why not initialize the vector like we do in issue 76?
Resulting code looks like this:

  register const vector unsigned short gecompare = vec_sl( vec_splat_u16( 1 ), vec_splat_u16( 8 ) );

  // Check one VMX register (16 bytes) at a time.
  // This is simpler on AltiVec and involves no mucking about with masks,
  // since the vec_any_gt intrinsic does exactly what we want.
  const PRInt32 vectWalkEnd = ((len - i) / numUnicharsPerVector) * numUnicharsPerVector;
  for(; i < vectWalkEnd; i += 16) {
    vect = vec_ld(i, (unsigned short *)(str)); // not ldl -- keep in L1
    if (vec_any_ge(vect, gecompare))
      return PR_FALSE;
  }

Original comment by Tobias.N...@gmail.com on 13 Oct 2011 at 9:14

GoogleCodeExporter commented 9 years ago
Why switch to _any_ge? FTR, I'm using your gtcompare from above and your loop 
in comment 18. I can't see much difference on the G5, but if it seems faster on 
your system, I'll still accept it.

Wolf-fencing the code shows it's not the AltiVec loop that's to blame, it's 
actually the word-by-word check that follows it. The mask needed to be flipped. 
It now works correctly and issue 84 is solved.

Original comment by classi...@floodgap.com on 14 Oct 2011 at 3:55

GoogleCodeExporter commented 9 years ago
That's great!

Original comment by Tobias.N...@gmail.com on 14 Oct 2011 at 6:19

GoogleCodeExporter commented 9 years ago
What must the mask look like to make it work correctly?
Why does it have to be changed at all? Since it goes byte by byte there 
shouldn't be any endian issues.

Original comment by Tobias.N...@gmail.com on 14 Oct 2011 at 3:15

GoogleCodeExporter commented 9 years ago
Well, now I understand the problem with the mask. Works OK here as well.

If we change to vec_any_ge we can use the faster vector initialization with 
vec_splat_* - the improvement is marginal.

The improvement with passing the offset directly to vec_ld comes from the fact 
that it saves the integer operation to add the index to the pointer.

Original comment by Tobias.N...@gmail.com on 16 Oct 2011 at 12:35

GoogleCodeExporter commented 9 years ago
I'm already using vec_splat (but your -1 version). So it's neither here nor 
there.

The mask issue was I believed the reinterpreted cast would be handed to the 
mask in the wrong endianness, so I flipped the mask initially. However, the 
cast is also big endian, so the mask was the wrong endianness instead. Flipping 
it back fixed the problem.

Original comment by classi...@floodgap.com on 17 Oct 2011 at 2:16

GoogleCodeExporter commented 9 years ago
Issue 84 has been merged into this issue.

Original comment by classi...@floodgap.com on 17 Oct 2011 at 2:21

GoogleCodeExporter commented 9 years ago
The code in comment 21 causes lots of bad UTF-8 characters in the suggestion 
box, so I reverted it.

Original comment by classi...@floodgap.com on 17 Oct 2011 at 5:26

GoogleCodeExporter commented 9 years ago
I'm also seeing some wrong characters.

So you are currently using "vec_mergel( vec_splat_s8( 0 ), vec_splat_s8( -1 ) 
);" together with vec_any_gt?
I imagine that vec_mergel might be faster than vec_sl, so that should be the 
better way to do a fast initialization.

I'll verify whether enhancing this with unaligned loads could bring some 
benefit.

Original comment by Tobias.N...@gmail.com on 17 Oct 2011 at 6:10

GoogleCodeExporter commented 9 years ago
I found the error in my above suggestions that caused the wrong characters. The 
code must look like this:

  const PRInt32 vectWalkEnd = ((len - i) / numUnicharsPerVector) * numCharsPerVector;
  PRInt32 i2 = i * (numCharsPerVector/numUnicharsPerVector);
  for(; i2 < vectWalkEnd; i2 += numCharsPerVector) {
    vect = vec_ld(i2, (unsigned short *)str); // not ldl -- keep in L1
    if (vec_any_gt(vect, gtcompare))
      return PR_FALSE;
  }
  i = i2 / (numCharsPerVector/numUnicharsPerVector);

I now enhanced this with loading from unaligned locations, and after looking 
more into the Apple docs about AltiVec I also enhanced the aligned loop to 
compare two vectors simultaneously. In theory, loading and comparing two 
independent vectors should be exactly as fast as loading and comparing just one.

I'll do some performance testing tomorrow.

Original comment by Tobias.N...@gmail.com on 17 Oct 2011 at 9:40

GoogleCodeExporter commented 9 years ago
Investigated further by benchmarking with different buffer sizes:
- a tuned version of this (I'll post the code here) is faster than the 
unvectorized code on a G4/500 (7410) for buffer sizes of 32 bytes and up

Original comment by Tobias.N...@gmail.com on 22 Oct 2011 at 9:57

GoogleCodeExporter commented 9 years ago
Investigated further by benchmarking with different buffer sizes:
- same behaviour with a 1.5 GHz G4 (7450) as on the G4/500 (7410)

Original comment by Tobias.N...@gmail.com on 22 Oct 2011 at 10:32

GoogleCodeExporter commented 9 years ago
The code seems to be fast, and the overhead needed to set up the AltiVec 
environment and load the vectors is quite small.

Original comment by Tobias.N...@gmail.com on 22 Oct 2011 at 11:06

GoogleCodeExporter commented 9 years ago
This seems safe, but since we know this code can introduce subtle bugs, I'm 
going to hold it until the 9 beta, since 8 RC is just around the corner.

Original comment by classi...@floodgap.com on 25 Oct 2011 at 1:22

GoogleCodeExporter commented 9 years ago
Here's the promised tuned version, which processes small buffers (<32 bytes) 
faster.
It also unrolls the loop a bit, which yields a major performance improvement.

Original comment by Tobias.N...@gmail.com on 25 Oct 2011 at 7:34

Attachments:

GoogleCodeExporter commented 9 years ago
Scheduling the new version for 9 now that 8 RC is in build.

Original comment by classi...@floodgap.com on 5 Nov 2011 at 5:04

GoogleCodeExporter commented 9 years ago
Issue 109 has been merged into this issue.

Original comment by classi...@floodgap.com on 20 Nov 2011 at 3:09

GoogleCodeExporter commented 9 years ago
We need to rewrite this as FirstNon8Bit for Fx9 (issue 109). Basically the 
same, but with a little extra logic to find bidi characters.

Original comment by classi...@floodgap.com on 20 Nov 2011 at 3:09

GoogleCodeExporter commented 9 years ago
#ifdef MOZILLA_MAY_SUPPORT_SSE2
namespace mozilla {
  namespace SSE2 {
    PRInt32 FirstNon8Bit(const PRUnichar *str, const PRUnichar *end);
  }
}
#endif

Original comment by classi...@floodgap.com on 20 Nov 2011 at 3:10

GoogleCodeExporter commented 9 years ago
Rewritten to match issue 109. Landed in 9 beta

Original comment by classi...@floodgap.com on 23 Nov 2011 at 6:17

GoogleCodeExporter commented 9 years ago
diff -r 6791db28b82f content/base/src/nsInProcessTabChildGlobal.cpp
--- a/content/base/src/nsInProcessTabChildGlobal.cpp    Wed Aug 31 10:11:17 2011 -0400
+++ b/content/base/src/nsInProcessTabChildGlobal.cpp    Tue Oct 25 21:32:08 2011 +0200
@@ -294,7 +294,9 @@
   nsContentUtils::XPConnect()->SetSecurityManagerForJSContext(cx, nsContentUtils::GetSecurityManager(), 0);
   nsContentUtils::GetSecurityManager()->GetSystemPrincipal(getter_AddRefs(mPrincipal));

-  JS_SetNativeStackQuota(cx, 128 * sizeof(size_t) * 1024);
+  //JS_SetNativeStackQuota(cx, 128 * sizeof(size_t) * 1024);
+  // Not enough.
+  JS_SetNativeStackQuota(cx, 32768 * 1024);

   JS_SetOptions(cx, JS_GetOptions(cx) | JSOPTION_JIT | JSOPTION_PRIVATE_IS_NSISUPPORTS);
   JS_SetVersion(cx, JSVERSION_LATEST);

Do you really need 32 MB of native stack space, _per context_ ?!

Original comment by magef...@gmail.com on 2 Dec 2011 at 8:54

GoogleCodeExporter commented 9 years ago
Try backing it out and then run SunSpider, and see what happens. :P (Please 
follow up in issue 114 if you think a different solution applies.)

Original comment by classi...@floodgap.com on 3 Dec 2011 at 12:21

GoogleCodeExporter commented 9 years ago
#define CheckForASCII                                           \
      vect = vec_ld(i2, (unsigned short *)str);                 \
      if (vec_any_gt(vect, gtcompare))                          \
        return i2;                                      \
      i2 += numCharsPerVector;                                  \
      if (!(i2 < vectWalkEnd))                                  \
        break;

Honestly, I don't think returning i2 is correct. I'm quite sure it must be "i2 
/ (numCharsPerVector/numUnicharsPerVector)"

And there's a chance that this will fix the problem with issue 76.

Original comment by Tobias.N...@gmail.com on 8 Dec 2011 at 8:43