Closed GoogleCodeExporter closed 9 years ago
Original comment by classi...@floodgap.com
on 10 Jul 2011 at 11:30
Original comment by classi...@floodgap.com
on 11 Jul 2011 at 9:23
Original comment by classi...@floodgap.com
on 11 Jul 2011 at 9:24
Original comment by classi...@floodgap.com
on 13 Aug 2011 at 11:10
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
Instead of loading an entire vector to memory couldn't we do the following?
register const vector unsigned short gtcompare = vec_unpackl( vec_splat_s8( -1
) );
This splats a signed int of -1 over a vector of signed ints whose low byte
(0xFF) is then unpacked into a vector of signed shorts. Interpreted as a vector
of unsigned shorts this is a vector with elements of 0x00FF (= 255) each.
That should be even faster than the following:
const unsigned short __attribute__ ((aligned(16))) twofivefive = 255;
register const vector unsigned short gtcompare = vec_splat( vec_lde( 0,
&twofivefive ), 0 );
Original comment by Tobias.N...@gmail.com
on 12 Sep 2011 at 12:05
That sounds like it should work, but I'll test it after 7 comes out.
Original comment by classi...@floodgap.com
on 12 Sep 2011 at 6:56
(reopened temporarily)
Original comment by classi...@floodgap.com
on 12 Sep 2011 at 6:57
It seems we have to use the following since the above doesn't work:
register const vector unsigned short gtcompare = vec_mergel( vec_splat_s8( 0 ),
vec_splat_s8( -1 ) );
Original comment by Tobias.N...@gmail.com
on 19 Sep 2011 at 8:50
On my system the benchmark doesn't show any difference between the three
versions. Might be my system is too slow...
Original comment by Tobias.N...@gmail.com
on 21 Sep 2011 at 7:17
Yeah, I don't see much difference either. Let's leave it, since it does work,
at least. The compiler may be optimizing this for us, or at least optimizing it
enough to make an imperceptible difference.
Original comment by classi...@floodgap.com
on 27 Sep 2011 at 4:21
Reopened until issue 84 is fixed.
Original comment by Tobias.N...@gmail.com
on 10 Oct 2011 at 5:33
Examined the code and didn't find a bug in the implementation - provided
vec_any_gt() and the code to align to a 16 byte boundary really works in every
case.
Original comment by Tobias.N...@gmail.com
on 11 Oct 2011 at 4:45
Didn't find a bug but instaead a little tweak that improves benchmark scores by
16% on a G4 1,5 GHz !!!
The improvement comes by passing the offset to the (constant) base buffer
address directly to vec_ld instead of adding it to the buffer pointer itself
each loop iteration.
Here's the code:
for(; i < vectWalkEnd; i += 16) {
vect = vec_ld(i, (unsigned short *)(str)); // not ldl -- keep in L1
if (vec_any_gt(vect, gtcompare))
return PR_FALSE;
}
Original comment by Tobias.N...@gmail.com
on 11 Oct 2011 at 9:12
I'll take the enhancement. It doesn't change the tables though?
I'm going to rebuild the G5 version tonight and dig into this more. Rewriting
84 to indicate blocked on this instead of vice versa.
Original comment by classi...@floodgap.com
on 12 Oct 2011 at 1:36
Original comment by classi...@floodgap.com
on 12 Oct 2011 at 1:37
So, now why not initialize the vector like we do in issue 76 ?
Resulting code looks like this:
register const vector unsigned short gecompare = vec_sl( vec_splat_u16( 1 ), vec_splat_u16( 8 ) );
// Check one VMX register (16 bytes) at a time.
// This is simpler on AltiVec and involves no mucking about with masks,
// since the vec_any_gt intrinsic does exactly what we want.
const PRInt32 vectWalkEnd = ((len - i) / numUnicharsPerVector) * numUnicharsPerVector;
for(; i < vectWalkEnd; i += 16) {
vect = vec_ld(i, (unsigned short *)(str)); // not ldl -- keep in L1
if (vec_any_ge(vect, gecompare))
return PR_FALSE;
}
Original comment by Tobias.N...@gmail.com
on 13 Oct 2011 at 9:14
Why switch to _any_ge? FTR, I'm using your gtcompare from above and your loop
in comment 18. I can't see much difference on the G5, but if it seems faster on
your system, I'll still accept it.
Wolf-fencing the code shows it's not the AltiVec loop that's to blame, it's
actually the word-by-word check that follows it. The mask needed to be flipped.
It now works correctly and issue 84 is solved.
Original comment by classi...@floodgap.com
on 14 Oct 2011 at 3:55
That's great!
Original comment by Tobias.N...@gmail.com
on 14 Oct 2011 at 6:19
What must the mask look like to make it work correctly?
Why does it have to be changed at all? Since it goes byte by byte there
shouldn't be any endian issues.
Original comment by Tobias.N...@gmail.com
on 14 Oct 2011 at 3:15
Well, now I understand the problem with the mask. Works OK here as well.
If we change to vec_any_ge we can use the faster vector initialization with
vec_splat_* - the improvement is marginal.
The improvement with passing the offset directly to vec_ld comes from the fact
that it saves the integer operation to add the index to the pointer.
Original comment by Tobias.N...@gmail.com
on 16 Oct 2011 at 12:35
I'm already using vec_splat (but your -1 version). So it's neither here nor
there.
The mask issue was I believed the reinterpreted cast would be handed to the
mask in the wrong endianness, so I flipped the mask initially. However, the
cast is also big endian, so the mask was the wrong endianness instead. Flipping
it back fixed the problem.
Original comment by classi...@floodgap.com
on 17 Oct 2011 at 2:16
Issue 84 has been merged into this issue.
Original comment by classi...@floodgap.com
on 17 Oct 2011 at 2:21
The code in comment 21 causes lots of bad UTF-8 characters in the suggestion
box, so I reverted it.
Original comment by classi...@floodgap.com
on 17 Oct 2011 at 5:26
I'm also seeing some wrong characters.
So you currently are using "vec_mergel( vec_splat_s8( 0 ), vec_splat_s8( -1 )
);" together with vec_any_gt?
I imagine that vec_mergel might be faster than vec_sl so that should be the
better way to do a fast initialization.
I'll verify whether enhancing this with unaligned loads could bring some
benefit.
Original comment by Tobias.N...@gmail.com
on 17 Oct 2011 at 6:10
[deleted comment]
I found the error in my above suggestions that caused the wrong characters. The
code must look like this:
const PRInt32 vectWalkEnd = ((len - i) / numUnicharsPerVector) * numCharsPerVector;
PRInt32 i2 = i * (numCharsPerVector/numUnicharsPerVector);
for(; i2 < vectWalkEnd; i2 += numCharsPerVector) {
vect = vec_ld(i2, (unsigned short *)str); // not ldl -- keep in L1
if (vec_any_gt(vect, gtcompare))
return PR_FALSE;
}
i = i2 / (numCharsPerVector/numUnicharsPerVector);
I now enhanced this with loading from unaligned locations and after looking
more into the Apple docs about Altivec I also enhanced the aligned loop to
compare two vectors simulaneously. In theory loading and comparing two
independent vectors should be exactly as fast as loading and comparing just one.
I'll do some performance testing tomorrow.
Original comment by Tobias.N...@gmail.com
on 17 Oct 2011 at 9:40
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
Investigated further by benchmarking with different buffer sizes:
- a tuned version of this (I'll post the code here) is faster then the
unvectorized code on a G4/500 (7410) from a buffer size of 32 bytes on
Original comment by Tobias.N...@gmail.com
on 22 Oct 2011 at 9:57
Investigated further by benchmarking with different buffer sizes:
- same behaviour with a G4/1,5GHz (7450) as on the G4/500 (7410)
Original comment by Tobias.N...@gmail.com
on 22 Oct 2011 at 10:32
The code seems to be fast and the overhead needed to set up the altivec
environment and loading the vectors is quite small.
Original comment by Tobias.N...@gmail.com
on 22 Oct 2011 at 11:06
This seems probably safe but since we know this code can introduce subtle bugs,
I'm going to stall on it until the 9 beta since 8 RC is just around the corner.
Original comment by classi...@floodgap.com
on 25 Oct 2011 at 1:22
Here the promised tuned version which processes small buffers (<32 bytes)
faster.
It also unrolls the loop a bit which results in major performance.
Original comment by Tobias.N...@gmail.com
on 25 Oct 2011 at 7:34
Attachments:
Scheduling new version for 9 now that 8 RC is in build
Original comment by classi...@floodgap.com
on 5 Nov 2011 at 5:04
Issue 109 has been merged into this issue.
Original comment by classi...@floodgap.com
on 20 Nov 2011 at 3:09
We need to rewrite this as FirstNon8Bit for Fx9 (issue 109). Basically the
same, but a little extra bit to find bidi.
Original comment by classi...@floodgap.com
on 20 Nov 2011 at 3:09
#ifdef MOZILLA_MAY_SUPPORT_SSE2
namespace mozilla {
namespace SSE2 {
PRInt32 FirstNon8Bit(const PRUnichar *str, const PRUnichar *end);
}
}
#endif
Original comment by classi...@floodgap.com
on 20 Nov 2011 at 3:10
Rewritten to match issue 109. Landed in 9 beta
Original comment by classi...@floodgap.com
on 23 Nov 2011 at 6:17
diff -r 6791db28b82f content/base/src/nsInProcessTabChildGlobal.cpp
--- a/content/base/src/nsInProcessTabChildGlobal.cpp Wed Aug 31 10:11:17 2011
-0400
+++ b/content/base/src/nsInProcessTabChildGlobal.cpp Tue Oct 25 21:32:08 2011
+0200
@@ -294,7 +294,9 @@
nsContentUtils::XPConnect()->SetSecurityManagerForJSContext(cx, nsContentUtils::GetSecurityManager(), 0);
nsContentUtils::GetSecurityManager()->GetSystemPrincipal(getter_AddRefs(mPrincipal));
- JS_SetNativeStackQuota(cx, 128 * sizeof(size_t) * 1024);
+ //JS_SetNativeStackQuota(cx, 128 * sizeof(size_t) * 1024);
+ // Not enough.
+ JS_SetNativeStackQuota(cx, 32768 * 1024);
JS_SetOptions(cx, JS_GetOptions(cx) | JSOPTION_JIT | JSOPTION_PRIVATE_IS_NSISUPPORTS);
JS_SetVersion(cx, JSVERSION_LATEST);
Do you really need 32 MB of native stack space, _per context_ ?!
Original comment by magef...@gmail.com
on 2 Dec 2011 at 8:54
Try backing it out and then run SunSpider, and see what happens. :P (Please
follow up in issue 114 if you think a different solution applies.)
Original comment by classi...@floodgap.com
on 3 Dec 2011 at 12:21
#define CheckForASCII \
vect = vec_ld(i2, (unsigned short *)str); \
if (vec_any_gt(vect, gtcompare)) \
return i2; \
i2 += numCharsPerVector; \
if (!(i2 < vectWalkEnd)) \
break;
Honestly, I don't think returning i2 is correct. I'm quite sure it must be "i2
/ (numCharsPerVector/numUnicharsPerVector)"
And there's a chance that this will fix the problem with issue 76 .
Original comment by Tobias.N...@gmail.com
on 8 Dec 2011 at 8:43
Original issue reported on code.google.com by
classi...@floodgap.com
on 10 Jul 2011 at 11:30