Closed arlolra closed 7 years ago
Is this really faster? The mind boggles. Regexps are usually a lot faster than char-by-char conversions, and we'd expect all-lowercase strings to be passed through unmodified (while in your patch they are rebuilt character-by-character even if there weren't any uppercase letters to begin with).
What about alternatives like:
function toASCIILowercase(s) {
if (!/[A-Z]/.test(s)) return s;
// whatever
}
and/or
function toASCIILowercase(s) {
return s.replace(/[A-Z]/g, function(c) { return String.fromCharCode(c.charCodeAt(0) + 32); });
}
?
Is this really faster?
Anecdotally, it seems to be.
The way I tested was with, node bin/parse --page Minneapolis --trace time < /dev/null > /dev/null
and looked at "dpLoader".
Baseline | ~470 |
This patch | ~425 |
First suggestion above (w/ baseline) | ~420 |
First suggestion above (w/ this patch) | ~415 |
Second suggestion above | ~480 |
Yeah, I went with "first suggestion above (w/ baseline)" in https://github.com/cscott/domino/commit/5573579bf54d1b1112429f6021fb3f297bbbbb5d since it seemed to have the same performance as the more-evil options I tried, without impacting code readability. (That's not official, that's just on my "random performance experiements" branch. https://github.com/cscott/domino/commit/08ee77fcbff4dac58f20ad224f8763af23183a07 also seemed to help.)
Sounds good to me.
Some performance numbers. Benchmark is dpLoader time on [[User:Acer/Simple4]]
, measured with:
node bin/parse.js --trace time --page "User:Acer/Simple4" --replay
Variant | Run 1 | Run 2 | Run 3 | Avg |
---|---|---|---|---|
both | 3547 | 3588 | 3550 | 3561 |
avoid calling toASCIILowerCase | 4055 | 3854 | 3862 | 3923 |
speed up toASCIILowerCase | 3472 | 3171 | 3685 | 3442 |
baseline | 3922 | 4004 | 4051 | 3992 |
The "avoid calling" patch seemed to help in other testing, though; perhaps dpLoader is just not exercising that code path. I need to look into that further.
/cc @cscott