[List] Several HUGE performances optimisations.

denis-migdal commented 9 months ago

Summary ========

I can't do

[ ] [2270] Easy optimizations on DEBUG information for quicker transpilation and runtime speed.
[ ] [2269] Implements str as string with prototype replacement.

Quite quick

[ ] [2275] Improve performances upon fct call
[ ] [2274] Optimise conditions transcription in AST.
[ ] [2260] -48% speed performances on Brython function calls (more info)
[ ] [2260] non-preallocated array, -18% speed performances (+reduce memory usage) (more info)
[x] [2260] check if float -6% speed performances (+ alternative more readable) (more info)

Search in code

[ ] [2268] Use BigInt64Array instead of BigInt (more info)
[ ] [2267] in py_str.js : use Javascript methods... (more info)
[ ] [2266] Factorize checks for better readability/maintainability/performances/and reduction of code size (more info)
[ ] [2266] checks-mode=unsafe option to speed up runtime time by removing all checks. Checks are 28% slower. (more info)
[ ] [2260] -42% and -95% speed performances in JS template literal (more info)
[ ] [2260] Search for non-preallocated array, + toString().indexOf('.') (replace by %1).
[ ] [2260] Use ES6 Set in py_set.js ?
[ ] [2260] Use string encode/decode in bytes.js ?
[ ] [2249] Cache wrappers potentially huge perf increases in execution and memory usage + would fix bugs (more info, +below)
[ ] [2260] Optimize functions calls/rewritting 48% slower (more info)

DONE

[X] [2260] -34% speed performances on Brython startup (more info)

===================================================================================

Hi,

I noticed some parts of Brython code were like that :

x = `First line\n` +
`Second line\n` +
`Third line`

However, it seems 42% slower compared to simply doing :

x = `First line
Second line
Third line`

I'd advise to search for all \n` in Brython code, remove the "+", and replace them by a new line.

Also, in some parts there is :

x+= "eee\n"
x+= "eee\n"
x+= "eee\n"

That seems 95% slower.

I'd advise to search for all += " in Brython code, remove the "+=", and replace them by a string template.

Cordially,

denis-migdal commented 9 months ago

[DONE]

Found a part of the code responsible for more than 34% of CPU time (66ms/163ms), even when the Brython code is empty :

$B.unicode[gc].forEach(function(item){if(Array.isArray(item)){var step=item[2]||1
for(var i=0,nb=item[1];i < nb;i+=1){$B.unicode_tables[gc][item[0]+i*step]=true}}else{$B.unicode_tables[gc][item]=true}}

Some things that might increase performances :

all x+=1 should be replaced by ++x (faster). (a search for +=1 and += 1 should help identify them.
$B.unicode_tables[gc] pre-compute it before the forEach ?
not sure, but for(var i = 0, max=item[1]*step; i < max ; i+= step) { ...[i] = true } ?
replace forEach(), it seems slow
replace Array.isArray ?

So something like (?) :

var ugc = $B.unicode_tables[gc];
for(let x = 0; x < ugc.length; ++x) {
     var item = ugc[x];
     var max = item[1];
     if( max === undefined) { // not an array ?
          ugc[item] = true
     } else {
          var i = item[0]
          var step = item[2]||1;
          max = max * step + i
          for( ;i < max;i+=step)
                ugc[i] = true
     }
}

EDIT: I don't really get what this function is doing, but I guess precomputing everything and storing the result in brython.js isn't possible (brython.js would then be too big) ?

JavaScript has some functions to handle unicode : fromCharCode and charCodeAt, but I assume it isn't what you are doing ?

denis-migdal commented 9 months ago

Brython functions call are converted to (according to documentation : https://github.com/brython-dev/brython/wiki/How-Brython-works ):

f.apply(null, [1].concat(list(t)).concat([{$nat: "kw", kw:[{x: 2}, d]}))

However, it is 48% slower compared to :

f(1, ...list(t), {$nat: "kw", kw:[{x: 2}, d]});

EDIT: is there also a reason to not do : {$nat: "kw", kw:{x: 2, ...d}} or even directly {$nat: {x: 2, ...d}} ?

denis-migdal commented 9 months ago

Also, for potential execution speed gain, see also :

[2249] Cache wrappers (using WeakMap and WeakRef) created by Brython to prevent recreating them at each access (https://github.com/brython-dev/brython/issues/2249#issuecomment-1741802743)

denis-migdal commented 9 months ago

Tested the performance of WeakMap :

For a jsobj2pyobj() doing almost nothing (so not quite a fair comparison) :

using cache is at most 30% slower - if obj exists in cache
using cache is at most 15% slower - if obj doesn't exists in cache

So quite a small overhead considering that my jsobj2pyobj() is doing almost nothing

Also for :

array with 2 elements, jsobj2pyobj is at least 66% slower when the array is in cache.
a function with 2 parameters, jsobj2pyobj is at least 20% slower when the function is in cache.

Imma test other things I saw too.

denis-migdal commented 9 months ago

jsobj.toString().indexOf('.') == -1 is 6% slower compared to jsobj % 1 === 0 . Note that the latter is quicker AND more readable.

Also, you can pre-allocate arrays (-18% perfs) both in jsobj2pyobj() and pyobj2jsobj() (and you seem to do that a lot):

 var args = []
            for(var i = 0, len = arguments.length; i < len; i++){
                args.push(pyobj2jsobj(arguments[i]))
            }

 var args = new Array(arguments.length)
            for(var i = 0, len = arguments.length; i < len; i++){
                args[i] = pyobj2jsobj(arguments[i])
            }

EDIT : if you don't know the final length, you can also do :

 var args = new Array(arguments.length)
var size = 0
            for(var i = 0, len = arguments.length; i < len; i++){
                if(???) args[size++] = pyobj2jsobj(arguments[i])
            }
args.length = size

denis-migdal commented 9 months ago

Now that ES6 is out since 2015, I think all browser now implements it. So would relying more on ES6 "new" classes also improve performances ?

JS Set() for py_set is possible and/or could boost the performances (as well as reducing the code size).
JS decode/encode for "UTF8" conversions in py_bytes

PierreQuentel commented 8 months ago

Found a part of the code responsible for more than 34% of CPU time (66ms/163ms), even when the Brython code is empty :

$B.unicode[gc].forEach(function(item){if(Array.isArray(item)){var step=item[2]||1
for(var i=0,nb=item[1];i < nb;i+=1){$B.unicode_tables[gc][item[0]+i*step]=true}}else{$B.unicode_tables[gc][item]=true}}

Wow, this is amazing ! How did you find it ?

In the commit referenced above I have modified a similar code for Unicode tables XID_Start and XID_Continue: I have completely removed the lines in unicode_data.js

for(const key in $B.unicode_identifiers){
    $B.unicode_tables[key] = {}
    for(const item of $B.unicode_identifiers[key]){
        if(Array.isArray(item)){
            for(var i = 0; i < item[1]; i++){
                $B.unicode_tables[key][item[0] + i] = true
            }
        }else{
            $B.unicode_tables[key][item] = true
        }
    }
}

and added functions $B.is_XID_Start(codepoint) and $B.is_XID_Continue(codepoint) in python_tokenizer.js that check if a codepoint is in one of these categories by searching directly (by bisection) in the table $B.unicode_identifiers.

As far as I could test, this reduces startup time by around 200ms and has no impact on execution speed.

I am pretty sure we can do the same for the lines you mention.

PierreQuentel commented 8 months ago

JS Set() for py_set is possible and/or could boost the performances (as well as reducing the code size).

I wish it were possible, but I am not sure. Javascript Set() covers most of the features of Python set but in some cases I'm afraid it will be hard to get the same result, for instance with this test (in test_set.py):

class A:
  def __init__(self, x):
    self.x = x
  def __eq__(self, other):
    return self.x
  def __hash__(self):
    return hash('a')

set1 = {'a', 'b', 'c', 0, 1, 2, 1.5, 2.5, 3.5}

assert A(True) in set1

The assertion is true because A(True) has the same hash as and compares equal to 'a'...

denis-migdal commented 8 months ago

Wow, this is amazing ! How did you find it ?

In browser's development tools, you have a "Performance tab". I am not sure how to use it in Firefox, but on Chromium :

Ctrl+Maj+I
Performance tab
Either record or reload (depending whether you want to record the page load or the execution after e.g. clicking a button)
Once done, click on "stop"
Click on the "Bottom-up" tab (Call tree works too)
Sort by "Total time"
Then in "Activity", click on the triangles to find the functions that has the higher %.

If you want to test small parts of code, website like JSPerf are quite useful. Careful, as performances can be very different between Chromium and Firefox.

But some stuff are quite generic, e.g. :

pre-allocate vs push.

Push will pre-allocate e.g. 4 spaces, then once you insert the 5th element, it'll allocate the double. So for big arrays you'll get reallocations (and possibly copy of the memory), which is very costly (this is a system call). Performances gains can be really enormous for big arrays, (e.g. x50/x100 perfs ?).

As far as I could test, this reduces startup time by around 200ms and has no impact on execution speed.

I am pretty sure we can do the same for the lines you mention.

Maybe, yes. Though, I am not quite sure to get what you do exactly with unicode (as JS has support for encoding/decoding).

denis-migdal commented 8 months ago

The assertion is true because A(True) has the same hash as and compares equal to 'a'...

Hmmm....

In this case, why not using a JS Map, putting the hash as the key, and stored objects (an array) as a value ?

Then, to test if an object is in the Set, find the elements having the same hash 0(1), then compare them with __eq__ 0(n) (but you shouldn't have too much objects with the same hash) ?

EDIT: For set operations, maybe building a Set of hashes could speed up or simplify some operations implementations ?

denis-migdal commented 8 months ago

As far as I could test, this reduces startup time by around 200ms and has no impact on execution speed.

One think that is interesting to do, is, in optimization commits, putting the relative and absolute time gain : e.g. "-22% startup time (300ms -> 200ms)", so that you can :

better communicate the performance gain as the absolute execution speed depends on the computer used, and that the absolute gain doesn't tell much : e.g. a gain of 200ms on a 1h process is bad (nearly 0% gain) while a gain of 200ms on a process of 202ms is immense (-99%).
you can then make a summary of the gain more easily in the versions notes / changelogs / whatever.
identify the "performances" of each optimization made (this can be useful is you need one day to revert something for X or Y reason).

denis-migdal commented 8 months ago

Wow, this is amazing ! How did you find it ?

In browser's development tools, you have a "Performance tab". I am not sure how to use it in Firefox, but on Chromium :

In Chromium you also have a tab "Lighthouse", that have at the top right "Analyze page load". Note that it doesn't work with "file://" (so you need a local server), and you need to have some content in the page body.

The goal is to have "100%" in the Performance score. With your recent change, it went from "54%" to "58%?" when you added support for defer (from memory), and now it is at 88% with your recent commit (83% without defer). That is a huge increase ^^.

You still have 450ms of blocking time so still have a margin of progress I assume.

Though I use "browser-sync" to have my local server, so it hurts a little the performances.

denis-migdal commented 8 months ago

You also have in Chromium other tabs :

Memory : to see the RAM your webpage consume. Though I'm not quite familiar with it.
Network : to see the HTTP exchanges between the web page and external resources.

PierreQuentel commented 8 months ago

Though, I am not quite sure to get what you do exactly with unicode (as JS has support for encoding/decoding).

It is used internally in various places, for instance to check valid identifiers as described here, to support code like

# non-ASCII variable names
donnée = 10
машина = 9
ήλιος = 4

assert donnée + машина + ήλιος == 23

# Korean
def 안녕하세요():
    return "hello"

assert 안녕하세요() == "hello"

denis-migdal commented 8 months ago

Though, I am not quite sure to get what you do exactly with unicode (as JS has support for encoding/decoding).

It is used internally in various places, for instance to check valid identifiers as described here, to support code like

Ouch... It seems to be a very inefficient way of doing it.

By default, JS strings are in UTF16, so you should be able to do that very efficiently by verifying if the char code is between given ranges. If you have lot of ranges you may even have some kind of binary search of ranges (maybe in a form of a binary decision tree - that you could even somehow flatten to an array just by playing with the indexes ?). If you have ranges with "holes" you can have some "exception ranges". And you can also detect some patterns with "%", Math.floor(x/Z), etc.

I think you have ways to really speed up this process.

Moreover i think I saw the code for unicode takes ~100kB, so you may even reduce brython.js size by 10% (as well as downloading and JS interpretation time) ?

If you have a different encoding, you can first convert them :

var decoder = new TextDecoder('utf-8'), // utf8 to utf16
decodedMessage = decoder.decode(texte);

var encoder = new TextEncoder('utf-8'), // utf16 to utf8
encodedMessage = encoder.encode(texte);

From that you can also convert it to bytes, Uint8Array etc.

Performance wise :

TextDecoder is 33% faster than doing it by hand.
TextEncoder seems very fast (2ms vs 447ms) ?

// code stolen from https://gist.github.com/fabiospampinato/014d15872e2129774ae23783bd377ad2
const content = 'e'.padEnd(2000000, 'z')

const charCodesAt = new Array ( content.length );
console.time('charCodeAt');
for ( let i = 0, l = content.length; i < l; i++ ) {
  charCodesAt[i] = content.charCodeAt ( i );
}
console.timeEnd('charCodeAt');

const encoder = new TextEncoder ();
console.time('TextEncoder');
encoder.encode ( content )
console.timeEnd('TextEncoder');

console.log ( content.length );

denis-migdal commented 8 months ago

It seems you should use Regex :

it seems 100% faster.
it render your code more readable and would reduce its size.

JS Regex can be unicode aware.

cf https://stackoverflow.com/questions/9862761/how-to-check-if-character-is-a-letter-in-javascript

RegExp(/^\p{L}/,'u').test(str)

See how to use unicode classes : https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class_escape

denis-migdal commented 8 months ago

Okay, so performance wise, it seems that :

Regex are the fastest (compare charCode is -50%)
Comparing chars directly is very slightly faster than comparing charCode.
Compare charCode is the slowest (-2%), and even slower when accessed through a precomputed array (-20%).

You can type unicode char in JS : "\u006F" or copy directly the unicode character "o" as it is supported.

PierreQuentel commented 8 months ago

More and more amazing ! I didn't know about this RegExp feature, /\p{Letter}/u. Yes, at first sight it could replace the use of Unicode tables and save quite a few kB in brython.js. Thanks !

PierreQuentel commented 8 months ago

In the commit above I have adapted the scripts that used the Unicode tables wherever I could. unicode_data.js is reduced to 3kB for cases where I couldn't find Javascript solutions (getting the integer value of digits in non-latin alphabets for instance).

Thanks again !

denis-migdal commented 8 months ago

Wow, you jumped from a performance score of 88% last time to 97% !!!! that is starting to be good ^^. You went from 163ms load time to 22ms !!!.

That's... enormous ^^.

I think now a big part of the load performances is due to brython.js size, but I think it can still be reduced by factorizing checks (how much I don't know though), but should also helps compression.

Maybe $B.$make_ext and run_script could still be optimized (for at most a 10% execution speed gain). But I don't think it'll be quite interesting. I think next step would be to accelerate Brython execution time, for that there are generally 3 ways :

reducing code size, either by factorizing or relying more on JS features (code would be more readable and maintainable).
some general optimizations that could lead to very very small or big execution speed increase, that might depends on what is executed. Generally we avoid doing it (time consuming for sometime very little gain), but as Brython contains code that we would generally avoid, it'd be good to clean it a little.
in the benchmarks profiling to see what takes the most time (but maybe more at the end).

I found lot of things to optimize for code size and performances, that'd be quite promising at the end ^^. We'll have to compare the benchmarks at the end xD.

denis-migdal commented 8 months ago

Still thinking about JS <=> Brython conversions...

Brython => JS : easy
- if number/boolean/null/undefined/? : returns it.
- if( '$jsobj in elem) return elem.$jsobj O(1), really big speed increase.
- if doesn't exist build it, and set elem.$jsobj : should not occurs (?), would be a very small additional cost that'd pay off quite quickly.
JS => Brython : needs more thinking
- if number/boolean/null/undefined/? : returns it or its base structure
- search it in a WeakMap : costly in memory and execution speed. BUT potentially HUGE memory gain when objects are converted/accessed several time.
  - if exists, returns the value : would still cost more excepts for non-empty/"big" arrays/functions/objects,
  - BUT REQUIRED to solve circular references : so this is a cost you need to have.
  - if doesn't exists, build it and insert it in the WeakMap : so ofc costs more, except if the same object is present several times in an array/object values.
  - BUT REQUIRED to solve circular references : so this is a cost you need to have.
Lazy JS =>Brython : convert it only upon first access (mainly solves Brython<=>JS sync issues)
- would introduce an additional cost (just a "if"... so almost nothing + a stored reference)
- would prevents converting big objects/arrays at once, so huge performance (speed and memory) gain if all properties (and its descendant properties) aren't used (which should be the case most of the time).
- would allow better JS => Brython variable synchronization.
- so this is a costs you have to have ?
- BUT this needs to use functions to access to properties, e.g. at() for arrays, a getAttr() for objects that can reduce access speed.
- but maybe you already have mecanisms like that ?

Benchmarks :

Lazy is 70% slower when accessing arrays... so you'll gain performances depending on whether you have more access than conversions.
- but I think you need it to handle JS => Brython synchronization (e.g. if the value is modified in JS)... so this is a HUGE cost, but this is a required cost ?
- also, if upon access you verify the index type and range, (like I think Array.at() is doing), you are already nearly 30% slower (so with both, you'll be at most 100% slower ?).
- maybe having 2 classes : one for JS => Brython that'd be lazy. But once deep or shallow cloned, convert it all (so remove the 70% overcost). BUT that means that it'd must then be forbidden to give it back to JS... could be used internally when the object is temporary, with lot of accesses, and won't be used/modified outside ? E.g. a : $buildTempShallowClone() and a $getTempDeepClone() ???
- OR an adaptive strat :
  - JS => Brython = lazy, a priori we may not access all elements, and JS might modify its content at any time (so we have sync issues).
  - built in Brython / using Brython API : direct access until given to JS (in such case => becomes lazy) = no sync issues. A priori we'd likely access all elements :
  - e.g. obj.at = DIRECT_ACCESS at init, and obj.at = LAZY_ACCESS when given to JS, then use obj.at(idx) to access indexes.
  - in pyObj2jsObj() change the obj.at (i.e. when giving it to JS)
  - internally, in Python code, having a flag to overwrite, pyObj2jsObj() modification of obj.at when calling JS functions we know won't store or modify the array ?
  - $buildTempShallowClone() and a $getTempDeepClone() internally when we think access time > conversion time ? [but I'd assume conversion time >> access time in most of cases ?] so the access 100% overcost might be nothing compared to conversion time. So might be a very niche usage when doing intensive accesses ?

let array2 = array.map( e => { return {obj: e} });

at = (array, idx) => {
    let val = array2[idx];
    if( val !== undefined && val.obj === array[idx])
        return val
    // we should make the lazy conversion here
    // but that wouldn't be fair.
    throw 'not implemented'
};

Imma checks for objects too.

denis-migdal commented 8 months ago

Okay, good news for objects, the overcost seems quite limited (2% slower to 10% slower). The difference seems mainly due to browser runtime optimizations I assume.

So a lazy strategy might not be an issue for objects. I think you already use something for objects, so it shouldn't cost more for objects.

brython-dev / brython

[List] Several HUGE performances optimisations. #2260

[DONE]