brython-dev / brython

Brython (Browser Python) is an implementation of Python 3 running in the browser
BSD 3-Clause "New" or "Revised" License

[OPTIMISATION] Optimisations for DEBUG information (easily done, big performance gains) #2270

Open denis-migdal opened 9 months ago

denis-migdal commented 9 months ago

Hi,

I suggest several small optimizations for the debug information that Brython inserts into the generated JS code:

Interesting optimisations

Please also look at

A. Use constants as indexes into an array, instead of inline arrays [NO]

Please also look at :

Currently, an array describing the position of the "executed" Python code is given as the last argument of the underlying JS functions, e.g.:

$B.$call($B.$getattr_pep657(locals___main__.a, 'replace', [0, 0, 9]), [2, 2, 14])('e')

This has the disadvantage of potentially building an array each time the line is executed, and of taking more space in the generated JS code.

Therefore, I suggest using globally defined constants, defined at transpilation time. This would increase execution performance while reducing the size of the generated code:

DEBUG_SYMBOLS = { // see suggestion B. for a better efficient way.
    1: [0,0,9],
    2: [2,2,14]
}

The code would then be written :

$B.$call($B.$getattr_pep657(locals___main__.a, 'replace', 1), 2)('e')

This would add an insignificant cost when printing error messages, i.e. the DEBUG_SYMBOLS[the_constant] lookup. But it doesn't matter as :

B. Build the constant as a mathematical expression from the 3 numbers [a,b,c].

In the previous suggestion, we used an array to convert a constant into the array containing the information required for debugging. But a JS number has 53 usable bits.

These can be split into 3x16 bits, i.e. 3 numbers in the range [0, 65535]. I'd argue that no Python line should ever be that long (PEP 8 recommends lines of at most 79 characters). At transpilation time, we would simply do :

NB_BITS = 16; // globally defined
// (toString(16) already produces the hex digits, so no DIGIT table is needed)

// given [a, b, c] ; note : << only handles 32 bits, so use multiplication for the high part
CSNT = "0x" + (a * 2 ** (NB_BITS * 2) + b * 2 ** NB_BITS + c).toString(16);

Giving the following code :

$B.$call($B.$getattr_pep657(locals___main__.a, 'replace', 0x9), 0x2020E)('e')

Then, when handling the error :

a = Math.floor(the_constant / 2 ** (NB_BITS * 2))
b = Math.floor(the_constant / 2 ** NB_BITS) & ((1 << NB_BITS) - 1)
c = the_constant & ((1 << NB_BITS) - 1)

This would be quite quick when handling errors, while using less RAM (no need for a DEBUG_SYMBOLS structure). The current version takes at least 7 characters for the [a,b,c] array; this version takes at most 8 characters, so one more (but would generally be at most 7 characters).

Of course, we can reduce the value of NB_BITS if needed, and even add the line number if we have enough bits. This also keeps the code readable, as we understand that 0x9 is a special value.
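A minimal round-trip sketch of this encoding, assuming the 16-bit split above (encode_pos / decode_pos are illustrative names, not existing Brython functions):

const NB_BITS = 16;

function encode_pos(a, b, c){
    // multiplication instead of << because the result exceeds 32 bits
    return a * 2 ** (NB_BITS * 2) + b * 2 ** NB_BITS + c;
}

function decode_pos(n){
    const mask = (1 << NB_BITS) - 1;
    return [ Math.floor(n / 2 ** (NB_BITS * 2)),
             Math.floor(n / 2 ** NB_BITS) & mask,
             n & mask ];
}

console.log( decode_pos( encode_pos(2, 2, 14) ) ); // [2, 2, 14]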

Further optimisation :

With more knowledge about the 3 values [a,b,c], we can reduce the number of bits. E.g. if a <= b <= c, instead of storing [a,b,c] in 53 bits, we can store [a, b-a, c-a] in those 53 bits, and then allocate fewer bits to b-a and c-a.

This could free bits to store the line number as well, if useful.

C. Replace all the $B.set_lineno(frame, 7) calls by a map ? (NO)

Instead of keeping track of the line number, which likely hurts execution performance, we could create a structure enabling conversion of the JS stacktrace into the Python stacktrace (this would also enable better error printing, I think).

The structure would be globally defined as :

STACKTRACE_MAP = new Map(); // can be a {} if key is a string.
STACKTRACE_MAP.set(FILE_A, [ 1, 10, 23, 34, 106, .... ]);
STACKTRACE_MAP.set(FILE_B, [ 1, 10, 23, 34, 106, .... ]);

I am not sure about the value we need to put in FILE_A/FILE_B; we'll have to look at the JS stacktrace in more depth when exceptions are raised inside Brython code. Maybe there are also ways to cheat a little.

Getting the line numbers would then be done as :

let JS_LINENO = .... // extracted from the stacktrace.
let filemap = STACKTRACE_MAP.get(????);

let i = -1;
while( filemap[++i] < JS_LINENO );

let BRYTHON_LINENO = i; // the index in filemap gives the Python line

When transpiling the Python code, the filemap would be built as :

let offset = 0;
let filemap = new Array( nbLines(py_file) ) // preallocate : can't have more entries than there are Python lines

let nb_lines = 0;
// for each Python line :
        js_file += "the content\nor whatever\ntoto"
        nb_lines += 2 // for the sake of the example.
        filemap[offset++] = nb_lines
filemap.length = offset; // truncate to the entries actually written
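A minimal end-to-end sketch of the lookup, with a hand-built filemap (the values are made up, and whether Python lines are 0- or 1-based is an implementation detail):

// filemap[k] = number of JS lines emitted after transpiling Python line k+1
const filemap = [3, 7, 12, 20];

function js_to_py_lineno(js_lineno){
    let i = 0;
    while( i < filemap.length && filemap[i] < js_lineno ){ i++ }
    return i + 1; // Python lines are 1-based here
}

console.log( js_to_py_lineno(8) ); // 3 : JS line 8 was generated for Python line 3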

This may make transpilation slightly slower, but I'd argue that :

This will also make error printing slower (but we don't care).

Also, with that, would we still need the frame system ? Are there other line numbers hidden in the generated JS code ? (e.g. for Brython class definitions ?)

EDIT: If the generated JS file is minified, the line numbers will not match anymore. I think the minifier can also build a sourcemap that will give the correct line numbers in the stacktrace. Otherwise, a library might be able to do the trick. Failing that, I'd argue that if you minify, you want performance anyway.

D. Option execution-mode="production" to remove debug information and make the output cleaner/shorter. (NO)

Currently, because Brython doesn't implement sourcemaps, the generated JS code is bloated with position arrays and $B.set_lineno(frame, 7) calls.

This makes the generated code harder for a human to read, increases the generated code size, and has a performance cost. This information is really useful when developing, to better understand and fix errors. However, in production, we might want to go faster and might not be interested in debugging information.

For this usage, sourcemaps are generally used (cf. more info about the sourcemap structure), but I can understand that they might not be easy to produce.

Hence, I suggest implementing an execution-mode option that could take 3 values :

Cordially,

PierreQuentel commented 9 months ago

Thanks for the suggestions.

I don't get the idea for A : the value of DEBUG[1] and DEBUG[2] should change before every call to $getattr_pep657, what would be the benefit ?

I am more convinced by B. We should test how much space and execution time this would save.

For C and D :

denis-migdal commented 9 months ago

Okay... for optimisation B, the performance gain will be HUGE.

Optimisation B seems 3.8x faster than the current Brython implementation.

(screenshot of the benchmark results, 2023-10-09)

... BUT there is an even better way... that is 36x faster. The rationale is to transpile the code like this :

$B.LID = 0x000000000007;
$B.IID = 0x000000000003; let R = $B.$getattr_pep657(locals___main__.a, 'replace')
$B.IID = 0x00020002000E; $B.$call(R)('e');
// instead of :
// $B.set_lineno(frame, 7);
// $B.$call($B.$getattr_pep657(locals___main__.a, 'replace', [0,0,3]), [2,2,14])('e')

Note it is possible to offer 3 output modes :

We use an intermediate variable (R), but the browser is clever and optimises it away (so it has no cost). The interesting things are these $B.LID (Line ID) and $B.IID (Instruction ID) global variables.

$B.IID (Instruction ID) is global, and can be accessed from anywhere... This has several advantages :

$B.LID (Line ID) is global, and can be accessed from anywhere... This has several advantages :

And for a production output... just don't include the $B.LID and $B.IID assignments in the generated JS code.
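A hedged sketch of how the error path could read these globals; $B.LID / $B.IID are the variables proposed above (not an existing Brython API), and the 3x16-bit layout matches the 0x00020002000E example:

function current_position(){
    const iid = $B.IID;
    // the three 16-bit fields of the instruction ID (multiplication because it exceeds 32 bits)
    const a = Math.floor(iid / 2 ** 32);
    const b = Math.floor(iid / 2 ** 16) & 0xFFFF;
    const c = iid & 0xFFFF;
    return { lineno: $B.LID, position: [a, b, c] }; // e.g. 0x00020002000E -> [2, 2, 14]
}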

denis-migdal commented 9 months ago

I don't get the idea for A : the value of DEBUG[1] and DEBUG[2] should change before every call to $getattr_pep657, what would be the benefit ?

The idea was to increment this number each time the transpiler needs to write one. But this was not the best solution; better to try the new one I found (or B if the new one isn't possible).

For C and D :

* the line number is not relevant only for stack trace in case of errors, it is also used in trace functions such as set by `sys.settrace()` (cf. examples in the test of module **`sys`** in the Brython test suite), I don't think we can rely on Javascript internals for that

Ouch... Yeah it hurts...

Possible solutions :

  1. Asking the user to explicitly enable it before transpilation if they want to use it? As it seems very costly, it may be something we don't want to enable by default (then settrace would raise an exception "settrace is not enabled, add the option XXXX to enable settrace").
  2. A global function :

    function LID_base(frame, lid) {
        $B.LID = lid // if using the last solution ; for solution C, just do nothing.
    }
    function LID_trace(frame, lid) {
        // do your stuff (call the trace function, etc.)
    }
    $B.L = LID_base
    // .....
    $B.L(frame, 7)
    // do stuff
    $B.L(frame, 8)

Then, when settrace is called, do $B.L = LID_trace (as sketched below)? I think this would help the browser inline the function call until settrace is enabled (so as to reduce the cost when settrace isn't used).
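A minimal sketch of that swap (set_trace_hook and $B.trace_function are hypothetical names, not the actual settrace plumbing):

function set_trace_hook(tracefn){
    $B.trace_function = tracefn
    // pay for the expensive hook only while tracing is active
    $B.L = tracefn ? LID_trace : LID_base
}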

denis-migdal commented 9 months ago

Okay, last night I took a better look at the code transpiled by the online editor. I'll do the PR test stuff in the evening, I guess.

1. Use finally (NO)

A small remark : maybe it would be worth using the finally clause, i.e. try {} catch(e) {} finally {}.

Indeed, it is often written :

try {
    // do stuff
    $B.leave_frame()
    return result
}catch(err){
      // do stuff
      $B.leave_frame();throw err
}

This can be replaced by :

try {
    // do stuff
    return result
}catch(err){
      // do stuff
      throw err
} finally {
    $B.leave_frame()
}

2. Benchmarks for frame cost [NO]

It seems all the frame stuff is what takes the most space (like > 50% of the produced JS ?). I think we really need a version/mode without the frame info, settrace() support, the debug information/pretty error printing, etc., in order to :

The benchmarks would give us an idea of the cost of settrace() support. If it comes at the cost of being, e.g., 40x slower, maybe we have a big margin for progress and it would be worth optimizing. If it is, e.g., 1.01x slower, maybe further optimization isn't worth it.

3. Function declaration

The other thing that takes a lot of space in the produced code seems to be function declarations. But nearly half of that is due to frames.

The other thing that takes space is the $infos property. Could a helper help reduce the generated JS code size (while making the output more readable) ? E.g. :

foo1382.$infos = $B.build_fct_info("foo", ....) // I think some info is identical for all function definitions ?

I think this may come at the price of slower execution (though the browser might be able to inline the function) ? But slower function declaration doesn't really matter : it is the function execution time that matters (we define a function once, but execute it multiple times).
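A minimal sketch of such a helper; build_fct_info is the name suggested above and the fields are illustrative, not Brython's actual $infos layout:

$B.build_fct_info = function(name, module, arg_names){
    return {
        __name__: name,
        __qualname__: name,
        __module__: module,
        arg_names: arg_names
    }
}

// the generated code would then shrink to something like :
// foo1382.$infos = $B.build_fct_info("foo", "__main__", ["x", "y"])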

4. Moving some operations outside of the function body ?

In each function execution the following is done :

Maybe it is possible to push it all inside a new function $B.execute() in order to :

This costs an extra function call when calling Brython functions. I don't know whether the browser could inline functions better thanks to that, or whether this extra cost is really significant. Also, by reducing the generated JS code size, we mechanically reduce transpilation time.

Possibility 1:

locals_exec.foo = $B.build_executor( foo1382 ) // would build/return a $B.execute function whose 2nd argument is bound to foo1382

Possibility 2 : replace $B.$call by :

$B.$execute([0, 0, 7], locals_exec.foo, 32)

I don't know which method would be better if 4. is a good idea, but that is something that could be tested.

$B.execute would then be something like :

$B.execute = function(DEBUG_INFO, fct, ...args){

    // handle arguments and other stuff
    var locals_exec_foo,
        locals
    locals_exec_foo = locals = $B.args0(fct, args)
    var frame = [fct.__name__, locals, "exec", locals_exec, fct]
    if(locals.$has_generators){
      frame.$has_generators = true
    }
    frame.__file__ = '<string>'
    frame.$lineno = 1

    // enter the frame
    frame.$f_trace = $B.enter_frame(frame)
    try {
        $B.js_this = this
        let ret = fct(frame, args) // or fct(frame, ...args) ???
        // handle return type
        return ret
    } catch(err) {
        // handle exception
        throw err
    } finally {
        // leave frame
        $B.leave_frame()
    }
}
denis-migdal commented 9 months ago

@PierreQuentel At the top of the issue, I will put the optimisations you find interesting, so it'll be easier for you to browse this issue.

I found an easy way to get a VERY HUGE performance gain in the current system: for the frame/stack, use a uni-directional tree instead of a stack.

Indeed, using an array is costly :

Using a uni-directional tree (i.e. a linked list) instead would be way quicker :

let ptr = { previous: null, /* all your frame attribute here */ }

function enter_frame(ptr) {
      // do your stuff
      return {previous: ptr,  /* all your frame attribute here */ }
}
function leave_frame(ptr) {
      // do your stuff
      return ptr.previous
}

// usage :
ptr = enter_frame(ptr);
// frame stuff
ptr = leave_frame(ptr);

// printing the stack :
let cur = ptr
while( cur !== null ) {
     // do your stuff
     cur = cur.previous
}

Printing the stack would be slower, but we don't care.

PierreQuentel commented 9 months ago

Hi Denis,

With the commit above, encoding and decoding the format information is delegated to functions in ast_to_js.js, so we can test several formats. Note that the number of items can be 4, in the case of binary operations, to correctly trace errors such as

Traceback (most recent call last):
  File "http://localhost:8000/tests/test.html#__main__", line 1, in <module>
    1 / 0
    ~~^~~
ZeroDivisionError: division by zero
denis-migdal commented 9 months ago

EDIT: better idea after : https://github.com/brython-dev/brython/issues/2270#issuecomment-1754841797

Mmm... testing something like that would be interesting (in size of the JS output, transpilation time, and execution time).


// Total: 32 bits, assuming A < 256, B-A < 256, C-B < 256, and D < 256.
// At most, a position is 2+8 characters : "0x" + 2 hex chars for each number.
// Each bit added to or removed from each value is +/- 1 character overall (4 values x 1 bit = 4 bits = 1 hex character).
// We can increase the bits up to 13 (so values up to 8191).
// I think it'll be safe to go down to 7 bits (values up to 127). One bit less would be a risk (63).
// [a,b,c,d] is at least 9 characters. 7 bits is at most 9 characters (but usually d = 0/undefined, so in reality it would very often be 7 characters, or even 5-6 when b = a, which is also very often the case).
const NB_BITS_A = 8;
const NB_BITS_B = 8;
const NB_BITS_C = 8;
const NB_BITS_D = 8;

const MASK_A = (1 << NB_BITS_A ) - 1
const MASK_B = (1 << NB_BITS_B ) - 1
const MASK_C = (1 << NB_BITS_C ) - 1
const MASK_D = (1 << NB_BITS_D ) - 1

// we assume d is the most likely to be 0, then b, then a ; c is the least likely to be 0.
const OFFSET_C = 0;
const OFFSET_A = NB_BITS_C;
const OFFSET_B = OFFSET_A + NB_BITS_A;
const OFFSET_D = OFFSET_B + NB_BITS_B;

function encode_position(a, b, c, d = 0){
    // assuming a <= b <= c (if we had more info about "d" we could do something with it) :
    c -= b;
    b -= a;

    return "0x" + ( (d << OFFSET_D) + (b << OFFSET_B) + (a << OFFSET_A) + (c << OFFSET_C) ).toString(16); // condensed format
}

$B.decode_position = function(pos){
    let d = (pos >> OFFSET_D) & MASK_D
    let b = (pos >> OFFSET_B) & MASK_B
    let a = (pos >> OFFSET_A) & MASK_A
    let c = (pos >> OFFSET_C) & MASK_C

    // assuming a <= b <= c (we do the reverse of what we did in encode).
    b += a
    c += b

    return [a,b,c,d]
    // or, if a 4th item equal to 0 isn't supported :
    // if( d === 0 )
    //     return [a,b,c]
    // else
    //     return [a,b,c,d]
}
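For instance, a quick round-trip check (illustrative values; Number() parses the "0x..." string back into an integer):

console.log( $B.decode_position( Number( encode_position(0, 0, 9) ) ) )  // [0, 0, 9, 0]
console.log( $B.decode_position( Number( encode_position(2, 2, 14) ) ) ) // [2, 2, 14, 0]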

127 chars (the 7-bit maximum) would correspond to :

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Which seems like a lot for an indentation (the start of the "error" ?) as well as for a function name.

denis-migdal commented 9 months ago

EDIT: better idea after : https://github.com/brython-dev/brython/issues/2270#issuecomment-1754841797

The issue with my previous proposition is that :

Another possible representation is to use the last digit (4 bits) to encode the number of bits taken by each value (0->F => 0 to 15 bits). (We could also use 3 bits instead of 4, then 0->7 => 3 to 10 bits per value.) This would solve the 2 issues of the previous proposition. I think this may be one of the most optimal strategies memory-wise, but it would require some computation to find the number of bits to use for each value. This process is a little hard to explain.

But I think I have a simpler strategy :

Let (a1, a2, a3), (b1, b2, b3), (c1,c2,c3), (d1,d2,d3) be the bits of a, b, c, d; we store pos as the interleaved bits (a1,b1,c1,d1,a2,b2,c2,d2,a3,b3,c3,d3).

The order of the bits doesn't matter: (a1,b1,c1,d1) is 4 bits, so 1 character. As long as one of them is non-zero, one char will be taken. This means that the size of pos is 2 (the "0x") plus the bit length of max(a,b,c,d) characters.

i.e. most of the time not more than 2+6 characters (= 8 characters), for values up to 63. It can easily go up to the max (13 bits), i.e. values up to 8191.

function encode_position(a, b, c, d = 0){

    // assuming a <= b <= c (if we had more info about "d" we could do something with it) :
    c -= b;
    b -= a;

    let pos = 0;
    let offset = -1    

    while( a+b+c+d !== 0) {
         pos |= (a & 1) << ++offset;
         a >>= 1

         pos |= (b & 1) << ++offset;
         b >>= 1

         pos |= (c & 1) << ++offset;
         c >>= 1

         pos |= (d & 1) << ++offset;
         d >>= 1
    }

    return "0x" + pos.toString(16); // condensed format
}

$B.decode_position = function(pos){

    let a = 0, b = 0, c = 0, d = 0;
    let offset = 0

    while( pos !== 0) {

         a |= (pos & 1) << offset;
         pos >>= 1

         b |= (pos & 1) << offset;
         pos >>= 1

         c |= (pos & 1) << offset;
         pos >>= 1

         d |= (pos & 1) << offset;
         pos >>= 1

         ++offset
    }

    // assuming a <= b <= c (we do the reverse of what we did in encode).
    b += a
    c += b

    return [a,b,c,d]
    // or, if a 4th item equal to 0 isn't supported :
    // if( d === 0 )
    //     return [a,b,c]
    // else
    //     return [a,b,c,d]
}
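Quick round-trip check of the interleaved layout (illustrative values):

console.log( $B.decode_position( Number( encode_position(0, 0, 9) ) ) )  // [0, 0, 9, 0]
console.log( $B.decode_position( Number( encode_position(2, 2, 14) ) ) ) // [2, 2, 14, 0]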
denis-migdal commented 9 months ago

Okay, let's do better :

Here is what the code should look like (not tested) :

function encode_position(a, b, c, d = c){

    // assuming a <= b <= c <= d (if we had more info about "d" we could do something with it) :
    d -= c;
    c -= b;
    b -= a;

    // number of bits of the largest stored value (at least 1)
    let nbbits = Math.floor( Math.log2( (a | b | c | d) || 1 ) ) + 1;

    let pos = 1 << nbbits; // our flag.

    let has_d = +(d !== 0)
    let has_b = +(b !== 0)

    let indic = has_b + (has_d << 1)

    // multiplication by has_* prevents a condition - should be faster ?
    pos |= d
    pos <<= nbbits * has_d

    pos |= b
    pos <<= nbbits * has_b

    pos |= c
    pos <<= nbbits

    pos |= a
    pos <<= 2

    pos |= indic

    return "0x" + pos.toString(16); // condensed format
}

$B.decode_position = function(pos){

    let has_b = pos & 0x1
    let has_d = (pos & 0x2) >> 1
    let nb_values = 2 + has_b + has_d
    pos >>= 2

    // the flag bit sits just above nb_values * nbbits bits
    let nbbits = Math.floor( Math.log2( pos ) ) / nb_values
    let mask = (1 << nbbits) - 1

    let a = pos & mask
    pos >>= nbbits
    let c = pos & mask
    pos >>= nbbits

    // multiplication by has_* prevents a condition - should be faster ?
    let b = (pos & mask) * has_b
    pos >>= nbbits * has_b

    let d = (pos & mask) * has_d
    pos >>= nbbits * has_d

    // undo the deltas computed in encode_position
    b += a
    c += b
    d += c

    if( ! has_d )
         return [a,b,c]
    return [a,b,c,d];
}
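Quick round-trip check of this variant (illustrative values; Number() parses the "0x..." string back into an integer):

console.log( $B.decode_position( Number( encode_position(1, 3, 9) ) ) )     // [1, 3, 9]
console.log( $B.decode_position( Number( encode_position(0, 0, 9, 12) ) ) ) // [0, 0, 9, 12]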
PierreQuentel commented 9 months ago

About finally in try/catch blocks: it was used in early versions of Brython, but I removed it because of performance issues. They are less important now, but still around 10% on Chrome between these versions (g is faster than f):

function f(){
    try{
        x
    }catch(err){
        err.message
    }finally{
        3 + 4
    }
}

function g(){
    try{
        x
        3 + 4
    }catch(err){
        err.message
        3 + 4
    }
}
g()

The link you put on "it doesn't add any overcost." seems broken.

PierreQuentel commented 9 months ago

About "2. Benchmarks for frame cost": we really can't do without frames management. It is not used only to handle exceptions: if you remove it, built-in functions like exec(), eval(), globals() or locals() won't work, neither will super() or del <name>, etc.

PierreQuentel commented 9 months ago

More generally: one of the design principles in Brython is that it should be as close to Python as possible; between performance and compliance with the language definition, I always favour compliance.

This is why I am opposed to options at the user's choice to determine which parts of Brython should or should not work as expected by a Python developer. It's too difficult to explain (for us) and to remember (for users) which options should unplug which feature, and it would make debugging difficult if unplugging a feature has side-effects we had not thought of.

PierreQuentel commented 9 months ago

About the replacement of position information: are you sure there will be such an increase in performance ?

I tried 2 different versions, one with an array and one with a hex number; there is no difference on Firefox or Chrome:

N = 10000000
function ignore(){}

var t0 = Date.now()
for(var i=0; i<N; i++){
    ignore([1,2,3])
}
console.log(Date.now() - t0)

var t0 = Date.now()
for(var i=0; i<N; i++){
    ignore(0x999)
}
console.log(Date.now() - t0)

Also tested on jsPerf, with even a 0.3% improvement with arrays...

denis-migdal commented 9 months ago

About finally in try/catch blocks: it was used in early versions of Brython, but I removed it because of performance issues. They are less important now, but still around 10% on Chrome between these versions (g is faster than f):

Hmm, strange, I do see a small difference in perf, but not to that extent.

The link you put on "it doesn't add any overcost." seems broken.

Mmm... that's strange. Here is a new link : https://jsperf.app/wuxivi

Well, it seems finally isn't worth it then.

denis-migdal commented 9 months ago

Thanks for your answer.

About the replacement of position information: are you sure there will be such an increase in performance ?

Yes : https://jsperf.app/mitogo — in this jsperf, arrays are 6x slower on FF, and 6.4x slower on Chromium.

Of course, the increase only applies to this part of the code, so you won't get an overall 6x speed increase.

I tried 2 different versions, one with an array and one with an hex, there is no difference on Firefox or Chrome:

Your function ignore() is empty and the browser notices it. So the function is likely inlined, i.e. the loop is likely doing nothing ? However, when the browser sees that the value is likely to be used, and has no assurance that it'll never be modified (cf. my jsperf above), it is forced to create the array at each iteration (though it might still have some optimizations under the hood).

Also tested on jsPerf, with even 0.3% improvement with arrays...

0.3% is within the margin of error, so we could say it performs the same. Which makes me think the ignore() function was indeed inlined.
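A sketch of a micro-benchmark that actually consumes the value, so the allocation can't be optimized away (illustrative, not the exact jsPerf code):

const N = 10000000

function use_array(pos){ return pos[0] + pos[1] + pos[2] }
function use_hex(pos){ return (pos >> 16) + ((pos >> 8) & 0xFF) + (pos & 0xFF) }

let acc = 0
let t0 = Date.now()
for(let i = 0; i < N; i++){ acc += use_array([1, 2, 3]) }
console.log('array', Date.now() - t0, acc)

acc = 0
t0 = Date.now()
for(let i = 0; i < N; i++){ acc += use_hex(0x010203) }
console.log('hex  ', Date.now() - t0, acc)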

Else, on "stylish"/"readability" considerations, if it doesn't cause performances reductions (ofc), wouldn't it be better to put the instruction ID as first parameter (but I guess putting it last is better as you can do --arguments.length) ?

$B.$call([2,2,14], $B.$getattr_pep657([0,0,3], locals___main__.a, 'replace') )('e')
// or
$B.$call(0x22E, $B.$getattr_pep657(0x3, locals___main__.a, 'replace') )('e')

I'm grasping at straws here, but it is easier to visually associate the position with the called function (and, when there are many lines, to align the code better). In the current implementation the position for $B.$call is quite far away, at the opposite end of the line : $B.$call($B.$getattr_pep657(locals_main.a, 'replace', [0, 0, 9]), [2, 2, 14])('e')

But yeah, I'm really grasping at straws here xD.

I guess the only optimizations we could do on the transpiler output are :

(*) x.foo() is converted into :

$B.$call($B.$getattr_pep657(locals_exec.x, 'foo', [0, 0, 5]), [2, 2, 7])()

Which seems strange to me as I'd expect to see either :

$B.$call(locals_exec.x, $B.$getattr_pep657(locals_exec.x, 'foo', [0, 0, 5]), [2, 2, 7])()
// or
$B.$call($B.$getattr_pep657(locals_exec.x, 'foo', [0, 0, 5]), [2, 2, 7])(locals_exec.x)

This makes me think that $B.$getattr_pep657 builds a new anonymous function when accessing the property ?

// speculative sketch of what $getattr_pep657 might do :
function getattr_pep657(obj, attr /*, position */) {
     let fct = obj[attr]
     // some checks
     return function(...args) {  return fct.call(obj, ...args)  }
}

Then there could be ways to optimize it a little I guess.

I'll try to hide messages that are not relevant anymore to keep this thread clean.

PierreQuentel commented 9 months ago

Your test https://jsperf.app/mitogo modifies a global variable, which is never the case in Brython code (or unwillingly...).

If you set a local variable, there is no difference, though it's likely that the data is initialized : https://jsperf.app/xofeta.

denis-migdal commented 9 months ago

Your test https://jsperf.app/mitogo modifies a global variable, which is never the case in Brython code (or unwillingly...).

This isn't a global variable (it isn't in window) : it is a local variable.

If you set a local variable, there is no difference, though it's likely that the data is initialized : https://jsperf.app/xofeta.

The browser sees that the variable is assigned but never used.

The browser can be very tricky on small pieces of code. I'll check again in a few days.

denis-migdal commented 9 months ago

I made a test to prevent any aggressive browser optimisation; the conclusions are :

The explanation is quite simple: the array needs to be rebuilt at each iteration, which is costly and requires dynamic memory allocation, whereas the number is constant and can be stored on the stack. Using arrays also increases memory usage if lots of functions are called without interruption, as the garbage collector might not have time to run and free the memory.

- Test 1 : https://jsperf.app/xofeta/2

Firefox: (screenshot : "Array vs hex vs number (v2)" results)

Chromium: (WTF????) (screenshot of the results)

- Test 2 : doing nothing but a check ( https://jsperf.app/xofeta/3 )

(screenshots of the jsperf.app results on Firefox and Chromium)

PierreQuentel commented 9 months ago

Hi Denis,

I found an easy way to get a VERY HUGE performance gain in the current system : for the frame/stack : use an uni-directionnal tree instead of a stack.

Thanks for the suggestion. I have implemented it in commit 44ba0d16823857b10ad17e6d4d89bc3712dfc07c (many things to change in many places !). The performance gain is not huge; test results vary from one run to another, but there is nothing really noticeable. If you have tools to compare this version to the previous one it will help.

PierreQuentel commented 9 months ago

Could you also test various ways of encoding position info (compared to the current implementation with an array) and see if it actually improves the overall performance ? Not only unit tests as you did, but tests with complete Brython code, as in the integrated speed measurements made by _speed/makereport.html.

denis-migdal commented 9 months ago

Hi,

I'll have to take some time to write my answer, as there is a lot to say and some tricky things to explain.

Also, during this week I worked on the proposal for asynchronous imports, and on writing it up. It is almost done, and I think I'll post it this weekend.

denis-migdal commented 9 months ago

Optimisation in Brython is quite tricky.

Not only do we have the constraint of mimicking CPython, thus requiring some "costly" ways of doing things... but we are also in JS, in a browser, which is quite tricky... and I discover new nuances every day.

denis-migdal commented 9 months ago

Maybe you can also point me to some Brython scripts that you think are quite slow to execute (e.g. compared to CPython), and I'll investigate where the cost is coming from ?

denis-migdal commented 9 months ago

as in the integrated speed measurements made by _speed/makereport.html.

It seems the directory www/tests/ace/ is missing.

PierreQuentel commented 9 months ago

Correct, I have removed the Ace scripts from make_report.html.

denis-migdal commented 9 months ago

I'll have to take a deeper look at the documentation.

I downloaded the Brython archive from the git repo and set up a web server on www, but I get some JS errors on www/speed and www/speed/make_report.html.

PierreQuentel commented 9 months ago

Things are improving, thanks to your suggestions over the past days (thanks again!) and an optimization in the creation of class instances (commit cbaccc078cf1a0e8714c5986aabab8839b1b1601). For the first time, I think, none of the tests in make_report.html is more than 10 times slower than CPython !

I have to confess, I didn't expect Brython to do so many costly things, often due to the fact that we need to mimic CPython's behavior.

Welcome to my world ;-)

Maybe you can also point me to some Brython scripts that you think are quite slow to execute (e.g. compared to CPython), and I'll investigate where the cost is coming from

As you noticed, parsing function arguments is complex because of the flexibility offered by Python. I see you had ideas to improve the algorithm; if you can focus on this topic and provide speed improvements, even limited ones, it would be great.

The regular expression module (in libs/python_re.js) is also much slower than the re module in CPython, but the code is probably very hard to follow...

PierreQuentel commented 9 months ago

I'm downloading the Brython archive from the git, I set a webserver on www, but I get some JS errors on www/speed and www/speed/make_report.html.

You must run the script server.py at the root of the Brython repository, it is designed to handle the Ajax requests sent by make_report.html

denis-migdal commented 9 months ago

You must run the script server.py at the root of the Brython repository, it is designed to handle the Ajax requests sent by make_report.html

Hum, the following error is printed when executing server.py :

Traceback (most recent call last):
  File "/tmp/brython-master/server.py", line 243, in <module>
    exec(make_doc)
  File "<string>", line 38, in <module>
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/brython-master/www/static_doc/3.12'

Things are improving, thanks to your suggestions in the past days (encore merci !) and an optimization in the creation of class instances (commit https://github.com/brython-dev/brython/commit/cbaccc078cf1a0e8714c5986aabab8839b1b1601). For the first time I think, none of the tests in make_report.html is more than 10 times slower than CPython !

I'm happy to hear that ^^ The last optimisations I suggested seemed unfortunately quite disappointing.

As you noticed, parsing function arguments is complex because of the flexibility offered by Python. I see you had ideas to improve the algorithm, if you can focus on this topic and provide speed improvements, even limited, it would be great.

It seems that even if Python has some flexibility, there are still some rules we can exploit to speed up the process.

I think the process would be to rewrite it from scratch, starting with the easiest case (no arguments), and to benchmark it. Then add a new case (e.g. "normal arguments") and re-benchmark both cases, to see whether a significant cost is added to the first case, and so on. That would show when the addition of a new kind of parameter costs a lot, and whether we need to split things up.

In order to test that, I'll have to find a way to extract the function so I can execute it "alone", without the whole function-call machinery. That'd improve reproducibility and would make JS tests easier. Maybe I should create a small git repo to do the benchmarks and adaptations little by little.

The regular expression module (in libs/python_re.js) is also much slower than the re module in CPython, but the code is probably very hard to follow...

Aren't CPython regular expressions the "same" as JS regular expressions ?

PierreQuentel commented 9 months ago

The bug when starting server.py should be fixed by the commit above.

Aren't CPython regular expression the "same" as JS regular expressions ?

How I wish that were the case ! But no, the Python engine has a slightly different syntax, and features that the JS engine doesn't have... See this page for the ever-evolving list of differences. [EDIT] this page is 7 years old, the differences have changed a lot since then

As always, the JS engine is available in Brython as window.RegExp, there is an example in the demo page.

denis-migdal commented 9 months ago

Humm.....

I'd like to test argument parsing like that :

  1. I declare my function in Brython, so that I can modify it without having to copy/paste things from Brython's editor :

    from browser import window

    def run():
        print('ok')

    window.run = run
  2. I'm now able to construct mock calls to test different algorithms :

    setTimeout( () => {

        function test() {
            let foo = __BRYTHON__.jsobj2pyobj(window.run);
            console.log(foo);
            console.log( __BRYTHON__.args0( foo, arguments ) );
        }

        test('ok')

    }, 1000)

The issue is that jsobj2pyobj() doesn't give me back the original run() function and the constructed object seems to lack some data in $infos.

denis-migdal commented 9 months ago

The bug when starting server.py should be fixed by the commit above.

Thanks

How I wish it would be the case ! But no, the Python engine has a slightly different syntax, and features that the JS engine doesn't have... See this page for the ever-evolving list of differences. [EDIT] this page is 7 years old, the differences have changed a lot since then

Yeah so it's 99% the same, but as always it's the 1% we almost never use that fucks up everything xD.

So the solution is either to have some "conversion" from re regexes to JS regexes, which may be a little hard to manage (and introduces an additional cost), or to implement the whole re regex engine in JS...

Yeah that's fucked up whatever you choose. xD

denis-migdal commented 9 months ago

Yeah, 4k lines for re...

I could take a look at what takes the most time... but I'm afraid it wouldn't work miracles...

I assume python "re" is implemented in C++ (or something like that) in some python implementation. If you can somehow compile it to WASM, maybe you could get the full implementation while having some perf increases ???

PierreQuentel commented 9 months ago

I assume python "re" is implemented in C++ (or something like that) in some python implementation. If you can somehow compile it to WASM, maybe you could get the full implementation while having some perf increases ???

It is written in C, the files are here. I have no idea how to convert it to WASM or Javascript.

denis-migdal commented 9 months ago

It is written in C, the files are here. I have no idea how to convert it to WASM or Javascript.

For the current implementation, did you use this file and rewrite it by hand, or did you implement your own regex engine from scratch ?

As it is in C, there is almost no way a JS version would be faster than CPython.

To convert it to WASM, I think there are tools like Emscripten. But to be fair, I am not sure how good the results would be.

PierreQuentel commented 9 months ago

Yes, I wrote everything by hand. Adapting the C code (as I did for other scripts, libs/maths.js for instance) was too difficult.

denis-migdal commented 9 months ago

I see, you are quite courageous xD.

Then it may be that there is some algorithmic issue, or something of the kind. Testing Emscripten to build the WASM may be interesting. I'll see if I can build a student project around that xD.

Otherwise, would it be possible to have jsobj2pyobj(window.run) return the original Brython function ? That'd help me quite a bit to test argument parsing.

PierreQuentel commented 9 months ago

Else, could it be possible to get jsobj2pyobj(window.run) returning the original brython function ? That'd help me quite a bit to test arguments parsing.

If it's only for your tests, the best is to hack the function to make it return what you need

denis-migdal commented 9 months ago

If it's only for your tests, the best is to hack the function to make it return what you need

I think it'd be best practice that :

let Y = jsobj2pyobj( pyobj2jsobj( X ) )
X === Y // they are the same object.

I think such symmetry might prevent issues.

Else, I found a little workaround :

  from browser import window

  def run(i):
      print('ok')

  def getRun():
      window.run = run

  window.getRun = getRun

setTimeout( () => {

  let orig = __BRYTHON__.pyobj2jsobj;
  __BRYTHON__.pyobj2jsobj = (args) => {
      let ret = orig(args);

      ret.$orig = args;

      __BRYTHON__.pyobj2jsobj = orig;

      return ret;
  }

  window.getRun();
  let foo = window.run.$orig;
  console.log('run is', foo);

  console.log( __BRYTHON__.args0( foo, ["ok"] ) );

}, 1000)

So now I can test it a little better (maybe more tomorrow, or during the weekend). I think I'll open a git repo specifically for argument parsing, in order to be able to revert and try other things.

denis-migdal commented 9 months ago

I still have some JS errors related to "ace" on the speed pages.

I think I'll start doing some micro-optimizations on the files (array preallocation, using const/let instead of some var, changing some algorithms, etc.). I won't test the speed, as the differences are likely insignificant most of the time (with the occasional nice surprise in perf :) ). I will favor code readability, i.e. I won't touch suboptimal loops like for X in Y and for X of Y (using for(let i = 0; i < Y.length; ++i) is in fact faster), unless you want me to replace them as well (but I am not sure the perf gain is worth it). I think I'll replace the .forEach() calls though.
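For example, the kind of mechanical replacement meant here (illustrative):

const items = [1, 2, 3]
let total = 0

// before
items.forEach(x => { total += x })

// after : an indexed loop, usually at least as fast
for(let i = 0; i < items.length; ++i){ total += items[i] }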

The goal would be to clean up the code a little, reduce memory operations a little (e.g. by preallocating arrays) and help browser optimizations (e.g. by using const where it applies).

There are ~50k lines to look at, so I'll do it little by little.

denis-migdal commented 9 months ago

It is written in C, the files are here. I have no idea how to convert it to WASM or Javascript.

Found an interesting link for it : https://openhome.cc/eGossip/WebAssembly/C2Wasm.html

Gotta check on it one day.