denis-migdal opened 1 year ago
Denis, I just wanted to say that I'm really enjoying reading your investigations into Brython performance. Coming up with these accurate benchmarks is difficult and also very helpful. I wish I had more time right now to help. Pierre, I agree with you that keeping Brython equivalent to CPython is more important than improving performance. But if we can have it all that's great! Thank you both.
In the commit above I have added a flag "trace" in ast_to_js.js; if set to 0, the code for traces is not generated.
I ran the built-in speed test (_/speed/makereport.html) which generates _/speedresults.html on Firefox. The result for function calls is disappointing: there is almost no difference with or without trace.
def f(x):
    return x

for i in range(1000000):
    f(i)

def f(x, y=0, *args, **kw):
    return x

for i in range(100000):
    f(i, 5, 6, a=8)
This is very far from the x10 improvement you mention for Firefox.
Hummm...
This tends to indicate that function calls are doing something way more expensive compared to the little tests I made, making the gain in speed almost insignificant (I don't think this is due to browser optimisation this time). If I had to make an educated guess, I'd say that maybe the stack handling is killing performance (creating, then adding, an object to an array (the stack) at each function call).
But I'm an idiot... I could have simply used the editor to generate the JS and executed it with the browser tools to see the performance and what costs the most time... I should have done that in the first place to get "realistic" tests instead of relying on little mocks made in JSPerf...
I'm sorry, I really didn't think about it.
For me it was clear that we'd have a speed gain, but I didn't think it'd be this insignificant in reality, due to the things I didn't include in my tests.
Maybe it'd be possible to use brython.js in JSPerf, copy the JS generated by the editor into the "Current version" test, then copy-paste and modify it to create other tests? I really didn't think about it, and I don't know why.
When copying "brython.js" into "Setup JS", JSPerf throws errors (I assume the file is too big for that). When using the "HTML Setup", it works a little better, but some errors are thrown:
Uncaught DOMException: The operation is insecure.
<anonymous> https://raw.githack.com/brython-dev/brython/master/www/src/brython.js:22
<anonymous> https://raw.githack.com/brython-dev/brython/master/www/src/brython.js:129
TypeError: $B.imported is undefined
uid1697192098634createFunction https://jsperf.app/sandbox/6529186ec7d9b980f2758267 line 1 > injectedScript:6
NextJS 6
ec
ec
ed
run
d
tJ
[id]-31be30a7d642ff84.js:1:22868
Funny enough, the randomly generated name of my test is https://jsperf.app/pabeku (reading "Pas beaucoup", French for "not much", with an accent), which matches the unsatisfactory results we got xD.
I will now try to look at the performance through an execution in the Editor. Sorry, I should have done that from the start...
But now that we have this, could it be possible to build brython_stdlib.js and compare its weight before and after? I'd assume the difference is quite small, but is it like 5% or 0.0005%?
It seems that it is `$B.indexedDB = _window.indexedDB;` that causes this issue. Could it be possible to have an option to disable it so that we could use tools like JSPerf?
Here is the test I made. It takes ~1 min to execute; please find the stack trace below.
Do you know what `_b_.eval` is? It feels really strange that such a function would be the thing that takes the most time. Compared to this, the function call is 2 sec, so ~3.7% of the total execution time.
Here is a clean trace I made locally on a clean HTML page (you can import it in the "Performance" tab of Chrome dev tools).
Trace-20231013T131633.json.zip
There is an anonymous function call that takes most of the time. `loop7` seems to be my `loop` function, and `f6` my `f` function.
I think I first need to convert the Brython file into JS in order to be able to profile it better, but copying the JS code from the editor doesn't seem enough to be able to execute it (BRYTHON not found).
Here is the Brython code:
<!DOCTYPE html>
<html>
<head>
<!-- Required meta tags-->
<meta charset="utf-8">
<title>X</title>
<!-- Brython -->
<script src="https://raw.githack.com/brython-dev/brython/master/www/src/brython.js"></script>
<!--<script src="https://raw.githack.com/brython-dev/brython/master/www/src/brython_stdlib.js"></script>-->
<script type="text/python">
from browser import document
import time
start = time.time()
def f(i):
    return None

def loop():
    for i in range(100000000):
        f(i)

loop()
end = time.time()
document <= "Done in " + str(end - start)
</script>
</head>
<body>
</body>
</html>
This function seems to take the most time (brython.js:5532):
return function(){try{return callable.apply(null,arguments)}catch(exc){$B.set_exception_offsets(exc,position)
throw exc}}}
And this is not code called from it, as its self-time is almost equal to its total time. Are you catching exceptions continuously, only to rethrow them and catch and ignore them elsewhere???
When executing the loop alone it takes 3 sec (but the function `loop7` only takes 1 sec to execute).
With `None`: no difference.
The other hypothesis is that the browser always does the same thing, so it caches the results; hence `f6` only takes 15 ms of execution time, and most of line 5532's execution time is due to the browser looking into its cache? Though, this function should produce side effects, so it shouldn't be optimized at this point???
Then, maybe the optimisation I suggested made no difference in this example due to this optimisation of calling a function in a loop, but would manifest in real-life situations???
It's so strange.
$B.$call=function(callable,position){callable=$B.$call1(callable)
if(position){position=$B.decode_position(position)
return function(){try{return callable.apply(null,arguments)}catch(exc){$B.set_exception_offsets(exc,position)
throw exc}}}
return callable}
- Move `decode_position` into the catch; there is no need to pre-compute it. Performance when we get errors is not a big issue (you shouldn't have billions of errors per second), and this function should be called only once, so precomputation is useless here.
- Is `arguments` slower than `function(...args){ return callable.apply(null, args) }`?
- Is `callable.apply(null, args)` slower than calling the function directly, `callable(...args)`?
- If you change `$B.$call` so that it takes the function parameters as arguments, you'd prevent one function creation (and one more if you handle the `$B.$call($B.getattr)` case as a `$B.$callM`).
- But yeah, function creation is expected to be slow, and maybe it prevents the browser from performing some kind of opti???
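To make the closure-creation point concrete, here is a minimal sketch of what a `$callM`-style helper could look like: it applies the callable directly instead of returning a freshly built wrapper function. All names here (`callM`, `setExceptionOffsets`) are invented for illustration, not Brython's actual API:

```javascript
// Hypothetical stand-in for $B.set_exception_offsets
function setExceptionOffsets(exc, position){ exc.position = position; }

// One call, no intermediate closure allocated per call site.
function callM(callable, position, ...args){
    try{
        return callable(...args);
    }catch(exc){
        // only pay for position handling on the error path
        if(position !== null){ setExceptionOffsets(exc, position); }
        throw exc;
    }
}

const double = x => 2 * x;
console.log(callM(double, null, 21)); // 42
```

The point of the sketch is that the wrapper body moves into `callM` itself, so the engine sees one monomorphic function instead of a new closure per call.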
I'm trying with Firefox. The previous results seem to be due, again, to Chrome's crazy optimisations...
Firefox 2023-10-13 14.44 profile.json.gz
If I try to interpret this graph:
- `$B.args0` takes 38% of the time, mainly due to `$B.parse_args` (26% of the time).
- `$B.enter_frame` and `$B.leave_frame` take 18% of the time.
- `next` takes 12% of the time.
- `$B.$call` takes 8% of the time.

If I try to conclude from this graph:
- `$B.$call` could be reduced from 8% to ~4% (so -4% total execution time) if it didn't create a new anonymous function at each call. Even more if we could take an in-depth look at `$B.$call1` (line 5535).
- `$B.set_lineno` takes 4% of execution time.
- The `for` loop is taking 17% of the time. I think we could win a few % if we optimized `for i in range(a,b,step=c)` as `for(let i = a; i < b; i += c)` or as `for(let i = a; i < b; ++i)`, but that'd require deciding that all `integer`s are `BigInt`. Indeed, `next` is due to using an iterator.

`$B.$call1` (4% of execution time):
$B.$call1=function(callable){if(callable.__class__===$B.method){return callable}else if(callable.$factory){return callable.$factory}else if(callable.$is_class){
return callable.$factory=$B.$instance_creator(callable)}else if(callable.$is_js_class){
return callable.$factory=function(){return new callable(...arguments)}}else if(callable.$in_js_module){
return function(){var res=callable(...arguments)
return res===undefined ? _b_.None :res}}else if(callable.$is_func ||typeof callable=="function"){if(callable.$infos && callable.$infos.__code__ &&
(callable.$infos.__code__.co_flags & 32)){$B.last($B.frames_stack).$has_generators=true}
return callable}
try{return $B.$getattr(callable,"__call__")}catch(err){throw _b_.TypeError.$factory("'"+$B.class_name(callable)+
"' object is not callable")}}
It tests whether the object has a specific tag `$xxxxx` to decide what to do.
What if, when adding these tags to these objects, you also added a `$get_callable` function (non-enumerable, non-configurable, non-writable?)? It might help avoid all these tests, as well as enable giving a prebuilt or lazily-built callable. Then, would you still need a `$B.$call1` function separate from `$B.$call`?
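A minimal sketch of the `$get_callable` idea, assuming we are free to attach a non-enumerable property at tagging time. Every name here (`tagAsClass`, `makeInstanceCreator`, `call1`) is invented for illustration; Brython's real tagging code differs:

```javascript
// Hypothetical stand-in for $B.$instance_creator
function makeInstanceCreator(cls){
    return (...args) => ({cls: cls.name, args});
}

function tagAsClass(obj){
    obj.$is_class = true;
    Object.defineProperty(obj, "$get_callable", {
        enumerable: false, configurable: false, writable: false,
        value(){
            // build the factory lazily on first use, then cache it
            if(!this.$factory){ this.$factory = makeInstanceCreator(this); }
            return this.$factory;
        }
    });
    return obj;
}

// $B.$call1 collapses to a single property lookup plus a fallback
function call1(callable){
    return callable.$get_callable ? callable.$get_callable() : callable;
}

const cls = tagAsClass({name: "Point"});
console.log(call1(cls)(1, 2).args); // [1, 2]
console.log(call1(cls) === call1(cls)); // true: the factory is cached
```

The dispatch chain of `$is_class` / `$is_js_class` / `$factory` tests becomes one property access, and the factory can still be built lazily.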
`$B.enter_frame` (10% of the time):
$B.enter_frame=function(frame){
if($B.frames_stack.length > 1000){var exc=_b_.RecursionError.$factory("maximum recursion depth exceeded")
$B.set_exc(exc,frame)
throw exc}
frame.__class__=$B.frame
$B.frames_stack.push(frame)
if($B.tracefunc && $B.tracefunc !==_b_.None){if(frame[4]===$B.tracefunc ||
($B.tracefunc.$infos && frame[4]&&
frame[4]===$B.tracefunc.$infos.__func__)){
$B.tracefunc.$frame_id=frame[0]
return _b_.None}else{
for(var i=$B.frames_stack.length-1;i >=0;i--){if($B.frames_stack[i][0]==$B.tracefunc.$frame_id){return _b_.None}}
try{var res=$B.tracefunc(frame,'call',_b_.None)
for(var i=$B.frames_stack.length-1;i >=0;i--){if($B.frames_stack[i][4]==res){return _b_.None}}
return res}catch(err){$B.set_exc(err,frame)
$B.frames_stack.pop()
err.$in_trace_func=true
throw err}}}else{$B.tracefunc=_b_.None}
return _b_.None}
The first test raises the maximum recursion depth error. The reallocation of the stack may be what costs most of the execution time (I can't believe that <7 conditions could explain this 10%).

`$B.leave_frame` (8.4% of the time):
(8.4% of the time)
$B.leave_frame=function(arg){
if($B.frames_stack.length==0){
return}
if(arg && arg.value !==undefined && $B.tracefunc){if($B.last($B.frames_stack).$f_trace===undefined){$B.last($B.frames_stack).$f_trace=$B.tracefunc}
if($B.last($B.frames_stack).$f_trace !==_b_.None){$B.trace_return(arg.value)}}
var frame=$B.frames_stack.pop()
if(frame.$has_generators){for(var key in frame[1]){if(frame[1][key]&& frame[1][key].__class__===$B.generator){var gen=frame[1][key]
if(gen.$frame===undefined){continue}
var ctx_managers=gen.$frame[1].$context_managers
if(ctx_managers){for(var cm of ctx_managers){$B.$call($B.$getattr(cm,'__exit__'))(
_b_.None,_b_.None,_b_.None)}}}}}
delete frame[1].$current_exception
return _b_.None}
This really shouldn't take so much time. Could `$B.tracefunc` be equal to `None`? Then it'd evaluate to `true`, because currently `None` is an `Object`?
I frankly don't know what is happening here.
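Whatever is happening there, the stack handling itself could be made allocation-free. A rough sketch of the linked-frame idea (invented names, plain JS objects, not Brython's actual frame layout):

```javascript
// Frames link to their parent via `previous`, so enter/leave are pointer
// swaps and the engine never reallocates a growing array.
let topFrame = null;
let depth = 0;

function enterFrame(frame){
    if(depth >= 1000){ throw new Error("maximum recursion depth exceeded"); }
    frame.previous = topFrame;  // each frame remembers its parent
    topFrame = frame;
    depth++;
}

function leaveFrame(){
    const frame = topFrame;
    topFrame = frame.previous;  // O(1), no array traffic
    depth--;
    return frame;
}

enterFrame({name: "module"});
enterFrame({name: "f"});
console.log(leaveFrame().name); // "f"
console.log(topFrame.name);     // "module"
```

The recursion check becomes an integer comparison, and the GC only sees the frame objects themselves, not a repeatedly resized backing array.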
`$B.args0` (38% of the time: 26% parse_args, 6% because of using an iterator):
$B.args0=function(f,args){
var arg_names=f.$infos.arg_names,code=f.$infos.__code__,slots={}
for(var arg_name of arg_names){slots[arg_name]=empty}
return $B.parse_args(
args,f.$infos.__name__,code.co_argcount,slots,arg_names,f.$infos.__defaults__,f.$infos.__kwdefaults__,f.$infos.vararg,f.$infos.kwarg,code.co_posonlyargcount,code.co_kwonlyargcount)}
- Caching `slots` for each function could prevent rebuilding one at each call???
- `arg_names` seems to be an array? Then use `for(let i = 0; i < x.length; ++i)` to get a little speed increase (6%?).
- `$B.parse_args` seems to require a lot of property accesses. Setting `infos = f.$infos` may help a very little??? (not sure about this one)
- `$B.parse_args` takes 26% of the time, even though the benchmarked function has only one parameter.

I think there is stuff to do here also.
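As a sketch of the first point, a per-function `slots` template could be built once and cloned on each call instead of rebuilt in a loop. `$slots_template`, `EMPTY`, and the `f.argNames` layout are assumptions for illustration, not Brython's real structures:

```javascript
const EMPTY = Symbol("empty"); // stand-in for Brython's `empty` sentinel

function makeSlotsTemplate(argNames){
    const template = {};
    for(const name of argNames){ template[name] = EMPTY; }
    return template;
}

function args0(f, args){
    // build the template once, on first call, then reuse it forever
    if(!f.$slots_template){
        f.$slots_template = makeSlotsTemplate(f.argNames);
    }
    const slots = Object.assign({}, f.$slots_template);
    for(let i = 0; i < args.length; i++){ slots[f.argNames[i]] = args[i]; }
    return slots;
}

const f = { argNames: ["x", "y"] };
const slots = args0(f, [1]); // y stays EMPTY until defaults are applied
console.log(slots.x); // 1
```

Whether `Object.assign` on a template actually beats the original `for...of` loop would need measuring, but it at least moves the per-call work to a shape the JIT can specialize.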
$B.parse_args=function(args,fname,argcount,slots,arg_names,defaults,kwdefaults,vararg,kwarg,nb_posonly,nb_kwonly){
var nb_passed=args.length,nb_passed_pos=nb_passed,nb_expected=arg_names.length,nb_pos_or_kw=nb_expected-nb_kwonly,posonly_set={},nb_def=defaults.length,varargs=[],extra_kw={},kw
for(var i=0;i < nb_passed;i++){var arg=args[i]
if(arg && arg.__class__===$B.generator){slots.$has_generators=true}
if(arg && arg.$kw){
nb_passed_pos--
kw=$B.parse_kwargs(arg.$kw,fname)}else{var arg_name=arg_names[i]
if(arg_name !==undefined){if(i >=nb_pos_or_kw){if(vararg){varargs.push(arg)}else{throw too_many_pos_args(
fname,kwarg,arg_names,nb_kwonly,defaults,args,slots)}}else{if(i < nb_posonly){posonly_set[arg_name]=true}
slots[arg_name]=arg}}else if(vararg){varargs.push(arg)}else{throw too_many_pos_args(
fname,kwarg,arg_names,nb_kwonly,defaults,args,slots)}}}
for(var j=nb_passed_pos;j < nb_pos_or_kw;j++){var arg_name=arg_names[j]
if(kw && kw.hasOwnProperty(arg_name)){
if(j < nb_posonly){
if(! kwarg){throw pos_only_passed_as_keyword(fname,arg_name)}}else{slots[arg_name]=kw[arg_name]
kw[arg_name]=empty}}
if(slots[arg_name]===empty){
def_value=defaults[j-(nb_pos_or_kw-nb_def)]
if(def_value !==undefined){slots[arg_name]=def_value
if(j < nb_posonly){
if(kw && kw.hasOwnProperty(arg_name)&& kwarg){extra_kw[arg_name]=kw[arg_name]
kw[arg_name]=empty}}}else{var missing_pos=arg_names.slice(j,nb_expected-nb_kwonly)
throw missing_required_pos(fname,missing_pos)}}}
var missing_kwonly=[]
for(var i=nb_pos_or_kw;i < nb_expected;i++){var arg_name=arg_names[i]
if(kw && kw.hasOwnProperty(arg_name)){slots[arg_name]=kw[arg_name]
kw[arg_name]=empty}else{var kw_def=_b_.dict.$get_string(kwdefaults,arg_name)
if(kw_def !==_b_.dict.$missing){slots[arg_name]=kw_def}else{missing_kwonly.push(arg_name)}}}
if(missing_kwonly.length > 0){throw missing_required_kwonly(fname,missing_kwonly)}
if(! kwarg){for(var k in kw){if(! slots.hasOwnProperty(k)){throw unexpected_keyword(fname,k)}}}
for(var k in kw){if(kw[k]===empty){continue}
if(! slots.hasOwnProperty(k)){if(kwarg){extra_kw[k]=kw[k]}}else if(slots[k]!==empty){if(posonly_set[k]&& kwarg){
extra_kw[k]=kw[k]}else{throw multiple_values(fname,k)}}else{slots[k]=kw[k]}}
if(kwarg){slots[kwarg]=$B.obj_dict(extra_kw)}
if(vararg){slots[vararg]=$B.fast_tuple(varargs)}
return slots}
For the function itself, the cost is 16%. You really might get some speed increase and memory usage reduction here. If I take a look at the GC, it allocated 10 MB of RAM. Nearly every 10 ms, the GC is called and lasts ~0.1 ms (so 1% of execution time). This may be an indication that we are making many, many allocations?
@PierreQuentel Could you guide me on the steps to go from the JS code generated by the Editor to a JS file I can execute in my browser? That way, I'd be able to modify it, and modify `$B`, to test my different hypotheses.
EDIT: I succeeded. I put the code inside a `setTimeout()` (not quite ideal), and set `$B.imported["exec"] = {}`.
I'll try some tests on `$B` when I have the time.
A strange thing is that the converted JS code (from the editor) seems to execute x2 faster than the Py code (run by the last dev version).
Well, optimizing the `for i in range(a,b)` into a `for(i = a; i < b; ++i)` (and removing one useless intermediate) is, overall, 6.7% faster. The loop was 17% of total execution time, with the iterator being 6% of total execution time. This means a -61% execution time for the loop.
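For comparison, a minimal sketch of the two loop shapes with plain JS numbers (a generator standing in for Brython's `range` iterator; real Brython ints would complicate this, as noted above):

```javascript
// Iterator shape: one next() call and one result object per pass
function* pyRange(a, b, step = 1){
    for(let i = a; i < b; i += step){ yield i; }
}

let accIter = 0;
for(const i of pyRange(0, 1000)){ accIter += i; }

// Native shape: plain counter, no allocation, no next()
let accNative = 0;
for(let i = 0; i < 1000; ++i){ accNative += i; }

console.log(accIter === accNative); // true
console.log(accNative);             // 499500
```

The transformation is only valid when the compiler can prove `range` is the builtin and the bounds are plain integers, which is why it ties into the `BigInt` question.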
Pretty sure we can easily achieve a -50% execution time with all the things I found... and even more.
And indeed, removing the 2 debug trace conditions is a perf gain of only -0.84%.
However, the more optimizations are made, the more this share will grow. Also, it seems there are other operations linked to the debug traces in `leave_frame` and `enter_frame` that might also continue to increase this number.
If I had to guess, I'd believe we could achieve a > -4% speed gain (~x2 if we achieve the -50% execution time with other optimisations, more if we do more; and ~x2 again with other trace-related code from other functions). But yeah, better to keep it on the side and come back to it when stronger optimizations have been done.
Disclaimer: I'm talking about optimization because it is fun (and it helps me explore some concepts), but maybe this shouldn't be the priority.
TL;DR: We can really increase parameter resolution speed by implementing 8 different functions/ways to resolve them, depending on how the function is defined and how it is called.
Instead of calling `$B.$call(fct)(args)`, we would do something like:

$B.$callType1( fct, args) {
    return fct.callType1(fct, args);
}
$B.$callType2( fct, args, kwargs) {
    return fct.callType2(fct, args, kwargs);
}
// etc.
`$B.parse_args()` is a generic function. In Python, there are 4 ways to pass arguments to a function, and 3 ways to declare an argument, plus default values:

1,2,3
a=1,b=2,c=3

Which makes 2^7 = 128 possible combinations of function calls plus the default values...
A. But maybe there are ways to write some special `$B.$call(fct, args)` functions for some pretty common calls, enabling us to speed them up? E.g. `$B.$callS(fct, args)` for `(1,2,3,4)`, i.e. without "=", "*" or "**" inside the call. Knowledge of the arguments could enable us to use some heuristics for some types of calls and functions. Some of the ordering rules can also help, e.g. a positional argument can't come after a named one.
The list of calls I think can be interesting:
Maybe there are ways to merge some without any cost, but that may make argument handling easier.
Of the 7, only 3 cases are really different, with one only requiring a check of the name positions at the end, so only 2 really different cases for calls.
The list of function types that I find interesting:
Which makes 8 combinations that could be handled to speed up argument resolution, and be really significant for some simple but common function calls.
At translation time, the function declares the types of argument resolution it supports, e.g. `foo.$resolve_args_CallType1 = resolve_args[DeclarationType1][CallType1]`, and when called, we let the AST decide which function to call.
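A hedged sketch of what such a resolver table could look like; `resolve_args`, the type names, and `compileFunction` are all invented here, not Brython code:

```javascript
// 2-D table: [declarationType][callType] -> specialized resolver
const resolve_args = {
    positionalOnlyDecl: {
        positionalOnlyCall(argNames, args){
            // fast path: zip names with values, no keyword handling at all
            const slots = {};
            for(let i = 0; i < argNames.length; i++){ slots[argNames[i]] = args[i]; }
            return slots;
        },
        // ...other call shapes would go here
    },
    // ...other declaration shapes
};

function compileFunction(fn, declType){
    // at "translation time", attach the resolvers this function supports
    fn.$resolvePositional = resolve_args[declType].positionalOnlyCall;
    return fn;
}

const f = compileFunction(function(){}, "positionalOnlyDecl");
console.log(f.$resolvePositional(["a", "b"], [1, 2])); // {a: 1, b: 2}
```

Since the call site's shape is known statically, the generated code can jump straight to the right resolver instead of re-discovering the shape inside `parse_args` on every call.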
B. Maybe we can, when calling, sort named parameters during AST transpilation?

const f = ... // we have to keep the order of operations
const Z = ... // was originally the first given parameter
const X = ... // was originally the second given parameter
$B.call( f )({a: X, b: Z})

This would facilitate some operations/algorithms while costing nothing, as it would be done at translation time.
C. JavaScript doesn't like it when we don't specify the function parameters; maybe we should declare them, even if we are not using them. It helps the browser know how many arguments the function is likely to take. Could be `foo(...args)` instead of using `arguments`?
Note: In modern code, rest parameters should be preferred. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/arguments
@PierreQuentel If you want to try out these optis, maybe we should discuss first, because they are a little complex and we would need to do things slowly, step by step.
Okay, `**args` must keep argument order... so B. isn't possible.
Then, writing `foo.$callX(2,3,4, {a:1, b:3, ...args})` / `foo.$callX(2,3,4, args)` / `foo.$callX(2,3,4, null)` (assuming the last argument is always the named arguments) would help merge, in parameter parsing, the combinations `named`, `**args`, and `named+**args`. This would also prevent one useless array and one object creation.
Aaaaand it's not possible, as Python needs to raise an exception when e.g. `args` contains `a` while there would then be 2 variables named `a`... No wonder Python is so slow... Lots of optimisations can't be performed due to the ways function calls are authorized and forbidden...
The solution is maybe something like :
slots = {} // `arguments` is a reserved name inside JS functions, so call it slots
// easily parse positional arguments. EZ.
if( ! hasNamedArguments ) {
    if( offset < isRequiredIdx )
        ; // throw an exception: missing required arguments
    for( ; offset < nbArgs; ++offset )
        slots[ the_function_args_name[offset] ] = the_function_default_values[offset]
    return //...
}

let keys = the_function_args_name.slice( offset ); // we need a copy + ignore the parameters already positioned.

// do it twice, for named arguments and **args.
for( let name in varnames ) {
    let i = keys.indexOf( name ); // a Set is more efficient for big sets, but I don't think it'll be faster for us.
    if( i === -1 ) {
        // handle the error here (we can do less optimized operations here, it's not an issue).
        // if **args is in the function declaration, give them to parameters[the_**args] = varnames[name].
        // maybe this needs another function: if a **args argument is not found in keys, it'll be put into the
        // **args parameter, while a named argument could have been the one removing it from keys.
        // or add a check: if( name in named_arguments ) => throw an error, else insert into **args.
    }
    slots[name] = varnames[name]
    keys[i] = null // so that it won't be found again and we can raise an error on duplicates.
}

// check whether some required parameters are still missing.
for( let k = 0; k < keys.length; ++k ) {
    if( keys[k] !== null ) { // still unfilled
        if( offset + k < isRequiredIdx )
            ; // raise an exception: missing required argument
        else
            slots[ keys[k] ] = the_function_default_values[offset + k]
    }
}
Now we would need to somehow merge `named arguments` and `**args`.
EDIT: why not doing `let entries = [...Object.entries(named_arguments), ...Object.entries(**args_argument)]`?
It could be made in the function call:

call( .... , null ) // no named
call( .... , Object.entries(**args_argument) )
call( .... , [ ["a", value], ["b", value] ] ) // not quite efficient, as it will create an array for each named argument...
call( .... , [ ["a", value], ["b", value], ...Object.entries(**args_argument) ] )

OR

call( .... , null, null ) // no named
call( .... , Object.keys(**args_argument), Object.values(**args_argument) ) // needs care.
call( .... , [ "a", "b" ], [ value, value ] ) // maybe a little more efficient?
call( .... , [ "a", "b", ...Object.keys(**args_argument) ], [ value, value, ...Object.values(**args_argument) ] ) // needs care.

OR

Doing `let args = [...Object.entries(named_arguments), ...Object.entries(**args)]` inside the `hasNamedArguments` branch?

OR

Doing, inside the `hasNamedArguments` branch:

let args_keys = [...Object.keys(named_arguments), ...Object.keys(**args)]
let args_values = [...Object.values(named_arguments), ...Object.values(**args)]

OR a call like:

call( .... , null, null ) // no named
call( .... , {a:2, b:4}, null )
call( .... , null, args_argument )
call( .... , {a:2, b:4}, args_argument )

Then:
- when both are null, we can skip the `for( let name in varnames )` loop entirely;
- when only one of them is given, in the `for( let name in varnames )` loop there is no need to check whether `name` is in `named_arguments`;
- when both are given, in the `for( let name in varnames )` loop we only need to check whether `name` is in `named_arguments` (`if( name in named_arguments )`?).

The last solution might be the best?
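Whichever calling shape is chosen, the duplicate check that Python requires can stay cheap if named arguments and `**args` are merged in one pass. A sketch of that merge (the `mergeKeywords` helper is invented here, not Brython code):

```javascript
// Merge explicit keyword arguments with a **kwargs object, raising on
// duplicates exactly as Python does for f(a=1, **{"a": 3}).
function mergeKeywords(named, starstar){
    const out = named ? {...named} : {};
    if(starstar){
        for(const key of Object.keys(starstar)){
            if(key in out){
                throw new TypeError(`got multiple values for argument '${key}'`);
            }
            out[key] = starstar[key];
        }
    }
    return out;
}

console.log(mergeKeywords({a: 1}, {b: 2})); // merged without conflict
// mergeKeywords({a: 1}, {a: 3}) would throw: multiple values for 'a'
```

Passing `named` and `starstar` as separate slots (the last calling shape above) keeps both `null` checks free and leaves this merge as the only non-trivial work.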
Question:
I see that you are inserting `try{} catch(e){}` everywhere; is there a reason for that, instead of doing something like:

throw new PythonError("message", js_error, frame)

then catching it either during a Python `try: except:`, or when giving a function to JS, or at "top-level" places (I guess with `async` functions / at the file level / etc.)?
try {} catch(e) {
    if( e instanceof PythonError ){
        // do all the leave_frame, set_exc, and trace_exception here by unstacking the frame stack?
        // this could be a function like $B.$process_py_exception(e)
        let frame_cursor = e.frame;
        while( frame_cursor !== frame ) {
            $B.set_exc(e.err, frame_cursor);
            if( (! e.err.$in_trace_func) && frame_cursor.$f_trace !== _b_.None ){
                frame_cursor.$f_trace = $B.trace_exception()
            }
            $B.leave_frame();
            frame_cursor = frame_cursor.previous
        }
        // do other stuff here.
    }
}
For this to work, you'd likely have to catch JS exceptions when calling JS functions. But otherwise, JS exceptions shouldn't occur during Brython function calls... and if one does, it is a bug that should be fixed?
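A minimal sketch of that single-throw idea: the error carries the frame where it was raised, and one handler walks the chain. Everything here (`PythonError`'s shape, `unwindTo`, the frame objects) is hypothetical:

```javascript
class PythonError extends Error {
    constructor(message, frame){
        super(message);
        this.frame = frame; // frame where the exception was raised
    }
}

// Walk from the raising frame up to (not including) stopFrame,
// running the leave_frame / set_exc / trace hooks along the way.
function unwindTo(err, stopFrame, onLeave){
    let cursor = err.frame;
    while(cursor && cursor !== stopFrame){
        onLeave(cursor);
        cursor = cursor.previous; // frames form a linked chain
    }
}

// Usage: three nested frames, handler installed at the outermost one.
const top = {name: "module", previous: null};
const mid = {name: "g", previous: top};
const inner = {name: "f", previous: mid};
const left = [];
try{
    throw new PythonError("boom", inner);
}catch(e){
    if(e instanceof PythonError){ unwindTo(e, top, fr => left.push(fr.name)); }
}
console.log(left); // ["f", "g"]
```

The per-call `try/catch` disappears; the cost of frame bookkeeping is only paid on the (rare) error path.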
I put a summary in the first message of this issue so that it'll be easier to look things up.
[Sorry, posted it in the wrong issue]
Conclusions:
- `args0_new` is sooo much faster than the previous parsing function.
- `$B.$call` now costs HALF of the function call time. This needs to be fixed.
- `$B.augm_assign` is 23% of total exec time; I think this can be improved a lot. 23% for `+=` while it is 2.5% for `+`, it's strange.

@PierreQuentel Do you have some code (only using Brython core) you'd want me to benchmark?
New benchmark with the new `args0` parsing method:
Raw logs:
Firefox 2023-11-16 12.31 profile.json.gz
File:
<!DOCTYPE html>
<html>
<head>
<!-- Required meta tags-->
<meta charset="utf-8">
<title>X</title>
<!-- Brython -->
<script src="https://raw.githack.com/brython-dev/brython/master/www/src/brython.js"></script>
<!--<script src="https://raw.githack.com/brython-dev/brython/master/www/src/brython_stdlib.js"></script>-->
<script type="text/python">
from browser import document
def f(i):
    return i+i

def loop():
    acc = 0
    for i in range(100000000):
        acc += f(i)
    return acc
import time
start = time.time()
acc = loop()
end = time.time()
document <= "Done in " + str(end - start)
print(acc)
</script>
</head>
<body>
</body>
</html>
Another possibility that might be interesting:
The advantage is that we'd automatically have an internal function we can call from JS. The internal function would take either an object (for Python-implemented functions) or an array of arguments (for JS-implemented ones), depending on the parser used.
This should also simplify the code of `$B.$call`, which is currently very slow.
We can modify the prototype of `Object` and/or `Function` to add a default `$call` method if necessary.
For Python functions, `$call` could be used to factorize some code, like the `try/catch` and other boilerplate. This would reduce code size, and therefore the parsing time (which is very slow).
Summary

Code cleaning:
- `try/catch` everywhere (more info here)
- `build_fct_info()` to reduce generated code size??? (more info)

On AST code generation:
- `$B.call( $B.getattr(obj, f) )(args)` to `$B.callM( obj, f )(args)` (more info)
- `$B.call( f )(args)` to `$B.callM( f, args )` (more info / 4% total exec time)
- `function foo(...args)` instead of `arguments`??? (more info)

Potential optimisations:
- `.$getCallable()` on callable objects/functions (more info / 4% total exec time)
- `for in range` (more info / -6.7% of total exec time) - requires `integer` to be implemented as `BigInt`

Done:
- `enter_frame` / `leave_frame` / `frame = []`: instead of a stack, use a unidirectional tree with a pool (more info / 18.4% total exec time)
- `decode_position` in the catch (more info)

Other:
====================================================================
Hi,
The support of `settrace()` adds 2 `if`s in each function. I tested several ways to implement it to see if we can improve its performance (https://jsperf.app/fisuce). As always, Chromium optimisation produces strange results on short examples.

A. Use functions to precompute the condition:

One way that seems to allow optimisation is to replace the condition by a function call, i.e. instead of doing something like:

doing:

Of course this is a mock example, but it has the advantage of reducing the generated code size, improving code readability, and it can lead to execution speed increases (it seems at least as fast as the current method, and sometimes faster on Chromium).

The condition for that is to precompute the condition when calling `sys.settrace()`:

And once `sys.settrace()` is called, we change the value of `$B.enter_frame`. Then it may be less efficient, but debugging isn't meant to be efficient. It'll likely need some tweaks/tests to see if small changes can produce better results.
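A toy sketch of the swap, with invented names (`enterFrameFast` / `enterFrameTraced`, a simplified `$B`) standing in for the real `$B.enter_frame` variants:

```javascript
const $B = { frames: [], tracefunc: null };

function enterFrameFast(frame){        // no trace test at all on the hot path
    $B.frames.push(frame);
}

function enterFrameTraced(frame){      // slower variant, installed on demand
    $B.frames.push(frame);
    $B.tracefunc(frame, "call");
}

$B.enter_frame = enterFrameFast;

function settrace(fn){
    $B.tracefunc = fn;
    // the condition is "precomputed": swap once instead of testing per call
    $B.enter_frame = fn ? enterFrameTraced : enterFrameFast;
}

const calls = [];
$B.enter_frame({});                    // fast path, no check
settrace((frame, event) => calls.push(event));
$B.enter_frame({});                    // traced path
console.log(calls); // ["call"]
```

The generated code always calls `$B.enter_frame`, so untraced execution never pays for the `tracefunc` tests.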
B. An option "opti.settrace-support" (defaulting to `true`):

According to the Python documentation:

We could safely assume that a Brython user may want to use `settrace()` when developing, but likely won't need it when deploying their website. Hence, I suggest adding an "opti.settrace-support" option defaulting to `true` (so with support of `settrace()`). Then, when users want to deploy their website and get better performance, they could disable this option.

Once disabled, this option won't include the `settrace` lines in the produced JS code, so it won't print:

and

(and maybe other lines).

With this option disabled, function calls could be x10 faster on FF, and x1.8 faster on Chromium.
Cordially,