enso-org / enso

Hybrid visual and textual functional programming.
https://enso.org
Apache License 2.0

Add benchmarks to `runtime-compiler` #8419

Closed: Akirathan closed this issue 7 months ago

Akirathan commented 10 months ago

There are no benchmarks for parsing or compiling. Let's add benchmarks to the `runtime-compiler` and/or `runtime-parser` projects. Ideally, make sure that these benchmarks are visible at https://enso-org.github.io/engine-benchmark-results/engine-benchs.html

This will be good for #7054.

JaroslavTulach commented 9 months ago

Let's create a `compiler` directory next to the existing `semantic` one and put the benchmarks there. They will then automatically appear at https://enso-org.github.io/engine-benchmark-results/engine-benchs.html

JaroslavTulach commented 9 months ago

Let the benchmark generate typical end-user code:

from Standard.Base import all

main =
    operator1 = File.read "blabla"
    operator2 = operator1.xyz 2 where=Location.Start
    operator3 = operator1.abc "Hi3"

If we can generate such code, then we can have benchmarks for files of 100, one thousand, and ten thousand lines and compare the scalability of our implementations.
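
For illustration, a minimal Java sketch of such a generator (the class and method names here are hypothetical, not taken from the actual benchmark code):

```java
// Hypothetical generator (not the actual benchmark code): repeats the
// three-operator pattern `groups` times, producing roughly 3 * groups
// lines of Enso code inside `main`.
final class EnsoSourceGenerator {
    static String generateSource(int groups) {
        var sb = new StringBuilder("from Standard.Base import all\n\nmain =\n");
        int i = 1;
        for (int g = 0; g < groups; g++, i += 3) {
            sb.append("    operator").append(i).append(" = File.read \"blabla\"\n");
            sb.append("    operator").append(i + 1)
              .append(" = operator").append(i).append(".xyz 2 where=Location.Start\n");
            sb.append("    operator").append(i + 2)
              .append(" = operator").append(i).append(".abc \"Hi3\"\n");
        }
        return sb.toString();
    }
}
```

Calling `generateSource(34)`, `generateSource(334)`, and `generateSource(3334)` would then yield roughly 100-, 1000-, and 10000-line files.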

JaroslavTulach commented 9 months ago

What shall we measure? We want to measure the creation of the IR and the application of compiler passes to it - probably how long it takes to invoke the `Compiler.run` method. However, that method requires an implementation of `CompilerContext.Module`, which isn't easy to get. One way is to mock it, but it is probably easier to just call `org.graalvm.polyglot.Context.eval("enso", ...)` and get a reference to the `main` method (without invoking it). That should be simpler (as the API already exists) and good enough to begin with.
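
A rough sketch of that approach (assuming the Enso language is available on the polyglot class path; how the module value exposes `main` is an assumption here and may differ in practice):

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Source;
import org.graalvm.polyglot.Value;

public class CompileOnlyExample {
    public static void main(String[] args) throws Exception {
        String code = """
            from Standard.Base import all

            main = 42
            """;
        try (Context ctx = Context.newBuilder("enso").allowAllAccess(true).build()) {
            Source src = Source.newBuilder("enso", code, "Main.enso").build();
            // eval parses and compiles the module; this is the work we want to time
            Value module = ctx.eval(src);
            // looking up `main` forces resolution but does not execute it
            Value mainFn = module.getMember("main");   // assumed member lookup
            System.out.println("main resolved: " + (mainFn != null));
        }
    }
}
```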

radeusgd commented 8 months ago

> Let the benchmark generate typical end-user code:
>
> from Standard.Base import all
>
> main =
>     operator1 = File.read "blabla"
>     operator2 = operator1.xyz 2 where=Location.Start
>     operator3 = operator1.abc "Hi3"
>
> If we can generate such code, then we can have benchmarks for files of 100, one thousand, and ten thousand lines and compare the scalability of our implementations.

Won't this hide the issue of varying levels of complexity? I.e. a 10-line function with lots of variables and dependencies may be more complex to analyze (especially if something in it were O(N^2)) than 10 independent, very trivial 1-2 line functions.

Maybe we could try using our Standard.Base library as the 'corpus' for the benchmarks? It should contain methods of varying levels of complexity and is probably the best 'example' we can currently get of a big codebase in Enso that uses various kinds of patterns.

What do you think?

Akirathan commented 8 months ago

@radeusgd I think that the proposal from @JaroslavTulach makes more sense for now, as it more closely resembles what is actually parsed and compiled in the IDE.

Besides, when you use `from Standard.Base import all`, all the transitively reachable modules are compiled as well. So I am not sure I follow your reasoning here.

radeusgd commented 8 months ago

> Besides, when you use `from Standard.Base import all`, all the transitively reachable modules are compiled as well. So I am not sure I follow your reasoning here.

🤦 oh I have somehow completely missed that. Then indeed my suggestion is moot, you are 100% right.

radeusgd commented 8 months ago

Well, I guess one point still stands: I don't think we should be generating sources by repeating the 3-line example many times (10s, 100s, 1000s).

Because then the timing will be dominated by parsing copies of these simple 3 lines, instead of by the time needed to compile Standard.Base - which I imagine is much more complicated to compile and provides a better benchmark of practical usage.

Maybe both are worth measuring, though.

JaroslavTulach commented 8 months ago

> Maybe we could try using our Standard.Base library as the 'corpus' for the benchmarks?

> complex to analyze (especially if something in it were O(N^2))

The point of having files of various sizes is exactly to identify the complexity of our algorithms! We don't want O(N^2) algorithms in places where speed matters.

> 10 independent, very trivial 1-2 line functions

There can also be a benchmark that generates 10, 100, or 1000 simple functions. That checks scalability from another angle; a sketch of such a generator follows.
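
For example, another hypothetical method that could sit next to the `generateSource` sketch above:

```java
// Hypothetical: emits `n` trivial one-line functions plus a `main` that
// references one of them, so each function is analyzed independently.
static String generateTinyFunctions(int n) {
    var sb = new StringBuilder("from Standard.Base import all\n\n");
    for (int i = 0; i < n; i++) {
        sb.append("fn").append(i).append(" x = x + ").append(i).append("\n\n");
    }
    sb.append("main = fn0 1\n");
    return sb.toString();
}
```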

> I don't think we should be generating sources by repeating

The important goal is to have the benchmarking infrastructure in place, run some benchmarks, and collect the results. And, most importantly, to make it easy to add new benchmarks to the system when a new performance problem is found.
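
To give an idea of the shape this could take, here is a rough JMH-style skeleton (assuming the engine benchmark harness is JMH-based; the class name, parameter values, and the `EnsoSourceGenerator` helper come from the earlier sketches and are assumptions, not the code that was eventually merged):

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Source;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class ModuleCompilationBenchmark {
    // number of generated operator triples: ~100, ~1000, ~10000 lines of code
    @Param({"34", "334", "3334"})
    int groups;

    Context ctx;
    Source src;

    // a fresh Context per invocation, so module caching does not hide compilation cost
    @Setup(Level.Invocation)
    public void setUp() throws IOException {
        ctx = Context.newBuilder("enso").allowAllAccess(true).build();
        src = Source.newBuilder("enso",
                EnsoSourceGenerator.generateSource(groups), "Main.enso").build();
    }

    @TearDown(Level.Invocation)
    public void tearDown() {
        ctx.close();
    }

    @Benchmark
    public Object compileOnly() {
        // eval parses and compiles the module; `main` is never invoked
        return ctx.eval(src);
    }
}
```

Recreating the `Context` per invocation is meant to keep module caching from hiding the compilation work, in line with the later standup note about ensuring that only the compiler is measured.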

enso-bot[bot] commented 7 months ago

Pavel Marek reports a new STANDUP for today (2024-02-23):

Progress: - Created some sensible categories for benchmarks

enso-bot[bot] commented 7 months ago

Pavel Marek reports a new STANDUP for today (2024-02-26):

Progress: - Adding more benchmarks, ensuring that only the compiler is measured. It should be finished by 2024-03-01.

enso-bot[bot] commented 7 months ago

Pavel Marek reports a new STANDUP for today (2024-02-27):

Progress: - First batch of benchmarks is ready for review. It should be finished by 2024-03-01.

enso-bot[bot] commented 7 months ago

Pavel Marek reports a new STANDUP for today (2024-02-29):

Progress: - Quick fix for failing benchmark builds at https://github.com/enso-org/enso/pull/9220

enso-bot[bot] commented 7 months ago

Pavel Marek reports a new STANDUP for today (2024-03-01):

Progress: - Initial look at benchmark regressions