
Try to improve performance of Dataflow Errors #8137

Closed radeusgd closed 10 months ago

radeusgd commented 1 year ago

The benchmarks in #8130 show that our handling of Panics is super fast; dataflow errors, however, are far behind, being 200x - 8500x slower.

It would be good if we could get the performance of dataflow errors a bit closer to that of panics, because dataflow errors are the primary way of reporting errors in the Enso standard library.

While we are at it, it would be good to retain the ability to query the stack trace of a dataflow error when needed. In most scenarios the stack trace is ignored (so the computational effort used to create it is wasted), but in some cases it can be very helpful - e.g. when reporting test suite failures or debugging. Panics somehow manage to retain on-demand stack traces while also being fast - ideally we would have the same for dataflow errors, if possible.

radeusgd commented 1 year ago

See this comment in the related PR for instructions on how to run the benchmarks with JMH.

JaroslavTulach commented 1 year ago

DataflowError creation hasn't always been slow. It slowed down at:

cc73338 - Radosław Waśko radoslaw.wasko@enso.org Mar 18, 2022 The library developer should be able to handle specific types of Panics while passing through others (#3344)

Also ensures that Dataflow Errors have proper stack traces (earlier they did not point at the right location).

radeusgd commented 1 year ago

DataflowError creation hasn't always been slow. It slowed down at:

cc73338 - Radosław Waśko radoslaw.wasko@enso.org Mar 18, 2022 The library developer should be able to handle specific types of Panics while passing through others (#3344)

Also ensures that Dataflow Errors have proper stack traces (earlier they did not point at the right location).

Yes, I suspected fixing stack traces was likely the cause for slowdown. We probably need to add them in a smarter 'more lazy' way.

Note that before this fix the error stack traces were completely useless and misleading - IIRC, if you queried an error's stack trace, you got the location of the first place that error's stack trace was queried from - i.e. usually just your current Runtime.get_stack_trace.

radeusgd commented 1 year ago

I was curious and did a little bit of digging into TruffleStackTrace.

So apparently, while unwinding the stack after an exception is thrown, whenever a CallTarget is passed, DefaultRuntimeAccessor.LANGUAGE.onThrowable is called, which in turn calls TruffleStackTrace.addStackFrameInfo, which lazily adds stack frames to the AbstractTruffleException.

Thus, if a Panic is caught only a few frames later and its full stack trace is never queried, we are saved from the overhead of constructing the full stack trace - hence the awesome performance.
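
For illustration, here is a minimal Java sketch of that mechanism, assuming only the public TruffleStackTrace / AbstractTruffleException API (the class here is made up, not one of Enso's actual exception classes): the exception stores no frames itself, and they are only materialized when somebody asks for them.

import com.oracle.truffle.api.TruffleStackTrace;
import com.oracle.truffle.api.TruffleStackTraceElement;
import com.oracle.truffle.api.exception.AbstractTruffleException;
import java.util.List;

// Illustrative only: Truffle records frame info lazily while this exception
// unwinds through CallTargets; the exception object itself stays lightweight.
final class SketchPanic extends AbstractTruffleException {
  SketchPanic(String message) {
    super(message);
  }

  // Frames are materialized only when someone actually requests them.
  List<TruffleStackTraceElement> frames() {
    TruffleStackTrace.fillIn(this);               // forces capture; cheap if already filled
    return TruffleStackTrace.getStackTrace(this); // returns the recorded elements
  }
}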

Case for dataflow errors

The nice performance of panics is mostly due to 'lazy' adding of stack frames when we are unwinding the stack, looking for a try-catch clause.

Unfortunately, we cannot re-use this cool effect for Dataflow errors - because they do not propagate through stack unwinding, but are simply passed along as values. So most likely we cannot easily apply the same lazy optimization.

I guess a simple thing we still could do is to limit the depth of the stack trace - for many use cases, knowing only a few frames may still be enough to debug it.
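
If we went that route, one hypothetical way to cap the depth at the Truffle level is the stackTraceElementLimit parameter that AbstractTruffleException constructors accept. A rough, untested sketch - the class name and the limit are made up:

import com.oracle.truffle.api.exception.AbstractTruffleException;
import com.oracle.truffle.api.nodes.Node;

// Hypothetical sketch: a dataflow-error-like exception that asks Truffle to
// record at most FRAME_LIMIT stack trace elements instead of the full stack.
final class ShallowErrorSketch extends AbstractTruffleException {
  private static final int FRAME_LIMIT = 16; // arbitrary example cap

  ShallowErrorSketch(String message, Node location) {
    // (message, cause, stackTraceElementLimit, location)
    super(message, null, FRAME_LIMIT, location);
  }
}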

radeusgd commented 1 year ago

Issue with panics

When looking into this, I noticed:

Note that if the Throwable is caught, its stack trace should be filled eagerly with TruffleStackTrace.fillIn(Throwable), unless it can be guaranteed to be re-thrown in the same call target, or that the stack trace will not be used.

within the TruffleStackTrace docs.

I realised that following this advice would make Panics less efficient than they are now.

And indeed - we don't do what is described there. Is that ok?

Well, apparently not exactly. This actually confirmed something I had suspected for a long time. When working on Enso library tests, I've been sifting through tons of Enso Panic stack traces, and I sometimes felt the stack trace was a little bit off, but I never dug deeper into it and thought I was mistaken. Apparently I was right.

See this repro:

from Standard.Base import all

type My_Error
    Error

handler caught_panic =
    do_the_handling caught_panic

do_the_handling caught_panic =
    st = caught_panic.stack_trace
    st.each el->
        IO.println el.to_display_text

f1 = f2
f2 = f3
f3 =
    Panic.catch Any handler=handler <|
        f4

f4 = f5
f5 = Panic.throw My_Error.Error

main =
    f1

running it yields:

at Panic.throw (Unknown location)
at stacks.f5 (C:\NBO\repr\stacks.enso:21:6-31)
at stacks.f4 (C:\NBO\repr\stacks.enso:20:6-7)
at stacks.f3<arg-1> (C:\NBO\repr\stacks.enso:18:9-10)
at Panic.type.get_attached_stack_trace (Unknown location)
at Caught_Panic.stack_trace (built-distribution\enso-engine-0.0.0-dev-windows-amd64\enso-0.0.0-dev\lib\Standard\Base\0.0.0-dev\src\Panic.enso:233:9-43)
at stacks.do_the_handling (C:\NBO\repr\stacks.enso:10:10-33)
at stacks.handler (C:\NBO\repr\stacks.enso:7:5-32)
at Panic.catch (Unknown location)
at stacks.f3 (C:\NBO\repr\stacks.enso:17-18)
at stacks.f2 (C:\NBO\repr\stacks.enso:15:6-7)
at stacks.f1 (C:\NBO\repr\stacks.enso:14:6-7)
at stacks.main (C:\NBO\repr\stacks.enso:24:5-6)

This stacktrace is actually wrong. It should be:

at Panic.throw (Unknown location)
at stacks.f5 (C:\NBO\repr\stacks.enso:21:6-31)
at stacks.f4 (C:\NBO\repr\stacks.enso:20:6-7)
at stacks.f3<arg-1> (C:\NBO\repr\stacks.enso:18:9-10)
at Panic.catch (Unknown location)
at stacks.f3 (C:\NBO\repr\stacks.enso:17-18)
at stacks.f2 (C:\NBO\repr\stacks.enso:15:6-7)
at stacks.f1 (C:\NBO\repr\stacks.enso:14:6-7)
at stacks.main (C:\NBO\repr\stacks.enso:24:5-6)

So we have 4 additional frames:

at Panic.type.get_attached_stack_trace (Unknown location)
at Caught_Panic.stack_trace (built-distribution\enso-engine-0.0.0-dev-windows-amd64\enso-0.0.0-dev\lib\Standard\Base\0.0.0-dev\src\Panic.enso:233:9-43)
at stacks.do_the_handling (C:\NBO\repr\stacks.enso:10:10-33)
at stacks.handler (C:\NBO\repr\stacks.enso:7:5-32)

added in the middle that are not related to the 'throw' site of the panic, but are part of the panic handler.

Why?

Because when we catch a Panic, we do not fill in the stack trace (hence catching it is very cheap - that is the cool part). Truffle fills in the stack trace for us the first time we request it. So the stack trace gets filled in at the location where we request it, which may not be exactly the location where we caught it (that is the less cool part). Usually it is 'close' to the 'correct' location, so it is off by only a few frames - so the stack trace may still be useful.

However, it is technically incorrect.

Should we improve it?

I think it is not good to return invalid stack traces, even if they are off by just a few frames. Technically, if I save my caught_panic and pass it around, I can get a completely 'crazy' stack trace, not very related to the actual panic - the top few frames will come from the panic, but the filled in bottom can really be arbitrary.

Of course, filling it in on catch comes with a cost - catching panics would no longer be as fast; it would likely get closer to the current dataflow error performance. Well, this ticket was about making dataflow errors faster, not slowing down Panics to make things equal 😓

The issue only arises when we want to access the stack trace. Oftentimes, we catch the panic and actually ignore the stack trace - and that's when we expect the good performance.

My suggestion is to create two versions of Panic.catch: Panic.catch_with_stacktrace and Panic.catch_without_stacktrace (I guess one of these could still be named Panic.catch).

I think with such a solution we can still retain the good performance of catching panics and ensure that our stack traces are not misleading.
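
A rough Java sketch of the difference between the two variants at the catch site (the method names mirror the hypothetical Enso builtins above and are not real Enso or Truffle APIs; the handler plumbing is simplified to a Runnable): the "with" variant eagerly fills in the trace where the panic is caught, as the Truffle docs recommend, while the "without" variant keeps the current cheap behaviour.

import com.oracle.truffle.api.TruffleStackTrace;
import com.oracle.truffle.api.exception.AbstractTruffleException;

// Hypothetical sketch of the two proposed catch flavours.
final class CatchSketch {
  // Cheap: no frames are materialized; querying the trace later may pick up
  // frames from wherever the query happens (the problem shown above).
  static Object catchWithoutStacktrace(Runnable body) {
    try {
      body.run();
      return null;
    } catch (AbstractTruffleException panic) {
      return panic;
    }
  }

  // Correct traces: capture the frames at the catch site, paying the cost
  // only when the caller explicitly asked for a stack trace.
  static Object catchWithStacktrace(Runnable body) {
    try {
      body.run();
      return null;
    } catch (AbstractTruffleException panic) {
      TruffleStackTrace.fillIn(panic); // eager capture, as the Truffle docs recommend
      return panic;
    }
  }
}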

JaroslavTulach commented 12 months ago

We can fix the stacktrace (thanks for such a detailed investigation), but that's probably a task for another bug, not one titled Try to improve performance of Dataflow Errors.

radeusgd commented 12 months ago

We can fix the stacktrace (thanks for such a detailed investigation), but that's probably a task for another bug, not one titled Try to improve performance of Dataflow Errors.

Fair, I will create a separate ticket for this: #8153

JaroslavTulach commented 10 months ago

Here is a guaranteed way to speed benchmarks up:

enso$ git diff
diff --git engine/runtime/src/main/java/org/enso/interpreter/runtime/error/DataflowError.java engine/runtime/src/main/java/org/enso/interpreter/runtime/error/DataflowError.java
index 5ca69cda53..695a695076 100644
--- engine/runtime/src/main/java/org/enso/interpreter/runtime/error/DataflowError.java
+++ engine/runtime/src/main/java/org/enso/interpreter/runtime/error/DataflowError.java
@@ -41,7 +42,7 @@ public final class DataflowError extends AbstractTruffleException {
   public static DataflowError withoutTrace(Object payload, Node location) {
     assert payload != null;
     DataflowError result = new DataflowError(payload, location);
-    TruffleStackTrace.fillIn(result);
+    // TruffleStackTrace.fillIn(result);
     return result;
   }

Obviously, with such a change the benchmarks run lightning fast. DataflowError becomes just a value with a recorded location of origin (e.g. the top of the stack trace). With the following change

diff --git engine/runtime/src/main/java/org/enso/interpreter/node/expression/builtin/error/ThrowErrorNode.java engine/runtime/src/main/java/org/enso/interpreter/node/expression/builtin/error/ThrowErrorNode.java
index 12ebeb6d41..20db654b05 100644
--- engine/runtime/src/main/java/org/enso/interpreter/node/expression/builtin/error/ThrowErrorNode.java
+++ engine/runtime/src/main/java/org/enso/interpreter/node/expression/builtin/error/ThrowErrorNode.java
@@ -8,7 +8,9 @@ import org.enso.interpreter.runtime.error.DataflowError;
 @BuiltinMethod(
     type = "Error",
     name = "throw",
-    description = "Returns a new value error with given payload.")
+    description = "Returns a new value error with given payload.",
+    inlineable=true
+  )
 public class ThrowErrorNode extends Node {
   public Object execute(VirtualFrame giveMeAStackFrame, Object payload) {
     return DataflowError.withoutTrace(payload, this);

we make sure the real location of Error.throw is recorded (instead of some internal builtin location). As a result, the following x.enso program

from Standard.Base import all
import Standard.Base.Errors.Illegal_Argument.Illegal_Argument

deep n = if n <= 0 then Error.throw (Illegal_Argument.Error "Problem") else deep n-1

main =
    d = deep 10
    d.get_stack_trace_text

generates the following output:

at <enso> x.deep<arg-1>(x.enso:4:25-70)

I consider such output satisfactory to locate where the error is coming from. Am I (totally) wrong?

Btw. the Java method to construct the DataflowError is called withoutTrace - I believe that's what my change does, right? Constructing a DataflowError without a stack trace!

radeusgd commented 10 months ago

I consider such output satisfactory to locate where the error is coming from. Am I (totally) wrong?

  1. Unfortunately, it is often far from enough to locate the error.
     i. The error may be thrown within a helper method deeper down the call chain, so it may not tell us which particular operation failed.
     ii. Even more, we use an 'idiom' where we often catch a Java exception and 'convert' it into a dataflow error. Since the conversion logic for a given error is usually the same, we use helper functions to handle this catch-and-throw. Examples: Illegal_Argument.handle_java_exception, handle_sql_errors and tons more.
        • I think we still need to support this idiom; turning this abstraction into code duplication is not a path we could take.
        • So with that, most often the location of the error will be within some handle_ function - which really does not tell us anything useful at all.
  2. Debugging
     i. Obviously, knowing the full stack trace is invaluable for debugging. I cannot imagine a world where we do not have traces on dataflow errors - when debugging various tests it is just immensely helpful to be able to read the stack trace of where the error originated, with all the context of the full call chain. I don't think we can lose that and retain a good developer experience.
     ii. This may be less of a concern because: if we get some solution to (1), I guess we could enable some 'debug mode' flag to turn on full stack traces only during development (see the sketch after this list). But we do need some way to enable this, as otherwise working on library development would get significantly less efficient.
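
For the 'debug mode' idea in 2.ii, here is a rough sketch of how such a switch could look on the Java side - the system property name and the helper are made up for illustration, not an existing Enso flag:

import com.oracle.truffle.api.TruffleStackTrace;

// Hypothetical: gate eager stack trace capture behind a flag, so full traces
// are only paid for during development.
final class TraceModeSketch {
  private static final boolean FULL_TRACES =
      Boolean.getBoolean("enso.full.dataflow.traces"); // made-up system property

  static <T extends Throwable> T maybeFillIn(T error) {
    if (FULL_TRACES) {
      TruffleStackTrace.fillIn(error); // eager, expensive, full trace
    }
    return error; // otherwise keep only the cheap origin location
  }
}
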
enso-bot[bot] commented 10 months ago

Jaroslav Tulach reports a new STANDUP for yesterday (2023-12-20):

Progress: - DataflowError PR: https://github.com/enso-org/enso/pull/8608

Next Day: More on DataflowErrors

JaroslavTulach commented 10 months ago

often catch a Java exception and 'convert' it to a dataflow error.

This is done via DataflowError.withTrace - that one still inherits the top of the stack trace from the Panic and (slowly) records the rest of the stack trace (possibly with some limit).

enable some 'debug mode' flag to enable the full stack traces only during development

There is JAVA_OPTS=-ea now which could be used for such a purpose.

radeusgd commented 10 months ago

often catch a Java exception and 'convert' it to a dataflow error.

This is done via DataflowError.withTrace - that one still inherits the top of the stack trace from the Panic and (slowly) records the rest of the stack trace (possibly with some limit).

Yeah, although that is only used when we convert the Panic instance into a dataflow error (we catch it and change how it flows through the computation) while it still remains a Java interop instance. What we do most of the time is rethrow a new Enso Atom instance as a dataflow error - which is a more user-oriented value. I guess we'd need to attach the original Java error as the cause to be able to access its stack trace.
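
A rough sketch of that last idea - keeping the original Java exception reachable as a cause on the converted, user-facing value so its trace can still be recovered on demand. ConvertedErrorSketch is a made-up name, not Enso's actual DataflowError:

import com.oracle.truffle.api.TruffleStackTrace;
import com.oracle.truffle.api.TruffleStackTraceElement;
import java.util.List;

// Hypothetical wrapper: the user sees the Enso-style payload, while the
// original exception is retained only as a source of the stack trace.
final class ConvertedErrorSketch {
  private final Object payload;   // user-oriented value (the Enso Atom in practice)
  private final Throwable cause;  // original Java exception, kept for its trace

  ConvertedErrorSketch(Object payload, Throwable cause) {
    this.payload = payload;
    this.cause = cause;
  }

  Object payload() {
    return payload;
  }

  List<TruffleStackTraceElement> originalTrace() {
    TruffleStackTrace.fillIn(cause);              // capture if still lazy
    return TruffleStackTrace.getStackTrace(cause);
  }
}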

enable some 'debug mode' flag to enable the full stack traces only during development

There is JAVA_OPTS=-ea now which could be used for such a purpose.

I think we should not reuse -ea, which carries so many other implications. Why not use a separate env var? Although I guess -ea could be OK too.

enso-bot[bot] commented 10 months ago

Jaroslav Tulach reports a new STANDUP for yesterday (2023-12-21):

Progress: - Analyzing DataflowError pros & cons: https://github.com/enso-org/enso/pull/8608#issuecomment-1865442556

Next Day: Polish DataflowErrors PR for integration

JaroslavTulach commented 10 months ago

As of https://github.com/enso-org/enso/pull/8608/commits/891b08d766eacf7e3b043c57d62a9aabd9e619e4, the benchmark run indicates that the Panics_And_Errors_10000_Dataflow_Error test (created by pull request #8130) has been sped up significantly:

7000x speedup

Previously it took 329 ms; now it runs in 0.045 ms (329 / 0.045 ≈ 7300x).

enso-bot[bot] commented 10 months ago

Jaroslav Tulach reports a new STANDUP for yesterday (2023-12-22):

Progress: - Analyzing CI failures: https://github.com/enso-org/enso/pull/8608#issuecomment-1867227763

Next Day: Polish DataflowErrors PR for integration

enso-bot[bot] commented 10 months ago

Jaroslav Tulach reports a new STANDUP for yesterday (2023-12-23):

Progress: - addressing comments: https://github.com/enso-org/enso/pull/8608/files#r1435496429

Next Day: Polish DataflowErrors PR for integration

enso-bot[bot] commented 10 months ago

Jaroslav Tulach reports a new STANDUP for yesterday (2023-12-24):

Progress: - testing mixed stacktrace: https://github.com/enso-org/enso/pull/8608/commits/f943a2b8d305f7ff3ba15b39ea17e1be2e471a5c

Next Day: Bugfix something

enso-bot[bot] commented 10 months ago

Jaroslav Tulach reports a new STANDUP for yesterday (2023-12-26):

Progress: - running vitest in Graal.js mode. It should be finished by 2023-12-27.

Next Day: Get vitest running