Make sure long running Table operations are interruptible

enso-org / enso

Enso Analytics is a self-service data prep and analysis platform designed for data teams.

https://ensoanalytics.com

Apache License 2.0

7.37k stars 323 forks source link

Make sure long running Table operations are interruptible #7129

Closed radeusgd closed 1 year ago

radeusgd commented 1 year ago

Our engine uses safepoints to interrupt running threads when an existing job has to be cancelled.

These safepoints are polled by our Truffle interpreter around method calls allowing easy interruption of Enso code. However, once we enter Polyglot Java code - these invocations were not interruptible.

This is especially problematic as these interrupts are used to cancel execution when some parameters of a computation have been changed and it needs to be recomputed. If the computation is not interruptible it has to be finished first which is a waste of time and resources.

'Ironically' many of the most intensive and possibly long running computations are written in Java (for performance) - that includes many Column operations, joins, current implementation of aggregates etc. This means these likely long running computations cannot be interrupted.

Apparently this can be solved quite easily - Context::safepoint can be called periodically in the polyglot Java helpers to poll for the safepoints and allow the interrupts.

I think we should add a call to Context::safepoint every few iterations in our long running operations. This should help with making the IDE more responsive.

[x] Create a scenario to check the interrupts (e.g. a Java method that runs for a few minutes, just sleeping). i. Try running it as-is and verify that once we change the inputs, it still waits until finish - wasting time and resources. ii. Extend it with periodic calls to safepoint, possibly printing something to logs to make it clear it works. iii. Run it again and verify that now it is interrupted immediately.
[ ] Adapt our long running operations in Table - map, zip, join, aggregate etc. to call safepoint.
- We can try calling it on every iteration. If we realise that it is too slow, we can instead call it every 10 or 100 iterations.
- We should go through the Table library and add the safepoints in all long running computations. This will include also: file reading (whether we should interrupt writing is a separate question - there for data integrity it may be better to keep it not-interruptible);
- essentially every for loop that does not have a constant / usually small bound (i.e. probably searching for columns does not necessary need to be interruptible as the column count should be relatively small - although maybe it is worth making it as well...) but definitely loops bound by row count which may be large.
[x] Check some benchmarks to see if there is any noticeable difference. This can help us tell if we should run it on every iteration or every 10 to decrease any potential impact (ideally it should be negligible).

jdunkerley commented 1 year ago

Related to #6917 stopping as well.

radeusgd commented 1 year ago

I've created a proof of concept to verify if this idea will work in practice.

Here's a video:

https://github.com/enso-org/enso/assets/1436948/66a15e40-edf7-449f-ab86-364d3043b3c4

You can see that when I'm switching the modulus setting (3, 5, 7, 13) - the node just below it (computed before the "long computation") refreshes very quickly in enso and java-safepoint mode. The node with the computation gets updated after 10-20s (that's the length of the "long computation"), but the node before updates almost immediately - even if the long computation from the previous setting did not finish, it is correctly interrupted.

To compare, when I switch to java-raw (no safepoints), the node just below the modulus setting does not get updated as quickly - we can see that its update waits until the scheduled "long computation" finishes - which takes some time. Showing that this kind of Java computation is not interruptible.

So the video above proves the viability of this change. Indeed, long running Java computations without safepoints are reducing the interactivity of the IDE - 'polluting' the run queue and delaying updates of other nodes, even if the result of such computation is no longer valid.

If we add safepoints, the Java computations become as interruptible as Enso ones and thus allow for more interactivity - trying to execute on latest data as soon as possible, without waiting for obsolete ones.

radeusgd commented 1 year ago

The code related to the demo: Interupts.zip

from Standard.Base import all
from Standard.Table import all
from Standard.Database import all
from Standard.AWS import all
import Standard.Visualization

polyglot java import foo.Foo

long_compute_pure_enso n setting =
    (0.up_to n).fold 0 (x -> y-> (x + y) % setting)

long_computation computation_type n setting order_enforcer = 
    # order enforcer makes sure that computation starts _after_ the "before" message finished
    order_enforcer.if_not_error <| case computation_type of
        "enso" ->
            long_compute_pure_enso n setting
        "java-safepoint" ->
            Foo.runComputationInterruptible n setting
        "java-raw" ->
            Foo.runComputation n setting

main =
    settings = [3, 5, 7, 13]
    settings_row = Table.from_rows (settings.map .to_text) [settings] . first_row
    setting = settings_row.at '5'

    computation_types = ["enso", "java-safepoint", "java-raw"]
    computation_row = Table.from_rows computation_types [computation_types] . first_row
    computation_type = computation_row.at 'java-raw'

    n_row = Table.from_rows ["10K", "1M", "200M", "1B"] [[(10^4), (10^6), (2 * 10^8), (10^9)]] . first_row
    n = n_row.at '1B'

    before = "Before long: "+setting.to_text
    result = Main.long_computation computation_type n setting before
    after = "(" + result.to_text + ") " + "After long: " + setting.to_text

package foo;

import org.graalvm.polyglot.Value;
import org.graalvm.polyglot.Context;

public class Foo {
    public static long runComputation(long n, long setting) {
        long acc = 0;
        for (int i = 0; i < n; ++i) {
            acc = (acc + i) % setting;
        }
        return acc;
    }

    public static long runComputationInterruptible(long n, long setting) {
        Context ctx = Context.getCurrent();
        long acc = 0;
        for (int i = 0; i < n; ++i) {
            acc = (acc + i) % setting;
            ctx.safepoint();
        }
        return acc;
    }
}

radeusgd commented 1 year ago

I also wanted to do a simple performance comparison to see what is the cost of adding the safepoints:

from Standard.Base import all

from Standard.Test import Bench

polyglot java import foo.Foo

long_compute_pure_enso n setting =
    (0.up_to n).fold 0 (x -> y-> (x + y) % setting)

long_computation computation_type n = 
    mod = 1997
    case computation_type of
        "enso" ->
            long_compute_pure_enso n mod
        "java-safepoint" ->
            Foo.runComputationInterruptible n mod
        "java-raw" ->
            Foo.runComputation n mod

main =
    n = 10^7
    iter_size = 20
    num_iters = 20

    IO.println "Measuring Java Raw"
    Bench.measure (long_computation "java-raw" n) "java-raw" iter_size num_iters

    IO.println "Measuring Java Safepoint"
    Bench.measure (long_computation "java-safepoint" n) "java-safepoint" iter_size num_iters

    IO.println "Measuring Enso"
    Bench.measure (long_computation "enso" n) "enso" iter_size num_iters

(Enso is just as a general reference)

The results are:

Measuring Java Raw
java-raw/iteration:0: 49.55ms
java-raw/iteration:1: 50.15ms
java-raw/iteration:2: 48.99ms
java-raw/iteration:3: 49.17ms
java-raw/iteration:4: 47.62ms
java-raw/iteration:5: 49.33ms
java-raw/iteration:6: 49.11ms
java-raw/iteration:7: 49.11ms
java-raw/iteration:8: 49.77ms
java-raw/iteration:9: 50.14ms
java-raw/iteration:10: 49.31ms
java-raw/iteration:11: 49.27ms
java-raw/iteration:12: 49.02ms
java-raw/iteration:13: 48.86ms
java-raw/iteration:14: 48.04ms
java-raw/iteration:15: 47.82ms
java-raw/iteration:16: 49.23ms
java-raw/iteration:17: 49.19ms
java-raw/iteration:18: 48.89ms
java-raw/iteration:19: 48.99ms
java-raw average: 49.08ms
Measuring Java Safepoint
java-safepoint/iteration:0: 51.5ms
java-safepoint/iteration:1: 49.37ms
java-safepoint/iteration:2: 48.7ms
java-safepoint/iteration:3: 48.6ms
java-safepoint/iteration:4: 49.08ms
java-safepoint/iteration:5: 48.36ms
java-safepoint/iteration:6: 48.8ms
java-safepoint/iteration:7: 50.26ms
java-safepoint/iteration:8: 48.65ms
java-safepoint/iteration:9: 49.23ms
java-safepoint/iteration:10: 49.26ms
java-safepoint/iteration:11: 54.08ms
java-safepoint/iteration:12: 55.5ms
java-safepoint/iteration:13: 55.58ms
java-safepoint/iteration:14: 64.02ms
java-safepoint/iteration:15: 89.76ms
java-safepoint/iteration:16: 85.49ms
java-safepoint/iteration:17: 77.77ms
java-safepoint/iteration:18: 84.21ms
java-safepoint/iteration:19: 94.28ms
java-safepoint average: 60.13ms
Measuring Enso
enso/iteration:0: 284.41ms
enso/iteration:1: 160.03ms
enso/iteration:2: 149.88ms
enso/iteration:3: 151ms
enso/iteration:4: 149.97ms
enso/iteration:5: 150.05ms
enso/iteration:6: 150.22ms
enso/iteration:7: 146.64ms
enso/iteration:8: 147.49ms
enso/iteration:9: 146.74ms
enso/iteration:10: 147.9ms
enso/iteration:11: 149ms
enso/iteration:12: 147.29ms
enso/iteration:13: 150.05ms
enso/iteration:14: 152.96ms
enso/iteration:15: 148.29ms
enso/iteration:16: 148.81ms
enso/iteration:17: 149.07ms
enso/iteration:18: 148.43ms
enso/iteration:19: 150.12ms
enso average: 156.42ms

Interestingly, the cost of safepoints seems to be negligible in the first 10 iterations, within the measurement precision.

But then a weird thing happens - the times have increased significantly. However I suspect it may be my Windows machine launching some background task or a cooling thing and not necessarily an issue with safepoints. I will run better measurements during dinner.

radeusgd commented 1 year ago

I've ran some more and longer iterations and the results are clearly showing that the cost of safepoints in the Java code seems to be negligible:

Safepoints	Average [ms]	Standard Deviation
No	437.4968	3.1423
Yes	435.7168	3.1194

Actually, from the measurements it seems like the code with the added safepoint was slightly faster (!) - but that is probably mostly due to the precision of the measurement.

cc: @JaroslavTulach

JaroslavTulach commented 1 year ago

Spreading Context.getCurrent().safepoint() in unbounded Java loops is clearly beneficial. Cost is negligible (if any) and IDE becomes more responsive.

How do we want to measure success of such Context.getCurrent().safepoint() additions? Just a visual inspection or an automatic reporting of problems when a computation is requested to interrupt, but it doesn't stop in 100ms, 1s, 5s?

radeusgd commented 1 year ago

Spreading Context.getCurrent().safepoint() in unbounded Java loops is clearly beneficial. Cost is negligible (if any) and IDE becomes more responsive.

Indeed - I'm adding a PR that tries to add these wherever it makes sense.

I'm however wondering if there are some better solutions we can use?

Firstly, inserting the safepoint everywhere is something we can do as library developers, but it seems harder to enforce for external contributors creating Enso libraries that also rely on Java. Additionally, we still cannot add these safepoints in code that relies on external libraries (e.g. blocking operations in JDBC, processing a big matrix in OpenCV).

Moreover, there are many operations that rely on the standard Java library which also will run in 'unbounded' time but we cannot insert safepoints there (e.g. System.arraycopy, Stream.map ... toList etc.). We can try to move our code style in a direction where we don't use these as much, but I'm not sure how viable such an approach would be.

Of course I think this is still a good solution in the short/mid term and so I'm very glad to implement it - the better interactivity will surely be worth it.

However, I'm wondering for the longer term if we can figure something better. @JaroslavTulach I know that you've been recently thinking around stuff like running visualizations in parallel using virtual threads (once they will be available to us). Do you think we could use some of these similar approaches to make our scheduled computations interruptible even without safepoints? (So that our users writing polyglot code would not have to care about it + we could more easily use these blocking libraries like OpenCV or JDBC).

Or maybe we need some hybrid approach, where the main computation relies on safepoints to interrupt it and any blocking/long-running native computation that cannot be interrupted, should be scheduled on some separate 'worker threads'. Then the main computation can be interrupted and re-run, ensuring IDE interactivity; while the long running one would just finish in background thread (sadly if its result is no longer needed, but I think if we cannot cancel the computation, it's still better to at least ensure it's not blocking the IDE).

I'll gladly hear your thoughts on this. cc: @JaroslavTulach @4e6 @hubertp as I think you may be interested in this.

How do we want to measure success of such Context.getCurrent().safepoint() additions? Just a visual inspection or an automatic reporting of problems when a computation is requested to interrupt, but it doesn't stop in 100ms, 1s, 5s?

Very good question! I didn't think of anything better than visual inspection for now, so that's what I did. I think your idea is very good! Ideally, the language server should report whenever an operation took more than X to be interrupted (not sure what good X is, we should tweak it, I'd aim for 100ms-1s). Ideally, if it could get a stacktrace from the computation showing us where it is 'stuck' that would be even better. @4e6 do you think it would be possible to implement such a check in the Language Server? I think it would help us tremendously to make sure the IDE is not stuck on computations where we potentially forgot interrupts and also analyse situations where we cannot add them easily and how much they impact the UX.

enso-bot[bot] commented 1 year ago

Radosław Waśko reports a new STANDUP for the provided date (2023-06-30):

Progress: Created a proof of concept test project to see if my hypothesis works. Verified that safepoints help make Java interop interruptible. Created a simple benchmark to verify the overhead (there seems to be none). Started adding the safepoints throughout the libraries. It should be finished by 2023-07-04.

Next Day: Next day I will be working on the same task. Go through remaining parts of libraries, prepare a PR.

radeusgd commented 1 year ago

Further performance measurements in PR comments.

JaroslavTulach commented 1 year ago

the language server should report whenever an operation took more than X to be interrupted (not sure what good X is, we should tweak it, I'd aim for 100ms-1s). Ideally, if it could get a stacktrace from the computation showing us where it is 'stuck' that would be even better.

The idea is based on FitnessViaPostMortem as the situation is basically the same. Not only we can get a stacktrace, but we start self-profiling and give you an .npss file to analyze in VisualVM. Let's plan to do this @hubertp, @4e6!

JaroslavTulach commented 1 year ago

Do you think we could use some of these similar approaches to make our scheduled computations interruptible even without safepoints?

Standard Java way to handle interrupting is:

Thread.interrupt
Thread.interrupted() check

The benefit is that every Java library can check the thread interrupted status. The problem is that most libraries don't do it. Worse, when they do it, they don't do it properly. NetBeans added support for interrupting executed tasks long time ago - however we soon realized that using it with unknown code often leads to half-written corrupted files and other goodies related to a work being stopped in the middle.

could use some of these similar approaches to make our scheduled computations interruptible even without safepoints

I doubt there is any magical solution. We need to get a callback:

Context.safepoint is a callback
Thread.interrupted() directly or via network or disk operation is a callback
splitting the work to small chunks (aka actors) guarantees callbacks in between those chunks
progress API not only improves UI, but also forces libraries to make callbacks
running libraries code by Espresso would allow us to manage the libraries code more and give us safepoints spread by the Espresso interpreter by default

Any callback would do. One just has to call it whenever iterating over unbounded data. Open design question is to make such callback natural, so it happens when needed.

radeusgd commented 1 year ago

Any callback would do. One just has to call it whenever iterating over unbounded data. Open design question is to make such callback natural, so it happens when needed.

What about operations that are blocking 'by nature'? i.e. waiting for a reply in JDBC driver OR executing a native library (e.g. OpenCV)? I assume we cannot easily get callbacks in there. Shall we ran these on some separate worker threads?

enso-bot[bot] commented 1 year ago

Radosław Waśko reports a new STANDUP for yesterday (2023-07-03):

Progress: Sifted through rest of libs and added safepoints wherever deemed sensible. Tested after the changes - one scenario was breaking. Debugged it and did a quickfix. It should be finished by 2023-07-04.

Next Day: Next day I will be working on the same task. Implement proper fix (safepoints only when the ReportingStreamDecoder is run on an Enso thread, and not if in the background); refactor DelimitedReader to better care about parser thread (that I did not realize in the past existed 🙃 ) ; test safepoints timings using SafepointALot. Check a PoC for a background thread. Start next task.

enso-bot[bot] commented 1 year ago

Jaroslav Tulach reports a new STANDUP for yesterday (2023-07-04):

Progress: - Discussing Context.safepoint with Radek

At https://github.com/enso-org/enso/issues/7129#issuecomment-1619543581
addressing Radek's review in onboarding PR:
At https://github.com/enso-org/enso/pull/7181 It should be finished by 2023-07-04.

Next Day: More vacationing

JaroslavTulach commented 1 year ago

What about operations that are blocking 'by nature'? i.e. waiting for a reply in JDBC driver OR executing a native library (e.g. OpenCV)? I assume we cannot easily get callbacks in there.

Checking Thread.interrupted() state shall work from network layer (yielding InterruptedIOException as well as with properly written native code.

Shall we ran these on some separate worker threads?

Maybe. I am not able to give a generic reply. I'd prefer to evaluate situation on case-by-case basis.

enso-bot[bot] commented 1 year ago

Radosław Waśko reports a new STANDUP for the provided date (2023-07-04):

Progress: Adapted ReportingStreamDecoder and documented it. Refactored DelimitedReader to better manage the background reader thread. Some safepoint measurements using SafePointALot (did not get much useful results, would need to spend more time on tailored benchmarks which does not seem valuable at this time). It should be finished by 2023-07-04.

Next Day: Next day I will be working on the #7192 task. Do some small tasks.