enso-org / enso

Hybrid visual and textual functional programming.
https://enso.org
Apache License 2.0

StackOverflow when multiple Managed Resources are being cleaned up at the same time #10211

Open radeusgd opened 4 weeks ago

radeusgd commented 4 weeks ago

Try running the following script:

from Standard.Base import all
import Standard.Base.Runtime.Managed_Resource.Managed_Resource
import Standard.Base.Runtime.Ref.Ref

type My_Resource
    Value counter:Ref

    close self =
        self.counter.modify (x-> x-1)
        Nothing

    allocate counter:Ref =
        counter.modify (+1)
        Managed_Resource.register (My_Resource.Value counter) close_resource

close_resource resource = resource.close

repeat_cleanup_until_done counter =
    go i =
        if counter.get == 0 then Nothing else
            if i % 100 == 0 then
                IO.println "Still "+counter.get.to_text+" resources to clean up..."
            Runtime.gc
            @Tail_Call go i+1
    go 1

main =
    n = 10
    counter = Ref.new 0
    IO.println "Allocating resources..."
    0.up_to n . each _->
        My_Resource.allocate counter

    IO.println "Cleaning up..."
    repeat_cleanup_until_done counter
    IO.println "All cleaned up! "+counter.get.to_text

With n = 10 it will happily allocate and then clean up resources:

Allocating resources...
Cleaning up...
All cleaned up! 0

Now, try changing n to 10000:

-    n = 10
+    n = 10000

and running it again.

I'm consistently getting a StackOverflow failure:

Allocating resources...
Cleaning up...
Execution finished with an error: Resource exhausted: Stack overflow

radeusgd commented 4 weeks ago

As mentioned on Discord, I think the problem is that a resource's finalizer runs Enso code, which polls safepoints. At such a safepoint, another finalizer can be scheduled to run. If many finalizers are pending, each one runs inside the previous one, creating a cascade of nested finalizers that grows the stack until it overflows.

Instead, we should ensure that only one finalizer runs at a time. The finalizer's code should probably still poll safepoints (for all their other purposes), but while a finalizer is executing, no other finalizer should start inside it; instead, it should be enqueued and run once the first finalizer finishes.
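
To illustrate the idea, here is a minimal single-threaded sketch in Java. It is not the Enso runtime's actual code: `FinalizerQueue`, `pollSafepoint`, and `simulate` are hypothetical names, and the "safepoint" is simulated by a plain method call. It only shows how a re-entrancy guard keeps the finalizer nesting depth at 1 instead of letting it grow with the number of pending finalizers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model of pending finalizers; not the real Enso resource manager. */
final class FinalizerQueue {
    private final Deque<Runnable> pending = new ArrayDeque<>();
    private final boolean allowNesting; // true models the behavior suspected in this issue
    private boolean running = false;    // true while some finalizer is executing
    private int depth = 0;              // current finalizer nesting depth
    private int maxDepth = 0;           // deepest nesting observed

    FinalizerQueue(boolean allowNesting) {
        this.allowNesting = allowNesting;
    }

    void schedule(Runnable finalizer) {
        pending.add(finalizer);
    }

    /** Stand-in for a safepoint poll: runs pending finalizers if that is allowed here. */
    void pollSafepoint() {
        if (running && !allowNesting) {
            // A finalizer is already active: leave the rest queued.
            // The outermost loop below drains them once it finishes.
            return;
        }
        boolean wasRunning = running;
        running = true;
        try {
            Runnable next;
            while ((next = pending.poll()) != null) {
                depth++;
                maxDepth = Math.max(maxDepth, depth);
                try {
                    next.run();
                } finally {
                    depth--;
                }
            }
        } finally {
            running = wasRunning;
        }
    }

    int maxDepth() {
        return maxDepth;
    }

    /** Schedules n finalizers that each poll a safepoint; returns the deepest nesting seen. */
    static int simulate(boolean allowNesting, int n) {
        FinalizerQueue queue = new FinalizerQueue(allowNesting);
        int[] counter = {n};
        for (int i = 0; i < n; i++) {
            queue.schedule(() -> {
                counter[0]--;          // the "close" action of the resource
                queue.pollSafepoint(); // finalizer code reaches a safepoint
            });
        }
        queue.pollSafepoint();
        return queue.maxDepth();
    }
}
```

In this model, `FinalizerQueue.simulate(true, 100)` reaches a nesting depth of 100, while `FinalizerQueue.simulate(false, 100)` stays at depth 1. A real fix would also have to deal with threads and with how the runtime actually schedules finalizers, which this sketch ignores.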

radeusgd commented 4 weeks ago

Sometimes, instead of the log above, I get the following error:

Allocating resources...
Cleaning up...
Execution finished with an error: java.lang.NoClassDefFoundError: Could not initialize class com.oracle.truffle.api.interop.InteropLibraryGen$Default$Uncached
        at <java> org.graalvm.truffle/com.oracle.truffle.api.interop.InteropLibraryGen$Default.createUncached(InteropLibraryGen.java:555)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.interop.InteropLibraryGen$Default.createUncached(InteropLibraryGen.java:546)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.library.LibraryFactory.getUncachedSlowPath(LibraryFactory.java:413)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.library.LibraryFactory.getUncached(LibraryFactory.java:404)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.interop.InteropLibraryGen$UncachedDispatch.isException(InteropLibraryGen.java:7594)
        at <java> org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotThreadLocalActions$AbstractTLHandshake.accept(PolyglotThreadLocalActions.java:609)
        at <java> org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotThreadLocalActions$AbstractTLHandshake.accept(PolyglotThreadLocalActions.java:546)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.impl.ThreadLocalHandshake$Handshake.perform(ThreadLocalHandshake.java:219)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.impl.ThreadLocalHandshake$TruffleSafepointImpl.processHandshakes(ThreadLocalHandshake.java:368)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.impl.ThreadLocalHandshake.processHandshake(ThreadLocalHandshake.java:159)
        at <java> org.graalvm.truffle.runtime/com.oracle.truffle.runtime.hotspot.HotSpotThreadLocalHandshake.poll(HotSpotThreadLocalHandshake.java:79)
        at <java> org.graalvm.truffle/com.oracle.truffle.api.TruffleSafepoint.poll(TruffleSafepoint.java:155)
        at <java> org.enso.runtime/org.enso.interpreter.node.ClosureRootNode.execute(ClosureRootNode.java:83)
        at <enso> resource-overflow.repeat_cleanup_until_done.go<arg-0>(Internal)
        at <enso> resource-overflow.repeat_cleanup_until_done.go<arg-2>(resource-overflow.enso:24:13-29)
        at <enso> resource-overflow.repeat_cleanup_until_done.go(resource-overflow.enso:20-24)
        at <enso> resource-overflow.main(resource-overflow.enso:35:5-37)

Akirathan commented 3 weeks ago

Possibly related: a StackOverflowError in Table_Tests in https://github.com/enso-org/enso/actions/runs/9451385966/job/26032192518?pr=10192#step:7:1712. I could not reproduce that one locally.