Taking Structured Concurrency Seriously

In Julia 1.3, the thread-safe runtime for truly parallel tasks has arrived, which will greatly increase interest in concurrent computation within Julia's core numerical and technical computing community. For the moment, Threads.@spawn is experimental, but now is a good time to think about APIs for expressing concurrent computation in a safe and composable way, with the aim of using such an API as an ecosystem gets built on top of the new functionality.

There's been a lot of recent excitement and hard work on libraries to support so-called structured concurrency, with the claim that this is fundamentally a better model for composable concurrent computation.

I'm convinced that people are onto something here and that the overall ideas are an excellent fit for the Julia language. However, there's a lot of prior art to absorb so I have started a document with the aim to do two things:

Survey existing ideas and implementations
Explore how these ideas fit within the Julia language and runtime.

Click here to read the WIP document

Meta note

I'm using this julep as a guinea pig to try out a go-proposals style workflow, as discussed at https://github.com/JuliaLang/julia/issues/33239. So let's try discussing this here (non-substantive procedural comments should go to the Juleps repo).

I don't consider it entirely accurate to say that continuations are an abstraction violation.

I agree. But the problem is not Task, it's schedule. Likewise, I don't think closures are a problem because they don't get executed by themselves. The caller always controls the execution and can see the end of it.

As has been discussed here, passing around nurseries gives you basically unrestricted dynamic task spawning again.

I don't think it's unrestricted. You can always reason about the scope (= nursery) of the task lexically because you either see @sync or the nursery as an argument.
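
As a minimal sketch of what "lexically visible scope" means with today's tools: the hypothetical function `squares` below starts all of its child tasks inside one `@sync` block, so a reader can see where their lifetimes end. The function name and the `sleep` stand-in for real work are assumptions for illustration only.

```julia
# Every child task is started inside the `@sync` block, so its lifetime is
# bounded by that block: `@sync` does not return until all of them finish.
function squares(inputs)
    results = Vector{Int}(undef, length(inputs))
    @sync for (i, x) in enumerate(inputs)
        @async begin
            sleep(0.01 * x)      # stand-in for real work
            results[i] = x^2
        end
    end
    return results               # all children have finished here
end

squares([3, 1, 2])
```

Nothing here enforces the scoping: a bare `@async` outside any `@sync` would still escape, which is the gap an explicit nursery argument is meant to close.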

But of course, you might never reach a cancel point, and even if you do the exception might be caught.

Yes, I agree it's unavoidable and especially hard to do in Julia because nobody wants random cancellation points to be inserted in their carefully written tight loops. Though I think it's possible to reduce some pitfalls. For example, it's possible to make the cancellation exception harder to catch accidentally by something like "chain of custody" #7026 coupled with a better exception hierarchy.

For example, it's possible to make the cancellation exception harder to catch accidentally

That's an interesting idea --- we could have those exceptions run all finally blocks but always propagate to the end of a Task.

That's an interesting idea --- we could have those exceptions run all finally blocks but always propagate to the end of a Task.

Yeah, agreed, that is an interesting idea. On one hand, I'm inclined to say this feels like overkill, because people are already used to this kind of thing for things like InterruptException, but on the other hand, there are plenty of places that just check e isa InterruptException && rethrow(), which would now block the task cancellation, which would be a shame. So yeah, I see the appeal of skipping the catch blocks for a task cancellation. 👍 And/or having a special kind of exception you can throw which is able to skip catch blocks (similar to how exit() certainly skips catch blocks 😛).
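
To make the concern concrete, here is a small sketch of that handler pattern swallowing a cancellation. `Cancelled` and `run_quietly` are hypothetical names invented for this example; Julia has no built-in cancellation exception.

```julia
struct Cancelled <: Exception end   # hypothetical cancellation exception

# A common "log and carry on" wrapper: only InterruptException is rethrown.
function run_quietly(f)
    try
        return f()
    catch e
        e isa InterruptException && rethrow()
        return :swallowed        # every other exception stops here
    end
end

# The cancellation request never reaches the caller:
run_quietly(() -> throw(Cancelled()))
```

Because the handler only knows about InterruptException, any newly introduced cancellation exception is treated like an ordinary error unless every such handler is updated.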

Just chiming in here to say I agree with @tkf's assessment! When you and Jameson and I discussed task cancellation, I also remarked that I think simply the ability to "cancel" a task such that it will throw an exception at the next yield-point would be enough to implement structured concurrency as discussed here. Basically everything else we already have.

This is particularly nice because when discussing task cancellation, @vtjnash's main concern was that you might cancel a task you directly spawned, say A, but not realizing that A has spawned B, and B continues running. He instead suggested that you want to close some resource which is threaded through from A to B and all tasks thereafter. But we noted that sometimes the resource in question is simply the CPU, and there is nothing else obvious to close.

The nice thing about structured concurrency here, I guess, is that you can set things up such that the task will also cancel all the tasks it's waiting on, so you can indeed close the whole set of spawned computation. This seems like a nice solution to the above problem, and I think the only thing missing before we can have this is the ability to safely throw an exception onto a Task. :)

@NHDaly Hi, good to have a CSP specialist chiming in :)

you can set things up such that the task will also cancel all the tasks it's waiting on

Yeah, I agree that it's a nice property that this is the default behavior. But I'd like to add that it doesn't stop you from writing more complex handling of tasks as you can pass around multiple nurseries, if you really need to do it. It's also nice that this complexity is apparent in the source code as all the nurseries are always lexically visible.

By the way, the idea of using "narrower" catch and combining it with clever exception hierarchy is simply a clone of how it works in Python (and trio.Cancelled):

BaseException
 +-- SystemExit           # thrown by `exit()`
 +-- KeyboardInterrupt    # thrown by SIGINT
 +-- trio.Cancelled       # canceling tasks
 +-- ...
 +-- Exception            # "user errors" inherits Exception
      +-- NameError
      +-- ...
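
A rough Julia analogue of the hierarchy above might look like the sketch below. All type and function names (`UserError`, `BadInput`, `TaskCancelled`, `with_default`) are hypothetical. Since Julia's `catch` cannot filter by type, the "narrower catch" has to be written as an explicit rethrow. Note that existing Base errors such as ArgumentError subtype Exception directly, so this only illustrates the idea rather than something that works with today's hierarchy.

```julia
abstract type UserError <: Exception end   # hypothetical: "ordinary" errors

struct BadInput <: UserError
    msg::String
end

struct TaskCancelled <: Exception end      # hypothetical: deliberately NOT a UserError

# A "narrow" handler: recover from user errors, let everything else propagate.
function with_default(f, default)
    try
        return f()
    catch e
        e isa UserError || rethrow()
        return default
    end
end

with_default(() -> throw(BadInput("oops")), 0)   # recovers and returns the default
```

With this shape, a handler has to opt in to catching cancellation instead of catching it by accident.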

Also, I just noticed that there is a similar discussion in #15514. I think it's a reasonable solution even for InterruptException and exit() as well but @Keno seems to have another idea https://github.com/JuliaLang/julia/issues/35524#issuecomment-616687657?

For Ctrl-C handling in structured concurrency, the Trio forum thread "Structured Concurrency Kickoff" seems to be an interesting discussion (it starts as a general discussion but later focuses on Ctrl-C, from what I can tell by skimming it). (Edit: not much new info, I think.)

@NHDaly Hi, good to have a CSP specialist chiming in :)

😛 Lol, hardly! I've just been reading up on all this within the past year and a half or so as I've been learning more about concurrency and multithreading in Julia. :) You might be thinking that because of https://github.com/NHDaly/CspExamples.jl, but that was something I wrote to learn CSP and to learn about Julia's concurrency, not because I was already an expert :)

Thanks for the links + context!

Hey guys sorry to be silent here for a while.

Yes, I agree it's unavoidable and especially hard to do in Julia because nobody wants random cancellation points to be inserted in their carefully written tight loops.

Agreed, although efficiency is just one aspect. The more difficult part is that inserting cancel points can break the programmer's expectations in terms of resource handling. I think I was harping on about this recently with the fairly trivial example of Base.lock() not being async signal-safe for InterruptException (ah yes, here it is: https://github.com/JuliaLang/julia/issues/35524#issuecomment-625538029). We can obviously fix Base, but I think this emphasizes that writing code which is safe for use with InterruptException is currently subtle.

There seem to be some options for cancel points:

They could be restricted to certain function calls where the programmer is already thinking about error handling (trio uses IO for this)
We could have language syntax + semantics which makes resource acquisition and cleanup atomic with respect to cancellation. For this to succeed, it would have to be sufficiently nice that users write cancel-safe code by default.
Rely on users inserting explicit cancel points into their code
Make cancellation a property of resources which are being used by the code, rather than of the code itself. (I include this because Jameson has mentioned it on multiple occasions, though I don't fully understand how it would work.)
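
As an illustration of the third option (explicit cancel points), here is a minimal sketch. `CancelToken`, `cancel!`, `checkpoint`, `CancelledException` and `busy_sum` are all hypothetical names; nothing like them exists in Base. The compute loop decides for itself how often it is willing to be cancelled.

```julia
struct CancelledException <: Exception end      # hypothetical

struct CancelToken                              # hypothetical
    flag::Threads.Atomic{Bool}
end
CancelToken() = CancelToken(Threads.Atomic{Bool}(false))

cancel!(tok::CancelToken) = (tok.flag[] = true; nothing)

# An explicit cancel point: throws only if cancellation was requested.
checkpoint(tok::CancelToken) = tok.flag[] ? throw(CancelledException()) : nothing

function busy_sum(tok::CancelToken, n::Int)
    s = 0
    for i in 1:n
        # Check every 1000 iterations so the tight loop stays cheap.
        i % 1000 == 0 && checkpoint(tok)
        s += i
    end
    return s
end

tok = CancelToken()
busy_sum(tok, 10)        # runs to completion
cancel!(tok)             # a later call throws at its first checkpoint
```

The cost is exactly the one noted in the list: code that never calls `checkpoint` can never be cancelled this way.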

As has been discussed here, passing around nurseries gives you basically unrestricted dynamic task spawning again.

I don't think it's unrestricted. You can always reason about the scope (= nursery) of the task lexically because you either see @sync or the nursery as an argument.

Yes, exactly!

I think it's interesting to compare to Go, where people started off allowing for goroutines to be spawned anywhere, but current practice converged on passing the context everywhere (I think? I'm not a Go programmer though.) The point is to have a middle ground which is less onerous and more composable than passing a context everywhere explicitly, while also avoiding the downsides of being able to implicitly start new branches of a computation with side effects which outlive the caller.

There seem to be some options for cancel points:

For completeness, I think we might need something like disable_sigint but more generic (not just for sigint). Trio has CancelScope.shield, asyncio has shield, Kotlin has NonCancellable, and it's planned(?) for Project Loom (Java).
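
For reference, this is roughly how the existing `disable_sigint` is used. The function `push_pair!` is a made-up example. This only shields against Ctrl-C (InterruptException); a generic shield for task cancellation would be a new API.

```julia
# `disable_sigint` is a real Base function, but it only concerns Ctrl-C;
# it is not a general cancellation shield.
function push_pair!(a::Vector{Int}, b::Vector{Int}, x::Int)
    disable_sigint() do
        # Ctrl-C handling is disabled while this block runs, so an interrupt
        # should not leave `a` and `b` with different lengths.
        push!(a, x)
        push!(b, x)
    end
    return nothing
end

a, b = Int[], Int[]
push_pair!(a, b, 7)
```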

We could have language syntax + semantics which makes resource acquisition and cleanup atomic with respect to cancellation.

Are you thinking something better than the do block? Maybe the f(...)! syntax https://github.com/JuliaLang/julia/issues/7721#issuecomment-170942938?

But I think it's reasonable to assume that it'd take some time to land. Meanwhile, how about discouraging the use of lock(l)/trylock(l) and recommending/implementing lock(f, l)/trylock(f, l)?
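
A short sketch of why the `lock(f, l)` form is the safer default: the lock is released even when the body throws. `add_safely!` is a hypothetical function for illustration; `lock(f, l)` and `islocked` are existing Base functions.

```julia
function add_safely!(l::ReentrantLock, xs::Vector{Int}, x::Int)
    lock(l) do
        # If this throws, `lock(f, l)` still unlocks `l` on the way out.
        x < 0 && throw(ArgumentError("negative values are not allowed"))
        push!(xs, x)
    end
    return xs
end

l = ReentrantLock()
xs = Int[]
add_safely!(l, xs, 1)
try
    add_safely!(l, xs, -1)
catch err
    @assert err isa ArgumentError
end
islocked(l)   # the lock was released despite the exception
```

This protects against forgetting `unlock` on error paths; it does not by itself address an asynchronous exception arriving between acquiring the lock and entering the protected block.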

Are you thinking something better than the do block? Maybe the f(...)! syntax #7721 (comment)?

Oh yes nice find, I don't think I've seen this discussion. It would be doubly nice if it could remove more uses of finalizers which are just generally problematic in many ways :)

Here's a conversation which touched on @sync vs Experimental.@sync from the Julia Computing slack (we didn't set out to discuss this, but it ended up relevant so I said I'd repost it here for everyone to read):

Chris Foster 4:55 PM @jeffbezanson Here's the trio issue by @njsmith describing what's wrong with trio's handling of child task errors: https://github.com/python-trio/trio/issues/611 I think they reach mostly same conclusion we did today (I guess not by coincidence! I did read through that thread a while back.)
Of course we've got the advantage that our error handling syntax is half baked right now :smile: So at least if we get the internal representation of concurrent errors correct we can eventually make a nice API to go with it.
Jeff Bezanson 1:24 AM To fill everyone in, Chris & I had a discussion about task error handling and we decided:
- Make excstack a first-class object and replace the .backtrace field in Task with it. A bunch of the C code for munging it can go away and/or move to julia.
- Combine CapturedException, CompositeException, and TaskFailedException into one thing similar to trio's MultiError. Need a name for this! PropagatedExceptions? ExceptionTree?
- That new thing should always be used instead of passing tasks around, since Tasks are hard to serialize. Instead we need to process a RawExceptionStack when it's serialized. Not sure exactly how --- we could replace the raw stack with a processed one inside ExceptionTree, or the raw stack itself could be internally converted to a processed form.
Jameson Nash 4:43 AM I’m not sure I entirely agree about the last point, since it implies we can’t start printing information to the user about an error in one task until we have confirmation that all tasks in a sync block have finished. Accordingly, I think we may want to think about moving towards having show_error do more of the work to process it, so that sync doesn’t need to wait for it to be in a consumable form.
Jeff Bezanson 4:46 AM We don't need to ban printing tasks --- a task will still contain its exception and backtrace, and you can still look at them. sync can also return the raw trace data from tasks, it doesn't need to process them.
Jameson Nash 4:49 AM The current intent for Experimental.@sync however is to be unable to even get the trace data from all of the Tasks
Jeff Bezanson 4:59 AM It doesn't seem right to me to insist on that in every case. Surely there are cases where you need to potentially collect errors from multiple tasks?
Jameson Nash 5:07 AM You can’t have both that and early return, unless the return set is the list of Tasks which lets show_error then do advanced diagnostics
Jeff Bezanson 5:11 AM I think it's fine to have "return as soon as there is one failure, and tell me just that one failure". But I think what trio does is as soon as there is one failure, try to cancel all the other tasks, then wait for them, then gather exceptions from all the failed ones. So we might need to provide both options. Does show_error need anything from a Task besides the exception and backtrace?
Jameson Nash 5:19 AM Yeah, “try to cancel” is horribly problematic though on many levels
Jeff Bezanson 5:19 AM Ok but that is a separable issue --- you can also just do what @sync does today, with no canceling.
Jameson Nash 5:20 AM Yes, but then you can’t collect the errors. You either must return the Tasks, or just the first error.
Jeff Bezanson 5:21 AM Why? You could still wait for all of them to finish.
Jameson Nash 5:21 AM I think if it has the Task object itself, we may be able to do some more interesting work showing data dependency cycles, meaning we can print much more useful stacktraces. Because not waiting for them all to finish is explicitly why Experimental.@sync exists as an improvement over @sync.
Jeff Bezanson 5:23 AM So waiting for a set of tasks to finish is inherently wrong? I just don't buy that there is never a need to propagate exceptions from multiple tasks. It seems pretty fundamental in trio.
Jameson Nash 5:23 AM Not necessarily, that’s just what’s being proposed there. And I agree we do want to get the exception from multiple tasks; I’m just saying we can’t have both the experimental design proposal and follow the last bullet point (of not returning the Task itself). And if we return the Task itself, we can also print much more interesting backtraces that the runtime system couldn’t have computed, but which must be deferred until error printing time (when we can examine the heap).
Jeff Bezanson 5:28 AM Ok, well I'm not totally opposed to throwing the Tasks themselves; I see you could do things like check which of them might be waiting for others. And Tasks are convenient in that they already bundle an exception and a backtrace; otherwise we need to awkwardly pass around two values for each error.
Jeff Bezanson 5:36 AM Another question is, do we always have a Task whenever we want to propagate an exception, or do we sometimes need to just propagate a bare pair of exception+backtrace?
Jameson Nash 6:04 AM We might need to propagate a bare pair if the root task (the one with the sync call) failed
Chris Foster 10:26 PM

Yeah, “try to cancel” is horribly problematic though on many levels

I'd love to understand better why this is.
Cancellation seems fairly fundamental to the desire for usable structured concurrency rather than do-what-the-heck-you-like concurrency. Experimental.@sync just gives up on scoping of child tasks so I'm not at all convinced it's a long term improvement over @sync
Chris Foster 10:43 PM I understand that cancellation is hard because resource management is hard (more or less). Preemptive cancellation leaves resources dangling so you really can't do that without complete isolation between tasks enforced by the runtime (erlang style) so the runtime can GC them. Clearly that's completely incompatible with the Julia task model. Ok, so what are the options for cooperative cancellation? There's the trio / pthreads / etc style where cancellation is level triggered and causes a small well-documented set of IO functions to return an error. That seems pretty reasonable for non-buggy programs which do IO. @jameson I think you said the pthreads-like option is insufficient last time we discussed this, but I don't think you explained exactly why.
Chris Foster 10:57 PM Problem is, we can't cancel compute with that model and a lot of compute is what people are likely to be doing in Julia... So is this the main sticking point: there's no known models for safely cancelling compute which would fit with our runtime?
Jameson Nash 11:08 PM I’ve just never seen a cancellation system that isn’t some combination of:
- rife with mistakes in the choice of functions (IO is the worst possible choice, IMO)
- fails to provide any useful bound on time to completion (this is the thing you actually want to cancel)
- incompatible with proper cleanup of resources (esp. trio, which tries to pretend otherwise)
- deprecated due to those issues
Chris Foster 11:12 PM Hmm, interesting. So why is IO the worst choice, and what would be a better one?
Time to completion is a real killer.
Jameson Nash 11:13 PM Transaction memory would probably be the main one. IO is the worst choice because it guarantees you’ll leave the other actors in a broken state. The goal is to cancel pure work (which might include IO with nobody listening), but instead it cancels the work most likely to lead to permanent corruption (and not to say I don’t use kill -9 with abandon, but I usually then do need to go fix the filesystem). My proposal (if I had time to work on this more) would be to make sure we have a way to close the dead IO objects explicitly when the nursery is exited. So I’d propose ignoring tasks completely, and making open the primitive that’s dynamically scoped inside a nursery. For the same reason, actually, that r = open(cat *); wait(r); read(r) is a deadlock: it waits for the Task instead of the unit of work (the readall call needs to happen before the wait).
Chris Foster 11:21 PM So is your point something like... it's not so much nesting of task lifetime which is important in structuring concurrency, but something related — nesting of any side effects the tasks may have?
Jameson Nash 11:22 PM I also think this isn’t something you can retrofit into an existing language, but fortunately, Julia has strong safe management of IO resources and exposes those handles typically to the user, so it’s something we should be able to do. Right.
Chris Foster 11:23 PM Ok very interesting. Very easy to confuse those two things, but it's side effects which really matter
Jameson Nash 11:23 PM Actor model (including Trio) gets you pretty far by asserting that one task must be equivalent to one side effect
Chris Foster 11:30 PM I'm just trying to digest all that. So the idea is that if tasks communicate exclusively with things which can be opened, then ensuring those channels are closed in a nested way can ensure side effects are nested? What about shared memory though?
Jameson Nash 11:30 PM But doesn’t seem to nest correctly? For example, what if I wanted to write:
```
with(database) do:
    for client in accept(port):
        @async with(client) do:
            handle(client)
```
- Chris Foster So, I feel like this is a really important point you're making here. But I don't understand it — would you be able to expand a little?
- Jameson Nash If one client dies, we don’t want it taking down the entire database. But so on client death, it may need to do IO with the database to cleanup and notify other live clients
- Jameson Nash But if the database dies, we need all of the tasks to exit without attempting to cleanup the database (which is already dead)
Jameson Nash 11:32 PM Shared memory I think also has a close function? Not all of them are created via the open verb as well. Some (like Channel) are simply constructed.
Chris Foster 11:35 PM

Shared memory I think also has a close function?

Sure, in principle, so no more sharing of normal Arrays? I'm not sure I'm understanding :slightly_smiling_face:
Jameson Nash 12:37 AM Ah, I thought you meant the stdlib of that name. I’ve done some design thinking on adding close to Array itself (currently available via empty! + reshape), but that seems tangential to this.
Chris Foster 12:51 AM Well I'm supposing that mutating memory (in a way which is visible to other tasks) should be correctly nested as much as other side effects should be. But I don't understand how that can be made practical.
Jeff Bezanson 1:00 AM This is a reason I think the current @sync is maybe not so crazy. If you can set up your tasks to work with cancellation, you can probably also set them up to just terminate. Cancellation just feels more magical and automated. Waiting for everything is a pain for debugging, but we can address that with better introspection and tooling i hope.
Jameson Nash 1:14 AM I think you might have close on a Lock, since that’s generally necessarily the shared resource (the Array itself then being single-owner-at-a-time with exchange controlled by the lock)? I haven’t looked into that though. Yeah, we could probably implement deadlock detection, and then the current @sync design also might just magically work (deadlock here meaning detection that everyone in the sync-set is waiting for an internal, so non-IO, event).

I don't get why "IO is the worst choice." You should be ready for failures when dealing with IO. So, it seems to be the best choice.

Also, I'm not sure why it'll "leave the other actors in a broken state." Lexical scoping in structured concurrency is powerful exactly because it works nicely with lexically scoped resource management (Julia's do and Python's with).

I agree using resources as cancellation tokens is a good implementation strategy but I think it's rather a very low-level implementation detail. I think it's also hard to use in practice.

A case-study on using resources as cancellation tokens

In #34543, I wondered how to make Threads.foreach(f, channel::Channel) behave nicely with errors. There are a few nice properties to have:

When one task throws, all other tasks terminate reasonably soon (structured concurrency).
Don't close the input channel.
All items take!n from the input channel are processed.

The current implementation in #34543 is, roughly speaking, something like this:

function tforeach1(f, channel::Channel; ntasks=Threads.nthreads())
    stop = Threads.Atomic{Bool}(false)
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in channel
                f(item)
                stop[] && break
            end
        catch
            stop[] = true
            rethrow()
        end
    end
    return nothing
end

tforeach1 satisfies properties 2 and 3 but not 1. This is because a task can be stuck on take!(channel) in the line for item in channel.

Initially, I thought it might be better to use the channel as the resource that is closed to propagate the cancellation:

function tforeach2(f, channel::Channel; ntasks=Threads.nthreads())
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in channel
                f(item)
            end
        catch
            close(channel)
            rethrow()
        end
    end
    return nothing
end
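
A quick way to see how this variant treats the caller's channel is the sketch below. `tforeach_closing` is just a restatement of `tforeach2` so that the block runs on its own (it needs Julia 1.3 or later for `Threads.@spawn`); the buffered input and the failing item are made up for the demonstration.

```julia
# Restatement of `tforeach2` from above, so that this block is self-contained.
function tforeach_closing(f, channel::Channel; ntasks=Threads.nthreads())
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in channel
                f(item)
            end
        catch
            close(channel)      # cancellation is signalled by closing the input
            rethrow()
        end
    end
    return nothing
end

input = Channel{Int}(8)         # buffered channel owned by the caller
foreach(i -> put!(input, i), 1:5)

err = try
    tforeach_closing(input; ntasks=2) do item
        item == 3 && error("failed on item 3")
    end
    nothing
catch e
    e
end

# The workers stopped, but the caller's channel was closed as a side effect.
(err isa CompositeException, isopen(input))
```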

tforeach2 satisfies properties 1 and 3 but not 2. That is to say, we shouldn't use a resource we do not own as the cancellation token. So how about creating a resource that we "own"?

function tforeach3(f, channel::Channel; ntasks=Threads.nthreads())
    owned_channel = Channel() do owned_channel
        for item in channel
            put!(owned_channel, item)
        end
    end
    @sync for _ in 1:ntasks
        Threads.@spawn try
            for item in owned_channel
                f(item)
            end
        catch
            close(owned_channel)
            rethrow()
        end
    end
    return nothing
end

Unfortunately, tforeach3 satisfies properties 1 and 2 but not 3. The owned_channel may be closed after an item is take!n from the input channel.

If we have something like Go's select (i.e., we have a way to say "exactly one of those effect happens"), it'd be possible to do this. But it's rather cumbersome:

function tforeach_safe(f, channel::Channel; ntasks = Threads.nthreads())
    done = Channel{Nothing}(ntasks)
    try
        @sync for _ in 1:ntasks
            Threads.@spawn while true
                @select begin
                    item = take!(channel) begin
                        # `take!(channel)` happened (but `take!(done)` didn't)
                        try
                            f(item)
                        catch
                            put!(done, nothing)
                            rethrow()
                        end
                    end
                    take!(done) begin
                        # `take!(done)` happened (but `take!(channel)` didn't)
                        break
                    end
                end
            end
        end
    finally
        close(done)
    end
end

(Note: more precisely, we need to use maybe_take!(::Channel{T}) :: Union{Nothing,Some{T}} instead of take!(channel) to handle the case where channel is closed. But it's pseudocode anyway.)

Actually, tforeach_safe is not enough because we don't control which take! is preferred in @select (at least if it is Go-like)... I guess we need something like

function tforeach_safe2(f, channel::Channel; ntasks = Threads.nthreads())
    taskref = Ref{Task}()
    done = Channel{Nothing}(ntasks)
    @sync try
        request = Channel(taskref = taskref) do request
            while true
                response = take!(request)
                @select begin
                    item = take!(channel) begin
                        put!(response, item)
                    end
                    take!(done) begin
                        close(response)
                        break
                    end
                end
            end
        end
        Base.@sync_add taskref[]
        for _ in 1:ntasks
            Threads.@spawn begin
                response = Channel(1)
                while true
                    try
                        put!(request, response)
                    catch
                        break  # closed by other worker tasks (or no more items)
                    end
                    item = try
                        take!(response)
                    catch
                        break  # closed by the `request` task
                    end
                    try
                        f(item)
                    catch
                        # Allow no more request (but still process
                        # requests that are already sent):
                        close(request)
                        put!(done, nothing)
                        rethrow()
                    end
                end
            end
        end
    finally
        close(done)
    end
    return nothing
end

Combine CapturedException, CompositeException, and TaskFailedException into one thing similar to trio's MultiError. Need a name for this! PropagatedExceptions? ExceptionTree?

Also, just wanted to chime-in regarding this point, and point out this package we ended up writing now that we're using more concurrency inside our code at RelationalAI: https://github.com/NHDaly/ExceptionUnwrapping.jl

It's scary that some internal decisions to multithread some code can change the kinds of exceptions that are being thrown, so we're now using has_wrapped_exception(e, T) everywhere instead of e isa T in our catch blocks.
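
To illustrate the problem without depending on the package, here is a small sketch. `contains_exception` is a hypothetical helper written for this example; it is not the ExceptionUnwrapping.jl implementation and only handles the two wrapper types shown.

```julia
# Hypothetical helper: does `e`, or anything it wraps, have type `T`?
contains_exception(e, ::Type{T}) where {T} = e isa T

function contains_exception(e::TaskFailedException, ::Type{T}) where {T}
    return e isa T || contains_exception(e.task.exception, T)
end

function contains_exception(e::CompositeException, ::Type{T}) where {T}
    return e isa T || any(inner -> contains_exception(inner, T), e.exceptions)
end

# Moving the failing code into a task changes the type the caller sees:
err = try
    @sync @async throw(KeyError(:missing))
catch e
    e
end

(err isa KeyError, contains_exception(err, KeyError))
```

A plain `e isa KeyError` check that worked before the code was made concurrent would silently stop matching here, which is the hazard described above.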

This seems relevant, though i think orthogonal, to the current discussion, so I just wanted to note it :)

I continue to try to find counter-examples to structured concurrency, or at least cases where it would uncomfortably constrain programming style. Recently I thought I found one but it turned out to be a false alarm. It's an interesting case to think through though.

The setting

For resource lifecycle handling, the open(thing) do resource ; ... end pattern is extremely useful because

The user can't forget to call close
The resource type owns the stack because it owns the open implementation; it can use block constructs like try ... finally, @sync, and control state can be kept in the stack. This seems closely related to the benefits of foldl based loops which @tkf has been emphasizing for a while.

The former is important for ergonomics, but the latter is what matters for structured concurrency as the open() may need to start async background tasks (eg, as a resource which communicates with a remote server). In strict structured concurrency these async background tasks can't escape the call to open(), so the pattern of passing a closure to be invoked with resource in the inner scope is very compatible with this.
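
As a sketch of "owning the stack", the hypothetical `with_squarer` below starts a background task for its resource, hands the resource to the caller's closure, and guarantees that the task has finished before it returns. The names and the squaring "service" are invented for illustration.

```julia
# A scoped resource backed by a background task. The task cannot outlive
# the call, because `@sync` waits for it before `with_squarer` returns.
function with_squarer(f)
    requests = Channel{Int}(Inf)
    results = Channel{Int}(Inf)
    @sync begin
        @async for x in requests        # background worker for the resource
            put!(results, x^2)
        end
        try
            f(requests, results)        # user code runs in the inner scope
        finally
            close(requests)             # lets the worker's loop terminate
        end
    end
end

answer = with_squarer() do requests, results
    put!(requests, 3)
    take!(results)
end
```

The caller never sees the worker task. Whether `with_squarer` uses concurrency at all stays an implementation detail, which is the property the paired open/close style cannot offer.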

For the same reason, the classic file-IO-style producer-consumer pattern of resource management with paired open/close can't be used in structured concurrency, because it leaves no context in which to manage the child tasks:

resource = open(thing)
do_stuff(resource)
close(resource)

(That is, not without making the task nursery an explicit part of the open API's parameters, which is no good for composability as it essentially colors open() via calling convention.)

Why not simply transition to always using `do` blocks and closures for resource handling?

Well we should for most things, even just for the ergonomic benefits of writing code which always correctly closes resources.

But it's actually pretty inconvenient in the REPL! In the REPL you want to enter a context where resource is available and use it interactively. So the scoped resource management is pretty inconvenient in this important case.

I thought this was a problem but it actually isn't if the REPL itself maintains a nursery. One way out is to have a macro which, roughly speaking, turns the nested interface

open(thing) do resource
    do_stuff(resource)
end

into something which looks like the producer-consumer interface by introducing a task explicitly into the REPL nursery to manage the resource lifetime. Schematically,

ready = Channel()
done = Channel()
@async_in_ctx repl_nursery open(thing) do r
    global resource = r  # Or whatever is required to set this in the REPL context
    put!(ready, true)
    take!(done)
end

# REPL work ...
do_stuff(resource)

put!(done, true)

Working sketch

Here's a sketch that actually works in current Julia (though I resorted to bare @async because we don't have explicit nurseries, or a REPL which maintains one):

struct AsyncResource
    done::Channel
end

Base.close(r::AsyncResource) = put!(r.done, true)

macro asyncdo(ex)
    @assert ex.head == :(=)
    var = ex.args[1]
    call = ex.args[2]
    doblock = :(placeholder() do x
        # This `global` is problematic!
        # Should be an `outer` variable!?
        global $var = x
        put!(ready_done, true)
        @info "Ready $(current_task())"
        take!(ready_done)
        @info "Done $(current_task())"
        nothing
    end)
    doblock.args[1] = call
    quote
        ready_done = Channel()
        @async $doblock
        take!(ready_done)
        AsyncResource(ready_done)
    end
end

Kinda ugly usage example

julia> resource = @asyncdo io = open("resources.jl") # internally, calls do-based form of open()
[ Info: Ready Task (runnable) @0x00007ff3843549d0
AsyncResource(Channel{Any}(sz_max:0,sz_curr:0))

julia> read(io, 1)
1-element Array{UInt8,1}:
 0x23

julia> close(resource)
[ Info: Done Task (runnable) @0x00007ff3843549d0
true

julia> isopen(io)
false

Of course it looks pretty weird to use this for normal file streams as in this example. But I hope it makes sense if you imagine a world in which there's no such function open("file") but only the scoped form open(f, "file").

Any thoughts on how that might mesh with the io = open(path)! syntax proposal in https://github.com/JuliaLang/julia/issues/7721?

Any thoughts on how that might mesh with the io = open(path)! syntax proposal in #7721?

Aha, thanks for the very relevant link. I was hoping to re-read that discussion last week but I just couldn't find it. (More complete XRef: The relevant discussion of postfix-! syntax starts here https://github.com/JuliaLang/julia/issues/7721#issuecomment-170942938; Jeff suggested it could relate to a more general defer form here https://github.com/JuliaLang/julia/issues/7721#issuecomment-171004109.)

One interesting thing about defer-like approaches is that they avoid excessive nesting syntax (and, perhaps, strict nesting of resource lifetimes) and still provide guarantees about when cleanup will run. There's a lot to like about it, especially with the ! shorthand syntax.

On the other hand, having open(path)! surface syntax suggests that resources would be implemented as state machines with open and closed states rather than a higher order function accepting the user's callback. That's relevant because the following seem incompatible:

1. Using explicit state machines for resource handling, where the resource implementer returns the resource from open().
2. Structured concurrency with child task lifetimes which do not outlive function calls.
3. Ensuring that the use of concurrency within open() is an implementation detail. (AKA, avoiding colored functions; colored by the calling convention of whether open() accepts a nursery or not.)

I think there are various ways to attack this apparent problem.

Don't use explicit state machines for resource handling; invent some lowering for ! (or other convenient surface syntax) which lets the user use them like a state machine but the implementer implement them as a normal higher order function. FLoops.jl is some interesting prior art for this kind of approach. (The AsyncResource sketch shows one ugly and inefficient way that a callback-based open() can be transformed into a state machine. Perhaps there's something less bad!)
Naturally, one could ditch structured concurrency. Seems like a pity!
Colored functions; make the calling convention for "resource functions" include a nursery in some way. Could possibly be viable with the right lowering for !, as the appearance of ! in the source code seems like it would induce an explicit chain of custody. (Cf. the way that Kotlin's coroutine context is passed)

No doubt there are other options.
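For concreteness, here is a minimal sketch of the callback-based shape when the resource needs a background task. The function name with_ticker and its interval keyword are made up for illustration (this is not an existing API). The child task is started and joined inside the call, so it cannot outlive it, and the caller never sees a nursery:

```julia
# Hypothetical scoped resource; `with_ticker` is an illustrative name only.
function with_ticker(f; interval=0.01)
    ticks = Channel{Int}(Inf)              # unbounded buffer, so the producer never blocks
    stop = Threads.Atomic{Bool}(false)
    # Child task: an implementation detail the caller never sees.
    child = @async begin
        n = 0
        while !stop[]
            n += 1
            put!(ticks, n)
            sleep(interval)
        end
    end
    try
        return f(ticks)                    # user's callback runs while the child is alive
    finally
        stop[] = true                      # ask the child to finish ...
        wait(child)                        # ... and join it before returning
        close(ticks)
    end
end

first_tick = with_ticker() do ticks
    take!(ticks)
end
```

Turning this into an open()/close() pair is exactly where the trouble starts: the try/finally that joins the child has nowhere to live once open() has returned.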

My current thinking is that the higher-order function pattern does have a limitation in that it is too flexible.

For example, in FGenerators.jl (the "sister project" of FLoops.jl), I added FGenerators.@with so that the open(...) do pattern can be used for performance-oriented iteration over "data collections with resources". This is required to avoid boxing of the variables captured by the closures.

The closure problem itself may be solved at some point. But I think it's valuable to encode the basic property that the "do block" is executed exactly once. For example, Rust has a similar FnOnce trait (although Rust's resource management is usually RAII). Python's with block satisfies this property exactly. OTOH, in FGenerators.@with, the user has to manually make sure this is the case. This is obviously not ideal (and doing so rigorously ATM is virtually impossible because of multiple dispatch).

the open() may need to start async background tasks

3. Ensuring that the use of concurrency within open() is an implementation detail.

These points are interesting. In a way, Trio is not doing structured concurrency super strictly if you look at it from this standpoint. That is to say, it allows you to leak a resource, since you can call __aenter__ manually. So you can define open/close as a state machine quite easily, even when you need a nursery as a sub-resource:

import trio

class ResourceWithNursery:
    async def __aenter__(self):
        # Enter the nursery's context manager by hand and keep it for __aexit__
        self.manager = trio.open_nursery()
        nursery = await self.manager.__aenter__()
        nursery.start_soon(trio.sleep, 0.1)
        print("Enter")

    async def __aexit__(self, exc_type, exc_value, traceback):
        print("Exit")
        await self.manager.__aexit__(exc_type, exc_value, traceback)

Of course, most of the time Python programmers do not write resource managers by hand. Instead, it'd be derived from the generator syntax:

import trio
from contextlib import asynccontextmanager

@asynccontextmanager
async def resource_with_nursery():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.sleep, 0.1)
        print("Enter")
        yield
        print("Exit")

I think this is close to the spirit of "the implementer implement them as a normal higher order function." From my experience of playing with IRTools.jl a bit, I think something like @asynccontextmanager is totally possible.

But the "lesson" I'd derive from this is rather that it is OK to provide an "unsafe" API as long as the API clearly signals its unsafeness, to nudge the implementer to be very careful. Even Rust can't save you from a mis-implemented drop anyway. The open/close state machine would be one such "unsafe" API. To explicitly denote that certain APIs are low-level, I think Python does a good job with the "dunder" methods. I'm not suggesting dunder as the only solution but, since we already have __init__, it may not be so crazy to have __open__ and __close__. It raises an alarm if you see code calling these functions manually.
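As a rough sketch of how that could look in Julia: the names __open__, __close__, with_resource and the Connection type below are all hypothetical (none of them exist in Base). The two dunder functions are the "unsafe" state-machine halves, and the scoped form that ordinary code would call is derived from them:

```julia
# Hypothetical low-level protocol; neither `__open__` nor `__close__` exists in Base.
mutable struct Connection
    name::String
    isopen::Bool
end

# "Unsafe" halves: anyone calling these directly must pair them by hand.
__open__(::Type{Connection}, name) = Connection(name, true)
__close__(c::Connection) = (c.isopen = false; nothing)

# Safe scoped form, written once in terms of the two halves.
function with_resource(f, T::Type, args...)
    r = __open__(T, args...)
    try
        return f(r)
    finally
        __close__(r)       # runs on normal return and on error
    end
end

conn = Ref{Connection}()
result = with_resource(Connection, "db") do c
    conn[] = c             # keep a handle only to inspect the state afterwards
    c.isopen
end
```

A call to __open__ anywhere outside with_resource then stands out in review the same way a manual __aenter__ call does in Python.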

A bit of an aside, but I think one of the design challenges of the open(...)!/defer-like approach is what to do when you want to control the lexical scope where the close must happen. For something like a mutex it matters a lot.

Other options are to disallow defer calls in global scope, ignore them (let finalizers handle it), or implicitly turn them into finalizers in global scope. Not sure if any of those approaches simplifies things.

At global scope it would seem logical to turn defer handlers into finalizers as there's no static lifetime information. I'm unsure how defer would interact with closure captures, but we should probably discuss that in #7721.

OTOH, in FGenerators.@with, the user has to manually make sure this is the case. This is obviously not ideal (and doing so rigorously ATM is virtually impossible because of multiple dispatch).

Interesting point. Maybe we could have a wrapper which dynamically asserts the call-once property, but can be optimized away when enough type information is available.
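A minimal sketch of the dynamic half of that idea, assuming made-up names CallOnce and with_checked (nothing here is part of Base or FGenerators.jl, and the counter is not thread-safe). It only shows the runtime assertion; whether the compiler could elide it is a separate question:

```julia
# `CallOnce` is a hypothetical wrapper that counts how often the callback runs.
mutable struct CallOnce{F}
    f::F
    calls::Int
end
CallOnce(f) = CallOnce(f, 0)

function (c::CallOnce)(args...)
    c.calls += 1
    c.calls == 1 || error("callback called more than once")
    return c.f(args...)
end

# Pass the wrapped callback to the higher-order function `hof`,
# then check that it ran exactly once.
function with_checked(body, hof)
    once = CallOnce(body)
    result = hof(once)
    once.calls == 1 || error("callback ran $(once.calls) times, expected exactly once")
    return result
end

good(f) = f(42)            # calls its callback exactly once
never(f) = nothing         # forgets to call it
twice(f) = (f(1); f(2))    # calls it twice

value = with_checked(good) do x
    x + 1
end
```

Here with_checked(identity, never) and with_checked(identity, twice) both throw, while the well-behaved case passes the callback's result through.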

Some interesting discussion of Go's context here:

https://news.ycombinator.com/item?id=24323564#24327585

Swift now has a set of WIP proposals for concurrency in general, including one on structured concurrency. From a fairly quick read, their proposal seems very similar to the Trio approach. However, it's an interesting and very relevant read in its own right, as it's quite concise and discusses structured concurrency as a language feature rather than at the library level:

https://github.com/DougGregor/swift-evolution/blob/structured-concurrency/proposals/nnnn-structured-concurrency.md

I found the section on cancellation interesting. The proposed model seems quite standard: it's fully cooperative cancellation where they expect IO operations to be cancellable. To quote:

With that said, the general expectation is that asynchronous functions should attempt to respond to cancellation by promptly throwing or returning. In most functions, it should be sufficient to rely on lower-level functions that can wait for a long time (for example, I/O functions or Task.Handle.get()) to check for cancellation and abort early. Functions which perform a large amount of synchronous computation may wish to periodically check for cancellation explicitly.

Somewhat related to all this, it seems Swift will go down the path of colored functions — see the WIP async/await proposal.

I find these "standard" cancellation proposals (based on Trio, inherited from C?) to not be very helpful. They are supposed to give you implicit cancellation after a timeout, but then provide neither. They still only cancel at specific points that explicitly check for the condition (when reaching IO operations), and thus they still require you to have resources; they just don't let you clean up those resources properly (close is also a cancellation point in C, but it will also probably leak memory and break other state since any cleanup is skipped). And worse, they seem to add yet another "color" of functions (yellow?): there are those that are cancelled (nursery) and those that aren't (detached). The cancellation that I'm arguing we already have is thus equivalent in functionality but better on each axis: we define that the cancellation happens when interacting with a resource ✔️, except the check really is implicit ✔️ (via close state checking), and it is fully scoped and doesn't require passing along additional handles ✔️ (just using the ones it already has). This wasn't possible in C since there's no resource management of file descriptors, so pthreads has this sort of cancellation instead. But we don't need to keep inheriting the limitations of that past, if we have an IO, Channel, and Event system planned to handle this.
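As a rough illustration of the resource-based style described above, here is a sketch that uses only Base's Channel (the variable names are made up). Closing the channel the worker reads from is enough to stop it, and no separate token is passed around:

```julia
jobs = Channel{Int}(0)     # unbuffered: each put! hands one job to the worker
processed = Int[]

worker = @async begin
    # Iterating a Channel ends once it has been closed and drained,
    # so the "cancellation check" is implicit in using the resource.
    for job in jobs
        push!(processed, job)
    end
    :stopped
end

put!(jobs, 1)
put!(jobs, 2)
close(jobs)                # closing the resource is the cancellation request
status = fetch(worker)     # the worker returns normally, so its cleanup code can run
```

The limitation is the one raised elsewhere in this thread: a task that is busy computing and never touches the channel will not notice the close.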

Right, I know you've said on multiple occasions that you don't like other cancellation systems. To be clear, I'm not arguing that this Swift proposal is doing it "the right way", only pointing out that it's interesting to see. BTW, did you read it yourself? I'd be curious whether you think my summary is accurate.

The cancellation that I'm arguing we already have

My problem with the cancellation we already have is that using @sync correctly seems to be very tricky and manual — it's extremely easy for a task to accidentally crash and not have its siblings be cancelled. It's necessary to manually pass around cancellation tokens if there's no resource naturally available. For cancelling compute (and memory access), I wish there was a way to check "should this task be cancelled?" without creating a Channel or some other resource and passing it by hand.
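To show what "passing it by hand" means for compute-bound code, here is a minimal sketch with a made-up function sum_until_cancelled and a Threads.Atomic{Bool} standing in for the token. This is only one possible shape for such a token, not an established API:

```julia
# Manual cooperative cancellation: the token must be threaded through every call.
function sum_until_cancelled(cancel::Threads.Atomic{Bool}, n)
    total = 0
    for i in 1:n
        cancel[] && return (:cancelled, total)   # explicit cancellation point
        total += i
        i % 1000 == 0 && yield()                 # let other tasks run now and then
    end
    return (:finished, total)
end

cancel = Threads.Atomic{Bool}(false)
task = @async sum_until_cancelled(cancel, typemax(Int))
yield()                    # give the task a chance to start
cancel[] = true            # request cancellation
status, _ = fetch(task)
```

Every function between the spawner and the inner loop has to accept and forward the token, which is the manual plumbing being complained about.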

Experimental.@sync intentionally leaks tasks which may still be running, but I don't think this is a good solution unless the leaked tasks truly have no observable side effects. In current Julia there are many ways a task can have side effects visible to other tasks.

Okay, the overall language- vs library-level choices and concurrency strategy decisions are way beyond my Julia pay grade, but there may be some short-term fixes available for one of the comments above that I ran across while poking around #32677.

it's extremely easy for a task to accidentally crash and not have its siblings be cancelled

Completely agreed with this. But I think this might be a more tactical implementation issue: the @sync task "manager" currently just walks through the @async tasks in order, waiting for each to complete. This allows interdependencies between the @async tasks to lock things up (for example when a task later in the order dies while a task earlier in the order is waiting for it).

An improved parallel structure would be to schedule the @sync task after the completion (for any reason) of each @async task, and have it then scan across the tasks for errors/completion, and immediately throw exceptions that arise. Simple example in the #32677 comments. This won't stop programmatic lockups (one can still generate lockups with Channels if one tries) and it does not guarantee the shutting down of the sibling tasks on an error, but it would prevent lockups due to premature crashes and give the user a guaranteed stacktrace to debug with. Not a total solution nor a grand API shift, but a pretty decent improvement for a minimal change.
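A rough sketch of that scheme, assuming a made-up helper wait_all_failfast (this is not the code from the #32677 comments, and not how @sync is implemented). Each task gets a small watcher that reports completion, and the waiter rethrows the first failure in completion order rather than spawn order:

```julia
# Hypothetical helper illustrating "wake the waiter whenever any task completes".
function wait_all_failfast(tasks::Vector{Task})
    finished = Channel{Task}(length(tasks))
    for t in tasks
        # Watcher: reports completion whether `t` returned or failed.
        @async begin
            try
                wait(t)
            catch
                # failure is inspected by the waiter below
            end
            put!(finished, t)
        end
    end
    for _ in tasks
        t = take!(finished)
        # Rethrow the first failure right away, with the failed task attached.
        istaskfailed(t) && throw(TaskFailedException(t))
    end
    return nothing
end

gate = Channel{Nothing}(0)
slow = @async take!(gate)      # blocks until someone touches `gate`
bad = @async error("boom")     # fails almost immediately

err = try
    wait_all_failfast([slow, bad])
    nothing
catch e
    e
end
close(gate)                    # unblock the sibling; the sketch itself does not cancel it
```

Waiting on slow first, as an in-order walk would, would hang here. As noted above, the sibling is still left running after the error.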

Catching up... I also just found https://github.com/JuliaConcurrent/Julio.jl

Does this package: 1) handle all issues discussed in this issue, if not, where are the gaps? 2) eliminate the need to do something in Base? why or why not?

Cc: @tkf
