JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
44.97k stars 5.43k forks source link

Taking Structured Concurrency Seriously #33248

Open c42f opened 4 years ago

c42f commented 4 years ago

In julia 1.3 the threadsafe runtime for truly parallel tasks has arrived which will greatly increase the interest in concurrent computation in Julia's core numerical and technical computing community. For the moment, Threads.@spawn is experimental, but now is a good time to think about APIs for expressing concurrent computation in a safe and composable way with the aim to use such an API as an ecosystem gets built on top of the new functionality.

There's been a lot of recent excitement and hard work on libraries to support so-called structured concurrency, with the claim that this is fundamentally a better model for composable concurrent computation.

I'm convinced that people are onto something here and that the overall ideas are an excellent fit for the Julia language. However, there's a lot of prior art to absorb so I have started a document with the aim to do two things:

Click here to read the WIP document

Meta note

I'm using this julep as a guinea pig to try out a go-proposals style workflow, as discussed at https://github.com/JuliaLang/julia/issues/33239. So let's try discussing this here (non-substantive procedural comments to go to the Juleps repo.

tkf commented 4 years ago

I'm copying some comments I made in https://github.com/c42f/Juleps

Discuss "black box rule" and robust cancellation separately?

I think it might be better to emphasize that structured concurrency has a few sub-concepts.

My understanding is that what njs calls "black box rule" is a sub-concept of structured concurrency and it's more fundamental than robust cancellation. So, I'd suggest to introduce "black box rule" and cancellation as different sub-concepts from the get-go. I think this is particularly important in Julia because many compute-intensive task don't need cancellation and people may not want to have runtime and conceptual overhead of robust cancellation. An alternative term may be call tree.

One counter-example to my argument that "black box rule" is more fundamental is Go's errgroup and context. errgroup is the implementation of the "black box rule" while context is the implementation of robust cancellation. errgroup is implemented on top of context and it may contradict with my argument that "black box rule" (errgroup) is more fundamental than robust cancellation (errgroup). But I think the fact that they are implemented in separate packages in Go at least backs up my argument that those are different concepts.

So, maybe it is a good idea to organize the sections like this?

A benefit of having different color

I suggested to add the following paragraph at the end of Syntax section https://github.com/c42f/Juleps/pull/2

On the other hand, a syntax such as async/await is a useful visual marker that emphasise when the cancellation can happen (await) and what functions are cancellable (async). Note that this does not have to be implemented at language level. For example, Go's context and errgroup enforces reader to recognize where the cancellation can happen (listening to the Done channel) and what function can be cancelled (accept Context as an argument).

Also quoting another comment:

I just noticed that you touched this already in "The challenge of preemptive cancellation" but with relation to preemptive cancellation. But maybe it's useful to touch it a bit here in a different context?

Are go statements harmful?

Copying comment from https://github.com/c42f/Juleps/pull/3

OK... I think this can be very controversial. But I suggest to add:

It is important to emphasize that, as absence of goto statement that can cross function boundaries is the major feature of modern languages, the absence of "go statement" (@spawn in Julia 1.3) is the major feature of the structured concurrency. This is because it is impossible for caller to know that a function follows structured concurrency without the language-level guarantee.

To emphasize that it has large impact in Julia than (say) Python, I also added:

Note that the situation in Julia is more serious compared to other languages which has functions with "two-colors" (see below) and plugable eventloop (e.g., Python) where it is possible to enforce the absence of "go statement" at the library level.

c42f commented 4 years ago

Are go statements harmful?

I'm convinced they are harmful. But luckily this isn't a matter of opinion if we can agree that

  1. Structured concurrent programs are simpler to reason about because they are more restricted in form and have many benefits in composability.
  2. There is no significant penalty for giving up the flexibility of an unstructured approach.

The first of these doesn't seem controversial to me.

Lucky for us, the second can be attacked empirically due to the efforts in Trio: @njsmith has repeatedly stated that they've found no reason to back off from their insistence that all tasks are started from a nursery. The Kotlin people seem to agree (though they do have a nod to backward compatibility with the global coroutine scope). I think the other empirical answer we can have to this is to try implementing a suite of standard concurrency patterns ourselves in Awaits.jl or equivalent prototype.

c42f commented 4 years ago

A benefit of having different color

syntax such as async/await is a useful visual marker that emphasise when the cancellation can happen (await) and what functions are cancellable (await)

I agree this can be seen as a benefit and I've updated the document. In a cooperatively concurrent system it also marks where task switching can happen which can otherwise be a source of timing bugs, for example https://github.com/JuliaWeb/HTTP.jl/issues/382#issuecomment-497723980

tkf commented 4 years ago

Lucky for us, the second can be attacked empirically due to the efforts in Trio

I also add that Trio convinced Python core devs to add structured concurrency in Python itself: Yury Selivanov - Asyncio in Python 3 7 and 3 8 - YouTube. (Although it says asyncio.TaskGroup (≈ nursery) is likely to be in Python 3.8, I don't think it is added yet.)

when the cancellation can happen (await) and what functions are cancellable (await)

Oops, the second await should be async. Thanks for fixing it in the Julep :)

In a cooperatively concurrent system it also marks where task switching can happen which can otherwise be a source of timing bugs

Just to be clear, we are already in Go/Kotlin group as there is no async/await. We can still take Go's context approach to pass around cancellation tokens (it's the Done channel in Go's context). This would clarify cancellable functions from the call signature (you can't call it unless you pass the cancellation token). However, we can't use this mechanism for I/O functions because existing I/O functions do not have this signature and their color is the same as synchronous functions.

I requested to add that paragraph because I thought it was important to recognize the pros and cons of having async/await syntax to assess the advantage and disadvantage that already exist in Julia. I think the main discussion point might be "Should we pass around cancellation tokens explicitly?"

tkf commented 4 years ago

By the way, it'd be nice to have a list of compute-intensive applications that require cancellation.

One example I found/implemented was stoppable reduce in Transducers.jl. It can be used to implement parallel findfirst.

c42f commented 4 years ago

The following thread about how Rust's new async/await relates to SC needs a careful read. There's many interesting points related to cancellation and zero cost abstractions.

https://trio.discourse.group/t/structured-concurrency-in-rust/73

c42f commented 4 years ago

We can still take Go's context approach to pass around cancellation tokens (it's the Done channel in Go's context). This would clarify cancellable functions from the call signature (you can't call it unless you pass the cancellation token). However, we can't use this mechanism for I/O functions because existing I/O functions do not have this signature and their color is the same as synchronous functions.

We could add versions of IO which use this mechanism and make deprecating the old ones a long term (Julia 2.0) goal. So I think it's worth considering, if only to understand which points in the design space we're giving up on. Actually a full implementation of SC seems to be a julia 2.0 goal in any case because we already have bare @async and it would certainly be breaking to change how that works.

The action of passing around a context could be wrapped up with macros to form the expected lexically scoped chain of custody for cancellation without much syntactic overhead. This would indeed give us colored functions (colored by calling convention), but at least the syntactic price would be no worse than python and there's no language change required.

Here's a related question: are there any systems which don't have colored functions, and which also have a compelling cancellation story?

c42f commented 4 years ago

Over on the Julep, @vchuravy notes about the Tapir PR:

Note that this is primarily about:

  • Compiler representation of tasks
  • Experimenting with Cilk-style semantics (e.g. may happen in parallel or what I have taken to call structured parallelism)

My thought was that the outcome of the Tapir effort could be visible to users if we find that restricting semantics in a targeted way would greatly improve optimization opportunities. For example, maybe it could enable us to transform symmetric coroutines into lightweight state machines in some cases?

@vchuravy I think it would be really valuable if you could provide a summary of how you see this effort fitting into the bigger picture.

tkf commented 4 years ago

We could add versions of IO which use this mechanism and make deprecating the old ones a long term (Julia 2.0) goal.

I didn't think of this. That's an interesting possibility.

we already have bare @async

Good point. I forgot about it. I still think we can pretend to have structured concurrency before 2.0 since it is very likely that people are lured to @spawn (or whatever the API we'll have) because of the potential performance benefits.

The action of passing around a context could be wrapped up with macros to form the expected lexically scoped chain of custody for cancellation without much syntactic overhead.

That's actually the strategy I used in Awaits.jl.

are there any systems which don't have colored functions, and which also have a compelling cancellation story?

The only thing somewhat related to this I know is thredo (built on top of curio which is an async I/O library that inspired trio). In EuroPython 2018 (David Beazley - Die Threads - YouTube), the author talked about designing a threading library from scratch, motivated by the two-color problem and also robust cancellation (the talk is mostly demo of thredo). thredo API does not enforce the "black box rule" but it has cancellation and also has (optional) ThreadGroup (≈ nursery). The cancellation can happen only if you use thread building blocks like Queue, Lock, etc. imported from thredo. But it seems that this is treated as limitation rather than a feature for marking cancellable functions. For example, it has a monkey-patching helper to replace stdlib sockets used in third-party libraries with the thredo implementation. It also has implication in the CPU-intensive code as it's cancellable only when it hits curio's eventloop. He is mentioning that this is actually a desirable feature and I guess we all agree randomly inserting checkpoint would be awful for performance-oriented language like Julia.

But I've never played with it and I don't know if there is a discussion on structured concurrency aspect of thredo. AFAICT this is an experimental library and I don't think it's used in the wild. I'm just mentioning it since it seems to fill the design space of "cancellation without color."

tkf commented 4 years ago

are there any systems which don't have colored functions, and which also have a compelling cancellation story?

Java's Thread.interrupt was mentioned as an unsuccessful story in https://trio.discourse.group/t/structured-concurrency-in-rust/73/25

c42f commented 4 years ago

Hm. Cancellation seems to be a really, really hard problem, and I'm sure I don't understand how hard yet! But I'm naive enough to still think there's some hope, for example kill -9 is an example of timely, preemptive cancellation which actually works most of the time.

So what is it that makes kill -9 work when other things don't? It's (partly?) the strong isolation between resources of one process and another, allowing the resources to be gc'd by the OS kernel.

On slack, @vtjnash had some interesting things to say about that. To quote directly

I think the difference in practice is that sending kill -9 actually has the effect of closing all of the open resources. That’s a scenario the other processes are usually written to handle in some way. That it also stops running the code is somewhat irrelevant, because it’s already been cut off from all side-effect channels

I think it's interesting is that in the real world cancellation is definitely a thing: if your process starts behaving weirdly you kill -9 it. If that locks up sibling processes on your OS, you restart the machine. If that causes a distributed system to fail, you might need to restart a whole bunch of machines.

There's plenty of messy examples outside of computing. Consider companies; if a business unit is failing to make profit, it's eventually restructured or terminated and its duties redistributed. If that causes the company to fails to service its debts, the company goes into bankruptcy and gets forcibly restructured or just plain killed off (leaving the investors to do the garbage collection).

It's all very messy, but I think there's some kind of interesting informal hierarchy of supervision going on, and I've got a feeling that this is where Erlang supervisors might come in. More reading to do...

tkf commented 4 years ago

Why are you considering preemptive cancellation? Isn't the difficulty of it the motivation for cooperative cancellation? I don't think kill -9 or even a reboot is a good model for a safe cancellation. For example, this won't work if you have a filesystem-based lock (and that's why you need to do rm -f .git/index.lock).

Even in the non-computing example, if a company with copyright or license for an artwork got a hard shutdown, it may not be able to release the resource (= copyright) and then make the artwork becomes unusable. (OK. I'm not a lawyer and I have no idea if that's a reasonable scenario.)

c42f commented 4 years ago

Why are you considering preemptive cancellation?

An excellent question. Maybe because I'm naive and I want things which I don't know are impossible yet ;-)

Let's lay out some desirable properties for cancelling things which I hope we can agree would be nice. This is not a judgement about whether these are mutually possible! I've labeled each property for sake of further discussion. I would like cancellation to be:

It seems there's several models of cancellation which could be considered which might satisfy some subset of these properties.

  1. Cooperative cancellation with task-defined handlers (Trio / deferred pthread_cancel style) where cancel points are highly restricted, and tasks may run cleanup code.
  2. Hard preemptive cancellation (kill -9 style) where you need enough isolation so that resources can be GC'd. Seems incompatible with cleanup defined in the task itself. Has several known in-process APIs which have been a historical disaster; Thread.stop etc.
  3. Preemptive cancellation with task-defined cleanup handlers (InterruptException style) — a weird mixture where cancel points are "almost" everywhere, but task-defined cleanup can still run. Known to be a reliability disaster if just tossed into a language without thinking very hard about what "almost" means. (Possibly rust-like?)
  4. Resource-based cancellation; @vtjnash pointed this out on slack — "Julia-style cancellation, aka OO cancellation. Any IO object can be cancelled by calling close, which initiates termination of outstanding requests". This seems to be a very different model conceptually as it focuses on closing resources rather than cancelling computations (though closing a resource may lead to cancelling a computation).
stevengj commented 4 years ago

There is a formal structure of parallel programs that was developed for Cilk and is now used in Tapir/LLVM called serial semantics (also called the "serial elision" or "serial projection" property): this refers to a parallel program that could be correctly executed by running the tasks serially.

One nice thing about serial semantics is that it enables various analyses, e.g. efficient detection of race conditions, or static compiler optimizations in Tapir. So it is a nice property for a language to allow the programmer to express unambiguously.

Are serial semantics a strictly stronger restriction than "structured concurrency"?

tkf commented 4 years ago

@stevengj In Tapir PR https://github.com/JuliaLang/julia/pull/31086, @vchuravy said that

It is important to note that the semantics of this representation are parallel and not concurrent, by this extent this will not and cannot replace Julia Tasks.

So, I think it is not relevant for the cancellation part of structured concurrency. But I think it is related to the "black box rule" part (see my comment above https://github.com/JuliaLang/julia/issues/33248#issuecomment-531139728) of structured concurrency. I'm not familiar with the serial-projection property but this property seems to imply that "black box rule" must be followed (e.g., both of them disallow leaking tasks). However, as @vchuravy demonstrated you can write concurrent code following "black box rule":

In order to exemplify this issue see the following Julia task code:

@sync begin
    ch1 = Channel(0)
    ch2 = Channel(0)
    @async begin
        take!(ch1)
        put!(ch2, 1)
    end
    @async begin
        put!(ch1, 1)
        take!(ch2)
    end
end

Doing a serial projection of this code leads to a deadlock.

So, I think serial-projection property is strictly stronger requirement than "black box rule", too.

tkf commented 4 years ago

@c42f IIUC, I guess I can (almost) summarise your models by how Cancel-Triggered Exception (CTE) behaves, like this?

CTE can happen anywhere CTE happen only at await etc.
CTE catchable Model 3 (e.g., Java) Model 1 (e.g., Trio)
not catchable Model 2 (e.g. kill -9) N/A

But I don't understand what is "Julia-style." It seems to fit in Model 3. Also, I don't understand why "resource-based cancellation" is not doable in Model 1. Trio has "I/O resource" that can be used as communication (lock, channel, etc.) and they are aware of cancellation.

stevengj commented 4 years ago

So, I think serial-projection property is strictly stronger requirement than "black box rule", too.

Okay, so should we be talking about ways for users to express the stronger semantics (serial projection) or the weaker one (black box rule)? The former seems to allow strictly more tools and optimizations, so maybe it is more useful?

tkf commented 4 years ago

I'm in favor of having dedicated syntaxes to express the serial-projection property and black box rule.

I wonder if it is possible to "just" introduce a different version of @sync. That is to say for concurrent tasks we use

@sync begin
    ...
end

and for parallel computations with the serial-projection property we use

@sync_tapir begin  # maybe there is a better name
    ...
end

The body ... contains @spawn etc. In particular, replacing @sync_tapir with @sync is valid (the other way is not always true). This way, we can reuse functions for parallel and concurrent programs.

Naively thinking, this seems to be possible if we create a local "task group context" (a more sophisticated version of the vector assigned to Base.sync_varname; aka nursery) of different types in @sync and @sync_tapir. It is better to use type-based solution (rather than, say, macro-based solution) because it let us use the pattern to pass around the task group context and launch tasks inside other functions without immediately synchronizing them inside those functions. (FYI, if you want see actual (toy) code example, see, e.g., https://github.com/tkf/Awaits.jl/blob/24712770164cbd22c4de2568dc4ace1c75553eca/test/test_simple.jl#L21-L44 and Awaits.@go.)

c42f commented 4 years ago

Yes it seems like the serial projection property is rather restrictive and can't replace the general use of Tasks. But it's also a rather natural fit for data parallel problems in technical computing where the amount of parallel compute available at runtime is irrelevant to the algorithm.

So naively I'd favor having syntax for both; with @sync_tapir being an assertion by the programmer that the serial projection is valid for the contained code.

it let us use the pattern to pass around the task group context and launch tasks inside other functions without immediately synchronizing them inside those functions

I'm not so sure about this: serial projection seems to be a local property of an algorithm, and passing a @sync_tapir-generated context to an algorithm which doesn't have the serial projection property could cause deadlock. But perhaps I'm misunderstanding.

vchuravy commented 4 years ago

So naively I'd favor having syntax for both; with @sync_tapir being an assertion by the programmer that the serial projection is valid for the contained code.

I am worried about the additional complexity for the user to reason about the difference between a task with SPP and concurrent task. I am working on a couple of ideas, but I am hoping that we don't have to introduce a separate language concept. If that doesn't work out (you know research), we may want to consider having parallel constructs like for-loops with @threads or map to use tasks with SPP.

tkf commented 4 years ago

I totally agree that it's nice to have composable high-level abstractions like map, (parallel) foreach and reduce so that users can first try to solve their problem without getting into the details of parallel computing. (BTW, making reduce composable is where you start needing something like transducers :wink:)

But I'm very curious to know if all programs with SPP are expressable by the higher-order functions (or more like, by reduce, as other examples I brought up can be expressed in terms of it). If not, maybe it makes sense to have @sync_tapir as low-level "unsafe" construct?

tkf commented 4 years ago

if all programs with SPP are expressable by the higher-order functions

Actually, the fib example in the Tapir PR is a good example where reduce is not enough.

vchuravy commented 4 years ago

If not, maybe it makes sense to have @sync_tapir as low-level "unsafe" construct?

Yes I think this will be the case. But let's not get ahead of ourselves I have many issues to solve first :)

saolof commented 4 years ago

Personally, I think the 1.3 golang-style API with @spawn being analogous to Go's go statement is already significantly better than how most languages do it, and enough to single-handedly win over quite a few programmers frustrated with how the mainstream languages do it.

If we're going for structured concurrency as a possible way to handle task lifetimes, I think it's worth exploring the full parameter space of new ways to do it instead of specifically locking onto Trio's way of doing it.

For example, I think the biggest new trend in handling colored async code is algebraic effects (see https://overreacted.io/algebraic-effects-for-the-rest-of-us/ for an example of how they'd look in a dynamic language), where the task spawning is handled by an effect handler. This has the advantage of both being a much more general way of preserving the function abstraction on functions that do special things, rather than functions specifically. It's also nicer than trio's way of passing nurseries to inner scopes in the same way that async await or monadic programming is nicer than passing callbacks to inner scopes.

On the other hand, effects are not exactly something I'd expect the compiler team to want to implement in Julia in the near term. But they do illustrate the value in waiting before going for a specific highly opinionated way to handle concurrency before thinking carefully about interactions with other possible additions that could lead to a cleaner way to do it.

c42f commented 4 years ago

If we're going for structured concurrency as a possible way to handle task lifetimes, I think it's worth exploring the full parameter space of new ways to do it instead of specifically locking onto Trio's way of doing it.

Absolutely agreed, and thanks for the link. I think @andyferris was also encouraging me to look into effect systems but I'd largely forgotten about that good advice (sorry Andy :grimacing:). Reading the link, it appears that the Julia logging system is already implementing an effects system with the exact same nested task-local semantics you'd get from a real effects system! But without language support for effects it's more clunky than you'd like. In particular it's much more difficult to define new log handlers than it should be. I think this is because the logging system actually uses two different effects in concert corresponding to the shouldlog and handle_message functions.

These ideas also seem somewhat related to an idea I was tossing around for unifying exceptions with error codes via pattern matching (discussed at the bottom of the https://github.com/JuliaLang/julia/issues/7026 thread).

In all, very interesting; thank you!

tkf commented 4 years ago

Thank you @saolof, I was very vaguely aware of algebraic effect and the blog post was a nice introduction. I can imagine algebraic effect would be useful for other things like eliminating dynamic dispatch of log handlers:

(function (logger)
    try
        f()
    handle log :: LogEvent
        resume with handle(logger, log)
    end
end)(current_logger())

I agree with @c42f that it interacts a lot with the "chain of custody" proposal #7026. Also, I think naively applying this to I/O handler would give us "colored functions." That is to say, all the intermediate call chain has to "chain" that it would perform I/O.

But, IIUC, algebraic effect is mostly orthogonal to the discussion of structured concurrency. I think it's possible to have or not have structured concurrency with algebraic effect. That is to say, enforcing black box rule is equivalent to disallowing "global effect handler"; i.e., disallow spawn when there is no associated try-handle block (= nursery).


Speaking of colored function, I'm not feeling the async pressure | Armin Ronacher's Thoughts and Writings (Hacker News) presents yet another compelling reason to avoid introducing the colors. This post is about the importance of backpressure and how easy it is to shoot yourself in the foot with async programming. I think the part that is relevant to this issue is that it impossible to fix API in a backward compatible manner when you have colored functions and fixing it requires changing a sync function an async one (see the discussion around await writer.drain()).

c42f commented 4 years ago

For example, I think the biggest new trend in handling colored async code is algebraic effects (see https://overreacted.io/algebraic-effects-for-the-rest-of-us/ for an example of how they'd look in a dynamic language), where the task spawning is handled by an effect handler.

So I've done a bit more thinking/reading about this, and I can see how algebraic effects allow the user to define their own schedulers for tasks spawned in a given dynamic scope (see for example the start of this talk: https://www.janestreet.com/tech-talks/effective-programming/). That's extremely cool and interesting.

But I think @tkf has a good point:

But, IIUC, algebraic effect is mostly orthogonal to the discussion of structured concurrency. I think it's possible to have or not have structured concurrency with algebraic effect.

Yes, it's also unclear to me how algebraic effects ensure the key property of structured concurrency: control flow is naturally delimited by the local structure of the source code (aka the black-box rule). Rather, it seems that task lifetime when implemented with algebraic effects is controlled by the enclosing spawn-handler. The handler may well be near the root of the call tree which brings us back to mostly unstructured concurrency.

That is to say, enforcing black box rule is equivalent to disallowing "global effect handler"; i.e., disallow spawn when there is no associated try-handle block (= nursery).

I think the black box rule is much stricter than this: it requires that task lifetime is lexically rather than dynamically scoped?

tkf commented 4 years ago

it requires that task lifetime is lexically rather than dynamically scoped?

Ah, yes, you are right.

saolof commented 4 years ago

...So it ends up being a matter of lexical vs dynamic scoping. We already have some examples of dynamic control flow for concurrency with the async & sync macros, except that currently writing an async macro without an enclosing sync macro does not propagate an exception to the origin of the stack.

Handling concurrency with dynamic scoping has a few advantages: 1) It ensures that all tasks have a parent to propagate exceptions to at the moment of their creation (and fails early otherwise), which gives an easy way to cancel subtasks by throwing a cancellation exception at a yield point like in Trio. It makes ^C work as expected. It provides the expressiveness benefits of SC and a short happy eyeballs implementation.

2) Unlike lexically scoped SC, It allows higher order functions to have both synchronous and asynchronous functions passed to them without it having to know whether it is being used in a synchronous or asynchronous context. Higher order functions which are pure except for calls to their arguments are effect-polymorphic by default. Which means that you completely avoid the issue mentioned in what color is your function?

So it does have the advantage of being very expressive and minimizing verbosity.

With that said, when you consider interactions with lexical closures and callbacks, purely lexical scoping for trio-style nurseries does introduce the possibility of having two nurseries whose lifetimes overlap without one fully enclosing the other, which I don't think you can do with lexical scoping.

tkf commented 4 years ago

Handling concurrency with dynamic scoping has a few advantages:

  1. It ensures that all tasks have a parent to propagate exceptions to at the moment of their creation (and fails early otherwise), which gives an easy way to cancel subtasks by throwing a cancellation exception at a yield point like in Trio.

This is an advantage over unstructured concurrency and not over structured concurrency, right?

Also, I think dynamically scoped task group (= nursery) has a major disadvantage in that there is no way for a caller to tell if a function can spawn tasks that may outlive the call (unless it is combined with chain of custody). On the other hand, with lexically scoped task group you are forced to know that the function may spawn tasks as otherwise you can't call it.

2. Unlike lexically scoped SC, It allows higher order functions to have both synchronous and asynchronous functions passed to them without it having to know whether it is being used in a synchronous or asynchronous context.

It's possible to have structured concurrency without colored function. Kotlin and Go with context and errgroup do it. So, I don't think this is a valid advantage of dynamically scoped task group over lexically scoped one.

saolof commented 4 years ago

Also, I think dynamically scoped task group (= nursery) has a major disadvantage in that there is no way for a caller to tell if a function can spawn tasks that may outlive the call (unless it is combined with chain of custody). On the other hand, with lexically scoped task group you are forced to know that the function may spawn tasks as otherwise you can't call it.

This is exactly what is meant by function color. It's quite desirable to have calls to pure (sync) and effectful (async) functions look exactly similar so that you can write higher order functions once instead of having a thousand nearly-identical dispatches which are identical up to nursery/scope arguments depending on whether you are passing a sync or async function to it. With language support the compiler could still infer effect signatures for type stable code just like it can infer exception signatures, since effects are basically resumable exceptions.

As for dynamic vs lexical, I think it's largely a question of preference and of what will be the least surprising to users. Lexical has some nontrivial edge cases since you can end up passing around a closure over a nursery whose scope might have expired. If you want your cancellation mechanism to be done with (dynamically resolved) cancellation exceptions thrown from yield points, if that closure is called from a different coroutine you can end up with a cancellation exception being thrown in a task that isn't a subtask of the nursery that expired.

tkf commented 4 years ago

This is exactly what is meant by function color.

By different color I thought it usually means if the function yields or not (async or sync). Spawn or not is another property. But I get that function color can be generalized to it. That's why you can "fix" it by chain of custody.

It's quite desirable to have calls to pure (sync) and effectful (async) functions look exactly similar so that you can write higher order functions once instead of having a thousand nearly-identical dispatches which are identical up to nursery/scope arguments depending on whether you are passing a sync or async function to it.

I don't think this is a problem since we can just use closures. Something like

withtaskgroup() do nursery
    foldl((a, b) -> f(nursery, a, b), xs; init=x0)
end

Lexical has some nontrivial edge cases since you can end up passing around a closure over a nursery whose scope might have expired.

You can make it hard to misuse. For example, one idea may be to avoid referring to the task group context/nursery explicitly like using Awaits.@go instead of withtaskgroup above. Also, compared to this edge case, I think black box rule is a fundamental property; we all assume it in synchronous programming.

I understand dynamically scoped task group can get rid of this edge case. However, as you said, this is at the cost of not being able to refer to multiple task groups. This property is actually very useful. For example, this is required for implementing early deterministic termination in Transducers.reduce. You can see that I am accessing multiple nurseries in this prototype implemented using Trio.

tkf commented 4 years ago

@saolof @c42f you may be interested in: RFC: Make an abstraction for custom task parallel schedulers (GSoC proposal) - Internals & Design - JuliaLang

c42f commented 4 years ago

So it ends up being a matter of lexical vs dynamic scoping.

Yes, I think this is accurate. SC needs lexical scoping to ensure the black box rule.

This is exactly what is meant by function color. It's quite desirable to have calls to pure (sync) and effectful (async) functions look exactly similar so that you can write higher order functions once instead of having a thousand nearly-identical dispatches which are identical up to nursery/scope arguments depending on whether you are passing a sync or async function to it.

Terminology-wise I've been thinking of "colored functions" as generic terminology for splitting the world of functions into disjoint sets based on any property or properties (async-related or not). That is, in the graph coloring sense where the call graph obeys certain constraints based on the color of functions within it. Perhaps this is not how other people use it though.

At the moment Julia does not have colored functions at the language level; instead it's similar to Go: any function can arbitrarily call @async. This is helpful for composability but it has various downsides which the Go community seem to be addressing by explicit context passing. That solves some problems but passing contexts around gives you colored functions again by virtue of the calling convention. With extra boilerplate :grimacing:

c42f commented 4 years ago

An interesting use case of unstructured async is waiting for data to come back from a remote call. For example, Signals.jl uses this to rather neatly hide some detail in async_remote.

https://github.com/TsurHerman/Signals.jl/blob/master/src/async_remote.jl

Of course, this exact API isn't possible with structured concurrency. At the very least you'd have to pass a nursery though I wonder whether there's a more natural equivalent in structured concurrency land. For that matter, what do actors look like with structured concurrency? It turns out that someone's already trying to answer some of these questions; here's an actor library built on top of trio:

https://github.com/goodboy/tractor

At a quick browse it looks to work out quite naturally. Also their todo's include "support for reactive programming primitives" so some kind of signals may make an appearance.

tkf commented 4 years ago

Isn't this hard to do in structured concurrency because it can hide exceptions? I'm not familiar with reactive programming, but if a value that causes an exception is pushed but never pulled by a downstream, I suppose the exception is not going to be reported anywhere?

c42f commented 4 years ago

Right, I think reactive programming is an interesting case because it's not clear how it relates to structured concurrency and it seems like there might be a mismatch. But after a bit of thought I've got a feeling that it's a superficial mismatch.

With an async reactive system you've got an event loop running somewhere and this loop would need to be explicitly started (hence provide a place to propagate unhanded exceptions, and to cancel the event loop). On the other hand, for handled exceptions within signal propagation it seems to be an API choice how the signals library handles those. They could be dropped, propagated through the signal graph as error values, or something else. Those choices are available regardless of having structured concurrency or not.

tkf commented 4 years ago

this loop would need to be explicitly started

Yeah, I think that's the only place structured concurrency makes it "harder." ATM you can do this

function __init__()
    @async eventloop()
end

https://github.com/TsurHerman/Signals.jl/blob/b377eaedcd007fc5d56abe73b242e014cc330936/src/eventloop.jl#L13-L15

If we have structured concurrency you need to do something like this

witheventloop() do
    # code using Signals.jl here
end
# (maybe unhandled errors thrown here)
AriMKatz commented 4 years ago

Here's the swift concurrency manifesto. They are going with some variant of actors + other stuff. https://gist.github.com/lattner/31ed37682ef1576b16bca1432ea9f782

Great design discussion and review of other models from go to akka etc

AriMKatz commented 4 years ago

And here's an actors implementation by @richiejp https://github.com/richiejp/Actors.jl

StefanKarpinski commented 4 years ago

The Swift manifesto is interesting but doesn't seem to address structured concurrency at all.

AriMKatz commented 4 years ago

Sorry for the noise, I just skimmed/ pattern matched and assumed it was exhaustive.

StefanKarpinski commented 4 years ago

Thanks for bringing it to my attention in any case. I wouldn't expect structured concurrency in there—Swift is still addressing their basic concurrency design, they haven't gotten around to structured concurrency yet, although it would probably behoove them to consider it sooner rather than later.

tkf commented 4 years ago

I just noticed this, but it looks like Java (Loom) is going have structured concurrency (and nestable effect handler?)

Why Continuations are Coming to Java - YouTube

StefanKarpinski commented 4 years ago

Cool. We shouldn't let Java beat us to this :)

JeffBezanson commented 4 years ago

OK, I bit the bullet and clicked through to the 45-minute video. He clarifies that it is about one-shot continuations, which we already have via Tasks. Continuations are even more flexible than tasks, so farther away from structured concurrency along that spectrum. His view seemed to be that structured concurrency primitives are good but orthogonal to the existence of Tasks themselves, which I agree with --- i.e. @sync and schedule both exist and you use them as desired.

tkf commented 4 years ago

the 45-minute video

Do you mean the Java one? I don't think it's a good introduction video to structured concurrency. It just briefly mentions structured concurrency. I think a good introduction would be Nathaniel J. Smith - Trio: Async concurrency for mere mortals - PyCon 2018 - YouTube.

(There are a bunch of links in Structured concurrency resources - Structured concurrency - Trio forum and @c42f's Julep. Though I don't know what is the most concise introduction.)

@sync and schedule both exist and you use them as desired

It was a bit of exaggeration when I said "Java (Loom) is going have structured concurrency." It's not very clear from the talk if they are going to introduce structured concurrency in a strict form; i.e., disallowing @spawn/@async/schedule without @sync.

To me it seems like we have "goto" now, and might add "while" loops in the future. What's the problem?

--- https://github.com/JuliaLang/julia/pull/35686#issuecomment-622650037

The problem is that, as soon as you have the "goto" that can cross function boundaries (so not C-like goto; more like longjmp, I guess?), function call can't be used as the unit of reasoning for the flow of the program. It can return, throw, or continue execution anywhere. Furthermore the mere existence of such API is enough to break the abstraction. As soon as it's provided by the language, you can't expect a function call to return (or throw) when it finishes its job.

It's the same for structured concurrency. As soon as @sync-less @spawn/@async/schedule is possible (i.e., the language violates "black box rule"), you can't rely on the function call to be the unit of reasoning of the program.

We've had Tasks since v0.1, so if you need structured concurrency before Tasks we are definitely a bit late.

--- https://github.com/JuliaLang/julia/pull/35686#issuecomment-622653980

Yeah, I agree that it's impossible to have strict structured concurrency in 1.x.

My rant https://github.com/JuliaLang/julia/pull/35686#issuecomment-622605368 was that stabilizing @spawn without structured concurrency introduces yet another API to be deprecated before getting to strict structured concurrency. @spawn is also arguably the most attractive API for starting a task because it let people use parallelism. So I thought there was a good chance that it attracts people to start using structured concurrency if (non-strict) structured concurrency was introduced before stabilizing @spawn.

tkf commented 4 years ago

BTW, I think the main primitive we need for implementing structured concurrency is something like schedule(task, exception; error=true) (or more like schedule_soon?) that works even when the caller does not own task. By "it works" I mean in the sense it doesn't break the state of the scheduler. I think it should be allowed to break user code that is not exception-safe even when it is doing I/O. How hard is to implement?

We also need to embed cancellation points in the synchronization APIs like locks and channels but I think it'd be less hard.

@c42f What do you think? What is the main primitives we need for implementing cancellation?

JeffBezanson commented 4 years ago

Do you mean the Java one? I don't think it's a good introduction video to structured concurrency. It just briefly mentions structured concurrency. I think a good introduction would be Nathaniel J. Smith - Trio: Async concurrency for mere mortals - PyCon 2018 - YouTube.

Yes, I read the blog post, watched the video, bought the t-shirt. Look, "goto considered harmful" was 1960s clickbait. I also disagree with the analogy to "goto". The problem with goto is that it violates the procedural abstraction: naming a textual line in a program is not sufficient to describe a control flow state when there is a stack. Continuations fix that problem by wrapping a whole control state (whatever it is) in a first class object. I don't consider it entirely accurate to say that continuations are an abstraction violation.

The dichotomy between structured and not-structured concurrency is more illusory than first advertised. As has been discussed here, passing around nurseries gives you basically unrestricted dynamic task spawning again.

Anyway, I'm in favor of adding nurseries in some way, and cancellation. It should be possible to put a task in a state such that it will throw an exception at the next cancel point. From what I can tell that's basically what C# does (and others)? But of course, you might never reach a cancel point, and even if you do the exception might be caught. That's ok with me, I think it's unavoidable, just another reason this is not the same as "delete goto and everything is fixed".

JeffBezanson commented 4 years ago

It strikes me that many of the things said about tasks here could also be said about just closures. They allow variable bindings and their values to outlive a function invocation, and for part of the function to be "re-entered" after it has returned, but nobody thinks that's an abstraction violation.