crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

[RFC] Fiber preemption, blocking calls and other concurrency issues #1454

Open technorama opened 9 years ago

technorama commented 9 years ago
waj commented 9 years ago

> Are there any plans for fiber preemption?

Not at the moment. We first plan to support fiber concurrency, and after that we can figure out whether we really need this feature. It's quite a difficult feature to implement, and any ideas are welcome.

Go is a language that inspires us in many areas, and goroutines are not fully preemptive: https://github.com/golang/go/issues/11462

> Certain system calls block forever and don't work well with fibers

We can choose to use other system calls that do not block. Currently most IO runs through the event loop (and you did help a lot with this :-) ). waitpid is one example we need to fix. Do you have any others?

> Library calls have the same issue

Yes, and if there is an alternative to a library call that can work asynchronously or with a callback, it should be used instead.

> Libraries may make use of thread local data and have no provision for fiber local data

If a library makes use of thread-local data, it will be hard to use in an environment where you don't control where the fiber is executed (and resumed, if preempted). We might think about pinning fibers to threads or making manual threads available for these cases. Do you have a concrete example of a library that relies on thread-local data between calls? Not every library will be suitable for use with Crystal anyway, and I don't think we need to design the language around that.

> Using pthread mutexes blocks the thread running the fiber without the ability to swap contexts.

Yes, don't use them. You can use channels to communicate between fibers instead. Channels will support calls from multiple threads when threads are implemented in the language. We might implement mutexes suitable for fibers in the future.
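
A minimal sketch of the channel approach mentioned above, for illustration only (it is not from the original comment). It assumes nothing beyond the standard `Channel` and `spawn`: worker fibers hand their results to the main fiber over a channel instead of sharing state behind a mutex.

```crystal
# Fibers hand results over a Channel instead of guarding shared state
# with a pthread mutex.
results = Channel(Int32).new

3.times do |i|
  spawn do
    # `send` suspends only this fiber until a receiver is ready;
    # the thread stays free to run other fibers.
    results.send(i * i)
  end
end

# `receive` suspends the main fiber (not the thread) until a value arrives.
squares = Array.new(3) { results.receive }

# Arrival order is not assumed, hence the sort.
puts squares.sort
```

Because `send` and `receive` are suspension points for the scheduler, a fiber waiting on a channel never holds the thread hostage the way a blocked pthread mutex would.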

technorama commented 9 years ago

You could attempt erlang style preemption. All operations have a cost and fibers have priorities.

> Any operation in the system costs reductions. This includes function calls in loops, calling built-in-functions (BIFs), garbage collecting heaps of that process[n1], storing/reading from ETS, sending messages (The size of the recipients mailbox counts, large mailboxes are more expensive to send to). This is quite pervasive, by the way. The Erlang regular expression library has been modified and instrumented even if it is written in C code. So when you have a long-running regular expression, you will be counted against it and preempted several times while it runs. Ports as well! Doing I/O on a port costs reductions, sending distributed messages has a cost, and so on. Much time has been spent to ensure that any kind of progress in the system has a reduction cost[n2].

Source: http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-scheduling.html

technorama commented 9 years ago

I fixed the waitpid issue in #1295. It just needs to be merged.

Pthread mutexes are used in MANY libraries. I don't think you can avoid them completely. If a library takes a mutex and then invokes a callback while still holding it, and the fiber context is switched inside that callback, the program may deadlock when another fiber calls into the same library.

Library calls may take an unknown amount of time to return, with no possibility of preemption. There may not be callbacks in the API. More threads is the general solution. Google is working on an application-controlled context-switching thread solution for Linux. I can't find a link. It's similar to Windows user-mode threading and gives fiber-like context switch performance without the drawbacks of fibers.

technorama commented 9 years ago

LibC.errno is an example of thread-local data that can be clobbered after a fiber context switch. With preemption, some method of tracking thread-local data would be needed. Possible solutions include automatically converting it to fiber-local data, automatically identifying thread-local variable usage, or anti-preemption directives.
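
To make the errno hazard concrete, here is a small illustrative sketch (not from the original comment). It assumes a POSIX target and a current Crystal standard library, where `Errno.value` reads the calling thread's errno. Closing descriptor `-1` is only a convenient way to force a failure.

```crystal
# Deliberately fail a libc call: closing an invalid descriptor sets errno.
ret = LibC.close(-1)

# Copy errno into a local right away, before anything that could switch
# fibers (IO, sleep, Channel operations, Fiber.yield).
saved = Errno.value

# Another fiber could run here and make libc calls of its own, which may
# overwrite the thread's errno. The local copy is unaffected.
Fiber.yield

puts "close returned #{ret}, saved errno: #{saved}"
```

With cooperative scheduling this discipline is enough, because a switch can only happen at known points. With preemption a switch could land between the libc call and the read, which is why the comment above asks for some form of tracking.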

technorama commented 9 years ago

read() currently blocks on regular files, with no portable way to work around it: it returns only when the IO system has data to return, and setting O_NONBLOCK doesn't help. Similar issues exist with mmap, where the blocking operation doesn't look like a system call at all, just a memory access.

waj commented 9 years ago

Preemption of Erlang processes is different because it occurs inside a VM. Even Erlang is affected by long native library calls, and they are discouraged because they affect the scheduler. Libraries like PCRE have been adapted to cooperate with the scheduler (https://github.com/erlang/otp/tree/maint/erts/emulator/pcre), but this doesn't hold for any random library that you want to use in your program.

I didn't see your waitpid fix yet. I'll review it soon. Thanks!

I think this is the link that you're looking for? https://www.youtube.com/watch?v=KXuZi9aeGTw

I'm probably more inclined to the Go style, which doesn't fully preempt at any random place. Instead, it leaves the long call running while it reschedules the runnable coroutines on a new thread. That might make the design a lot simpler while avoiding preemption locking or thread pinning. Not a final choice here... but I'd consider that option when the time to work on the multithreaded scheduler finally arrives.

technorama commented 9 years ago

A VM isn't necessary to account for runtime or operations performed. Count the number of primitive operations (or estimate them) and accumulate them in a fiber-local variable. Add preempt checks inside loops and at function boundaries. External C calls could be timed, and fast ones annotated with a fixed cost. This could put Crystal ahead of Go for soft real-time computing.
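
A rough sketch of what such accounting could look like if done by hand, for illustration only. `ReductionBudget` is a hypothetical helper, not part of Crystal's standard library, and a compiler-inserted check would look different. The sketch assumes the default single-threaded scheduler, where `Fiber.yield` hands control to the next runnable fiber.

```crystal
# Hypothetical helper: NOT part of Crystal's standard library.
class ReductionBudget
  getter yields = 0
  @remaining : Int32

  def initialize(@limit : Int32)
    @remaining = @limit
  end

  # Charge `cost` units of work. Once the budget is spent, let other
  # runnable fibers go first and start a fresh budget.
  def charge(cost : Int32 = 1) : Nil
    @remaining -= cost
    return if @remaining > 0
    @remaining = @limit
    @yields += 1
    Fiber.yield
  end
end

log = [] of String
done = Channel(Nil).new

spawn do
  budget = ReductionBudget.new(1_000)
  # CPU-bound loop with one accounting check per iteration.
  10_000.times { budget.charge }
  log << "long task finished"
  done.send(nil)
end

spawn do
  log << "short task finished"
  done.send(nil)
end

2.times { done.receive }
puts log
```

Without the `charge` calls, the long task would run to completion before the short one ever started. With them, the short task gets its turn at the first exhausted budget. The per-iteration decrement and branch is the overhead such a scheme would have to justify.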

That's the video.

The Go GC thread sets a preempt atomic every 100us. I'm not sure where that's checked currently, but it used to be checked on memory allocation or channel calls. I think it's checked in more places now.

waj commented 9 years ago

But... I'm confused. Accounting is not enough to avoid blocking system or library calls. And all that accounting I think will add some overhead that I'm not sure we want to pay. We need to think about all these things more deeply.

technorama commented 9 years ago

Different topics.

Accounting avoids starvation in loop { } and reduces latency for short-running tasks when long tasks are mixed in (for example, quick HTTP::Server requests mixed with CPU-intensive requests).

For blocking system/library calls there is little that can be done other than let them run on threads.

choleraehyq commented 7 years ago

Goroutines can be preempted at any non-inlined function call, and in the future tight loops will also be preemptible via a counter check that the compiler inserts into each loop. See https://github.com/golang/go/issues/10958

I'm a newbie to Crystal. When can a fiber be yielded automatically? Only when it blocks on IO or sleeps?

akzhan commented 7 years ago

@choleraehyq

https://github.com/golang/go/issues/10958 looks like a bad proposal unless the compiler gets hints, because of the performance cost. Ideally the compiler would need to know the duration of every function call. Here are some older benchmarks from 2016: https://github.com/golang/go/issues/10958#issuecomment-261388774

So it would be better to allow an explicit inline call such as yield fiber, and maybe something like a [@Preempt] attribute.

sdogruyol commented 7 years ago

I'd really like to have some fruitful discussion here to learn more from awesome people 😄

atlantis commented 1 year ago

> Certain system calls block forever and don't work well with fibers
>
> We can choose to use other system calls that do not block. Currently most IO is running through the event loop (and you did help a lot with this :-) ). waitpid is one example we need to fix. Do you have any others?
>
> Library calls have the same issue
>
> Yes, and if there is an alternative of a library call than can work asynchronously or with callback it should be used instead.

Leaving this here for posterity: I was just bitten by an IoT program stalling randomly for several seconds as soon as it lost internet and its normal MQTT connection. I did all the MQTT handling on a separate fiber, but the MQTT library uses TCPSocket, and the culprit apparently was that when the internet connection goes away, the reconnect code calls LibC.getaddrinfo, which blocks the main thread. Hence the entire Crystal program hangs until LibC.getaddrinfo decides to time out (see https://forum.crystal-lang.org/t/400-cpu-usage-at-idle-thread-new-plus-channel-receive/5423, https://github.com/crystal-lang/crystal/issues/8376, etc).

So please add LibC.getaddrinfo to your list of calls that need to be made non-blocking. As always, thanks for a great language!

bararchy commented 1 year ago

@atlantis nice point :eyes: I wonder if this should be handled sooner rather than later, as this might be an easy fix that will save a lot of headaches down the road

atlantis commented 1 year ago

@bararchy My temporary fix was to use 636f7374/durian.cr and the following monkey patch to fix HTTP::Client by default (which also fixes Crest), but this seems pretty fragile, and it would be great if it worked out of the box someday!


```crystal
class HTTP::Client
  # This is a method that Durian adds to HTTP::Client
  def dns_resolver
    # DNSHack.resolver is a Durian resolver object I construct when starting up
    @dnsResolver || DNSHack.resolver
  end
end
```

straight-shoota commented 1 month ago

This is an interesting article about how Golang implements fiber preemption:

https://unskilled.blog/posts/preemption-in-go-an-introduction/