[RFC] Concurrency, syntax and GC

ozra commented 9 years ago

This is a kind of open ended issue, touching on the subject of concurrency implementation and GC's, just felt it could be good food for thought to not get locked in a path prematurely.

At writing moment, there is just single thread concurrency in Crystal. Of course we all want multi thread. And of course we want things to be neat to use (also now).

The inspiration at the moment seems to primarily be Go for these areas.

As the prospect of a future Crystal specific precise GC (eventually we'll all want that - right?) might affect how concurrency is handled (memory passing wise), I'll just drop some thoughts.

There is some criticism on Go's way of using channels - but they might mainly be because of lack of generics. In any event - before locking things down in syntax perhaps some other angles of attack should (also) be considered.
Nim has a GC for each thread - a different model where no "stop the world" ever occurs, this requires "nothing shared" policy.
Go's latest GC effort uses multiple mark/scan threads to handle one GC - there is "stop the world" - but they're reducing the latency at expense of larger memory consumption ("the GC knob" setting).

That's all folks.

Issues somewhat related: #1454, #688.

refi64 commented 9 years ago

Felix's GC is also rather interesting, since it interoperates surprisingly well with other languages. I believe it is multithreaded and uses a stop-the-world mechanism.

ozra commented 9 years ago

@kirbyfan64 - any link? I'm curious about these things :)

refi64 commented 9 years ago

@ozra http://felix-lang.org/share/src/packages/gc.fdoc

Felix uses a garbage collector to provide secure automatic memory management.

For reasons of C and C++ compatibility, the allocator used is just malloc. Similarly, to ensure C and C++ interoperability, Felix GC uses a naive mark and sweep algorithm instead of a faster and more modern copying collector. C/C++ objects can't be moved easily and read or write barriers cannot be easily implemented.

The collector depends heavily on the Judy1 and JudyL data structures. These are ultra-high performance stores which provide fast, scalable, O(1) linear scanning both up and down, as well as random access, deletion, and insertion. Judy's key disadvantage is that it only works with machine word size keys and (for JudyL) machine word size data.

The Felix collector is a hybrid. Heap allocated objects are associated with a shape object which provides a precise map of the location of all pointers in the object, accelerating the scan for reachable objects and reducing the number of unreachable objects that are not collected. On the other hand the machine stack of each thread is scanned conservatively, since Felix has no precise information on stack allocated data.

The GC can handle interior pointers. These are pointers to an address past the starting point of an object. However Felix does not consider a pointer "one past the end" to be interior. Care must be taken not to increment pointers into arrays past the last element. This is a deliberate design choice: a past the end pointer might be the head of a consecutively allocated object. Without this constraint an allocator might be forced to introduce padding.

The GC can also handle C pointers mixed in with Felix pointers, in other words, where an object has a designated slot for a pointer, either a Felix or C pointer may be used. It can do this because it keeps track of all objects it allocated.

The GC also assumes all objects are arrays of fixed maximum extent. It tracks the maximum number of elements the array might contain as well as the actual number used. JudyL arrays map addresses to the base shape, the array bound, and the actual usage, with missing keys designating 1 element in the last two cases.

The GC also supports finalisation. By default the finaliser of a C++ object is a function calling the object's type's destructor. This means a complex data structure such as an STL map can be used in Felix, provided the value type is does not contain GC managed pointers. The finaliser calls the destructor to release storage, relieving the GC of the task of tracking the individual objects comprising the data structure.

Thus, the Felix GC is more efficient that one might expect because it does not need to track every allocation, nor scan every object. In addition, the programmer is free to use C style manual allocation and deallocation for many data types. Never-the-less the GC is required for some union types and some function and procedure closures. It is also useful for many function data structures including persistent purely functional lists.

We should note that the Felix GC does have one significant drawback: although it is thread safe, and performance suffers a bit from serialisation of allocations, collections require a voluntary world stop by all threads. There is no portable way to stop threads at arbitrary times, so if one thread requests a collection, all the threads must wait until the last one yields to the stop request, and then the collection is performed.

sdogruyol commented 9 years ago

This thread is gonna be interesting :+1:

ozra commented 9 years ago

Well Felix GC didn't inspire too much ;-)

What can be leveraged in Crystal is of course:

Types are known, so type maps are "already available" (specific scanning of just pointers)
All "hidden" pointers ("references" that is) are known to point to beginnings of objects, so there too cycles can be shaved, not having to worry about if it's "internal pointing".
When using pointers for hard core performance work, a specific BoundPointer type could be used (so GC can skip checking "internal pointing" and just look at the bound-obj-start-ptr in it (when working with slices and iterating in a method or so). These could also be bounds asserted in dev mode as a bonus. (this requires a pragma to tell crystal that GC should ignore a specific pointer member)
A specific "raw allocator" ("malloc") could benefit a specific GC, rather than relying on clib malloc.
And then the real reason for the issue: depending on if thread-specific GC-heaps or global is used, and how thread synchronization is done when time to stop the world (if the global approach is chosen) - well - this can possibly be tied together tighter too when designing it holistically.

If picking an outsider GC the advantages are harder to gain (generating intermediate type maps, or just brute force scanning, like I presume boehm does now)

But well, above scanning nitpicking (assuming a non reference counting GC) aside, the important thing really comes down to the concurrency issues.

stakach commented 8 years ago

I have been toying around with cross platform concurrency on Ruby for awhile now with https://github.com/cotag/libuv which primarily wraps all IO in promises and supports multiple event loops.

I think there are some really powerful things we could do with IO / threading on crystal.

Dedicated IO threads
- Spin up a maximum of one per-core
- Lazy load them with basic thresholds
- Shared nothing
Have seperate threads for processing
- Internally managed thread pool (can grow and shrink as required)
- The initial / main thread is the first of these
- Optionally allow users to execute code on their own dedicated threads
Promises and Futures for coordination
- So threads are transparent to the users
- Code is automatically run in parallel
- Heavy use of fibers

Effective this would allow you to write code like this:


stream = IO.new
stream.write 'hello'   # => Promise
# Writes are performed on an IO thread. The write itself is scheduled on that thread
# We can use Fibers to prevent further execution until the write has been buffered (not completed) so there is no concurrent access of the variable.
# Reference counting may provide a way to avoid a fiber yield here if this is the last reference to the variable or if the data is an object literal or constant etc
stream.write 'world'  # => Promise
resp = stream.recv  # => Promise
output = resp.value
# ---- Call to value is a future. Fiber yields here until response is received
# Once response is received we continue processing on the same thread
puts output

Further to this, if you would like to use multiple threads for processing


stream = IO.new #=> Promise
# stream itself is a promise, so easy to detect stream closure.
# write will yield the fiber until the stream is ready. We can provide on_ready callbacks too
stream.write('hello').then(:concurrently) do
    # Anything in a callback block could optionally be executed on a different thread
    # (Idle threads watching a work queue)
    # This call to write would only occur after the first write had completed
    stream.write 'world'  # => Promise
end
resp = stream.recv  # => Promise
output = resp.value
# The existing thread 
puts output

Much like Go Lang or .NET async, it would be easy to provide concurrent execution options:


def some_func(arg1)
    'return value'
end

# co keyword, for concurrently, to execute function on another thread
result = co some_func(arg1) # => Promise
puts 'no stopping me'
puts result.value

This approach takes the best of Node.JS and GO Lang and combines them with IO that never blocks. I'd be happy to start implementing this, I assume there is nothing preventing multiple threads executing crystal code? (I can disable the GC for now if it is an issue)

What are peoples thoughts on the syntax?

Also no changes would be required to the language internals. We can do this with C bindings and a few classes.

omninonsense commented 8 years ago

+1 for libuv, but I'm not sure I am a fan of extra syntax. Then again co could maybe be a macro, similar to the current concurrency macros.

waj commented 8 years ago

Are you guys completely aware of the current concurrency and IO model? Because Crystal already works with non blocking IO and schedules fibers. Working with callbacks and promises for IO is not the approach I like for this language.

What's so good with libuv? In fact I did the first implementation with it but then I moved to libevent because it fits better with the IO model based on coroutines instead of callbacks.

ozra commented 8 years ago

My know how in concurrency is really rudimentary, and I think I'm not the only one, so my partly stupid statements here might inspire someone to enlighten me.

Note: you do well in skipping the tl;dr "my limited experience" section

My limited experience for perspective

If someone feels like enlightening me

The experience I have with concurrency is node.js since 2010, and C++ over processes. In 2010 node.js there weren't many libs, so I just worked with what came up in my head. A pattern that I came to use almost everywhere was what I called the Reaper. You simple wrap the callbacks for tasks you want to "collect" with a call, and it kept reference count on them and fired "ready" when all where done. Then there were lots of crashes at a point. Turned out that teams worked at the same time in sort of "team work sessions" with specific things like uploading media (photos, mp3's and movies). I used a custom built mplayer and other tools to convert these to standardized bitrates and sizes. When the teams did this at the same time, the VPS ran out of memory, duh. So I thought of a funnel you pour lemonade into bottles with. I made a "Funnel" (which I've learned later exists as "Throttle" in established pattern terminology). Those two patterns did basically all the async-work in the entire node-app. Sequential async was just callback-stacks (with LiveScript back-calls it looks very synchronous). So that's my back-turned-on-the-world experience there.

The other side was a heavy computation system in C++. There's one data-samples collector, it writes the data packed via fix-pointing, then delta valuing, and finally variable length encoding and writing into fixed-element-size "chapters", to provide random access entry points at reasonable intervals. That is one producer. Then all the analytics artillery mmaped this data (64-bit arch was required for the system), read and decoded in to circular buffes on a "circular heap", synchronized for all data sources within each process. So there was one process for each avail core on the machine running at the same time. And of course also more machines, the data was replicated from the encoded buf. Synchronization per machine, was done simply through a "available-elements-counter" which was updated per "chapter" (instruction)atomically and fenced as new data had been written. (There where also read-ahead threads that simply stayed ahead of computations needs to make the kernel page in data earlier to avoid waiting for disk-snailing - hard to tell if there really was a benefit to this. Hard to benchmark it all.)

So all I know of concurrency is unfortunately my own constructs, which very likely overlaps with existing ones. (Reinventing the wheel seems to be my speciality).

Now

Most concurrency in most apps is just IO. And naturally taking one item of the grocery list, hopping on the bus, asking the guy at the counter for it, he getting it, returning home, then taking the next item, is a ridiculous waste of time. So of course Fibers are a super solution to be able to give the OS as many of hundreds of file-requests as possible so that it can order them well to the guy at the counter (storage) and that dude can swing his arms around the shelves for all items in the optimal order (hard-drive arm). Yes, I know I sound rudimentary stupid here, but, that's the point: I am :)

So, all I'm aiming at here is: those of us with rudimentary concurrency / parallelization skills would of course want a clean, uniform, well agreeing model that forms the foundation of all stdlib concurrency in Crystal (duh, again).

The Fibers, Threads, Scheduling and GC team

It seems to me libuv performance wise is superior to libevent2. The downsides are that not all of pthread API is available, the upside is that the reason for this is that it works optimally under Linux, Windows, Mac, etc. - a range of systems. It's actively maintained and developed, and it's in C.

What I'd like to see in Crystal is that all parts are orchestrated together so well that - I think optimally - a GC for each thread (the never-stop-the-world Nim model) could be integrated. It seems to me share-nothing would facilitate this well.

For solutions requiring parallel sharing (huge data) - highly specialized synchronization for the domain / algo could be used, for instance with above mentioned mmap'ing.

You guys (and gals of course) who know your stuff about this...

I'd love to learn a bit more. And hope my stupid questions / statements can be of some use. Of the top off my head:

How transparent can it be made? Will the multiple event loops be handled without any active consideration without clashes and locks?
Can one use blocking IO, and it just automatically schedules the next Fiber without special notation, continuing when the data is there (remember, I've worked with callbacks only, in node.js)
Can the condition variable aspects of libuv be used to facilitate cross-thread-calling in a transparent manner?
Will GC pressure be an issue for promises? (no matter what GC model is chosen in the future). And won't promises require "wordier code" that could be avoided for a more sync-style looking code?
Are these questions to stupid to be asked? B-)

stakach commented 8 years ago

What I'm proposing is to expose promises as an option. They are very powerful coordination constructs. They wouldn't have to be used and for the most part serial looking code with fibers can be used - this code is very easy to facilitate using promises.

Libuv make sense as it covers so many more OS primitives in a cross platform way. It has a lot of community involvement and is very consistent. TCP, UDP, Filesystem, Timers, Pipes, Locking, etc

Pushing all IO to dedicated threads prevents worker code from starving the event loop. Even if your code is non-blocking it will still take time to execute and this becomes a point of contention.

There is no feasible way to make cross thread calling transparent. Promises would make this easier. Promise pipelining effectively allows map-reduce style processing very simple and often without locks. Locks will be needed unless very deliberately avoided by the developer.

If multiple event loops are on dedicated threads then they can be handled without locking or clashes. Producer / Consumer style situation. Core application threads don't need to know which IO thread is handling the TCP socket they are working with.

ozra commented 8 years ago

I see now there were comments while I was writing my novel like monster of a comment.

@waj - since libuv and Boost::ASIO is all I know of these (via node and making C++ extensions for it), and the pure C++ apps. What made it difficult to get it to work with Crystal? Seeing that libevent2 seems to be "not that maintained" and the fewer platforms it supports mainly.

Also, is there a particular reason for the std concurrency constructs not supporting multi threading yet? So I get the grits...

@stakach - what would be the differences to the currently Crystal idiomatic concurrency constructs? Performance wise, and "boilerplate-wise"?

stakach commented 8 years ago

We should rip out all the legacy socket code and replace it with equivalent high performance code. Effectively all IO (File, Network, Pipes etc) go through Libuv running on dedicated IO threads.

Legacy classes can be wrapped to support non-blocking IO with Fibers. https://github.com/celluloid/celluloid does this to make regular ruby code non-blocking. I don't particularly like the actor model however that aspect of Celluloid is awesome and would make sense in a new language.

This would mean libraries don't need to deal with futures, promises or callbacks and it won't slow down highly parallel apps. Anyone who wants raw access to the async callbacks via promises or wants to deal with multiple threads and locking etc can without worrying about how the libraries are performing IO. the exception here would be C libraries however implementors of crystal libraries could use threads and futures / promises to wrap these

There would be a minor IO performance trade off for smaller applications, where it is unlikely to matter. However for high IO applications it would make a world of difference, for both code confidence and performance.

waj commented 8 years ago

I don't think the current IO implementation could be considered "legacy". Current IO already takes care of concurrent operations in a very transparent and comfortable way to me. Thousands of concurrent socket sessions for example, can be currently handled with high performance on a single thread. No paralellism is occurring, granted, and that's why we want to add support for multithread scheduler in the near future. Also, I don't think you can easily beat the current performance for concurrent IO sessions if you try to schedule the IO on separate threads. I'm worried that doing context switching on each IO operation is unnecessary and poorly comparable with the current implementation.

All that being said, I could probably be wrong, and you could demonstrate that to me if you wish to play with libuv and show some real numbers.

ysbaddaden commented 8 years ago

Crystal already has concurrency. All IO is already running on an event loop. We're not using libuv for different reasons. See @waj comment for instance, and IMHO it lacks multithread support and wraps far too much things to be any flexible (ie. it's tied to node.js as a single threaded event loop, and... nobody wants that for Crystal).

Libevent is also actively maintained and supports all operating systems, but it doesn't wrap everything, and has multithread support. IMHO it's a better fit for Crystal evolution to have something smaller and flexible.

What Crystal lacks, for the time being, is parallelism (an event loop that runs and resumes routinea across a pool of threads), and it's a complex thing to put together, as outlined here.

But again, Crystal already has concurrency (spawn) and synchronisation/communication (Channel) of concurrent code. One can already create an actor or promises library on top of it, thought we would prefer to have actors (and I guess supervisors) than promises into the stdlib.

omninonsense commented 8 years ago

What's so good with libuv? —@waj

It's a nice library with lots of general asynchronous task and AIO primitives. I wasn't aware that Crystal already uses libPCL and libevent2; I thought there was no concurrency or threading support at all, which is why I chipped in about adding support for libuv.

I think the real benefit of libuv would be if Crystal switched all IO and all to libuv. Using libuv just for multithreading would be adding unecessary bloat.

What's wrong with the current state of concurrency? I personally think using libPCL for Fibers is great, and as @ysbaddaden pointed out there's the spawn macro/Channel to make it convenient for use.

ysbaddaden commented 8 years ago

Just to clarify: fibers are an internal implementation detail and spawn is the public way to create coroutines (ie. dont use fibers directly).

technorama commented 8 years ago

Most of the foundation to get transparent parallelism working have already been completed by @waj. The IO system is (almost) ready for parallelism with minimal modification The scheduler needs a bit of work and there are thread safety issues scattered in a few places but there's not much left to do,

I have a crystal branch where multithreading (mostly) works. All work has stopped as @waj and @asterite have asked me to stop work on anything thread/parallelism related. I'm not sure why other than they want to do it themselves but also said they don't have a plan and are unwilling to work on a plan publicly, remotely, or in person.

A basic outline was sent to @asterite and @waj. You can see it here: https://github.com/technorama/crystal/wiki/Concurrency-Parallelism

There was also a proposal for adopting the Actor model. https://github.com/technorama/crystal/wiki/Actors

Please excuse the rough nature of the outlines. They assume you know the internals of the scheduler and IO system. Minimal effort was put in as I was already asked to stop working on threading.

Both proposals (and others) have been ignored.

There could have been a working multithreaded crystal over a month ago if I was not asked to stop. I am still unclear on the reasons.

At this point I have no idea how to get parallelism code in to crystal as all attempts to submit code, have plans reviewed or work towards a solution have mostly been ignored.

omninonsense commented 8 years ago

@ysbaddaden Is this going to change, though? I don't personally see a reason for using Fibers directly myself, since spawn seems sufficient in most cases, but it is odd.

ysbaddaden commented 8 years ago

@omninonsense no, just use spawn.

ozra commented 8 years ago

Nice to see the different perspectives. Concurrency is a wide subject, even more so with parallelism thrown in to the batter.

There are merits to all.

I've been to many Erlang/Haskell combo seminars hosted by Ulf Wiger, and I've used the "mailbox part" of that and supervisorish structuring, over processes. I like that. Within one app, the mailbox-method for "everything" would be quite heavy, I've been tinkering with ideas about it. I do believe the mailbox matched receiving is a better model than channels for many things, but channels are more low level and can, with more boiler, be used for more efficient flows. That's a pro for them.

So Actor model is very interesting, Fiber-based - not a thread per actor - imho. Sounds way to heavy.

As @waj points out, using more threads is not necessarily a win. A system is a complex eco-system of processes. So lets say your machine has got four cores. You've got four apps running on your machine most of the time. Now all of them decide it's more performant to use a thread for each core to utilize resources to the max. Now you've got four apps, running one thread on each core. Where's the gain? You've made a net-loss because of synchronization! Of course it's a contrived example.

@stakach - regarding starving the event loop, work always has to be done, and iff work has to be done before other work, there's no difference in throughput anyway, just an additional cost of synchronization. I'm not arguing against multi-thread usage - obviously. I'm just trying to highlight all the aspects of it as my understanding grows from all input here and else where.

For IO work, just getting as much orders to the kernel / net as possible is the biggest gain, the rest of formalia around loading / writing the data is miniscule. So I believe just Fibers works suitably well for that, since the "waiting work" is moved out to the kernel.

@waj - my main concern on libevent2 is - as I understand it - it's platform support is much smaller. Myself I've used Linux only since my Amiga got outdated, but I might very well have to code windows code for some client, and then I'd love to use Crystal. What's the view here?

@technorama @omninonsense @stakach - I assume Crystal team wants to get this "really" right also, and that is the reason why they've recommended the spawn/channels constructs only for now - to keep possibilities of change, and don't accept PR's at the drop of a hat. Once details are public, we're locked in (or in fiftyeleven different incompatible libs zone). And that's also the reason I opened this issue. Because it's encouraging to see some discussions and the reasoning behind it.

@technorama - thanks for the link to your proposals in the wiki!

@ysbaddaden - libuv is used as a single threaded loop in node.js, I assume it would murder V8 otherwise. That doesn't mean you can't run a loop per thread in crystal. I think I'm missing something here, because it seems very multi-thread-ready to me?

I like many of the ideas I hear here. So I hope something fruitful, simple and powerful, that all plays along comes out of it in the end.

ysbaddaden commented 8 years ago

Running 1 event loop per thread would tie a routine to a thread, with no means to resume it on another thread if a routine starts being less IO bound and more CPU bound.

Take Bcrypt for instance. It's CPU bound and it will block the event loop when computing a digest (it should have a cost of 1s or more to be secure), thus blocking all routines on the same thread whenever someone wants to login on your webapp. With a multithreaded event loop, they would be resumed on another thread —as long as they don't get all filled up with people login, that is :-)

stakach commented 8 years ago

@ysbaddaden that is a good example of why I think a dedicated IO thread would make sense. Removes the possibility of inadvertently blocking.

I guess the other reason I see an advantage in callbacks is that they can be scheduled on another thread, whilst fibers are pegged to the thread they were created on.

I am excited to see how threads are implemented @waj and also interested to see what impact offloaded IO has in terms of context switch overheads. I'll let you know once I have an example app

ozra commented 8 years ago

@ysbaddaden - thanks for clarifying! @stakach - looking forward to benches :-)

technorama commented 8 years ago

@ozra The first part of my proposal doesn't change the public API. Further phases add thread groups/pools and a simple API to create them or change the number of running threads (probably 3 new methods). I don't see any reason to turn the first few phases of the proposal down. It wouldn't harm future API changes as the API wouldn't change from it's current state.

ozra commented 8 years ago

@technorama - that is a good way forward in any event I'd think. The only costs are a few locks here and there I presume?

ozra commented 8 years ago

Just a question out of a short example, based on the simple example I mentioned: say I want to load 800 files (given here they will all fit in memory without problem), and giving the kernel and disk best opportunity to make that the fastest, I'd do something like this in node.js for simplicity (in LS - untested code):

fs = require "fs"
my-files = ["foo.file", "bar.file"] # 798 more ...
my-files-contents = new Array(800)
files-rr = Reaper()  # create a reaper instance 

for let f, i in my-files  # loop the files, make sure i is a "copy" for the lambda-closures
    fs.read-file f, files-rr.sow (err, content) ->  # wrap the callback in a cb from files reaper
        my-files-contents[i] = if err then null else content

<- files-rr.reap  # "all-files-done.then"
# Now work with the files contents as wanted synchronously from here
say "File #{files.0} contains:\n#{my-files-contents.0}"

Obviously one would likely work on the files asynchronously too, if more than one thread was avail, but lets take this example. And obviously this node 0.4 based methodology works with callbacks with its drawbacks (stack inversion in traces etc.)

How would this be expressed in the different asynchronous models mentioned?

stakach commented 8 years ago


files = ["foo.file", "bar.file", ...]

promises = files.collect do |file|
    File.open(file).then do |file|
        file.read(file.length)
    end
end

Q.all(promises).then(proc { |file_data|
    puts "File #{files[0]} contains:\n#{file_data[0]}"
}).catch(proc { |error|
    puts "error occurred reading files"
})

of course you could optionally use a fiber on the last promise there to avoid a callback:


files = ["foo.file", "bar.file", ...]

promises = files.collect do |file|
    File.open(file).then do |file|
        file.read(file.length)
    end
end

begin
    # Similar implementation here: https://github.com/cotag/libuv/blob/master/lib/libuv/coroutines.rb
    file_data = Q.all(promises).value
    puts "File #{files[0]} contains:\n#{file_data[0]}"
rescue => error
    puts "error occurred reading files"
end

technorama commented 8 years ago

@ozra Locks will work. Lock free lists would be better (requires cmpxchg). I asked about adding LLVM cmpxchg support to crystal but never received an answer. If someone wants to attempt this it would help (if accepted).

ysbaddaden commented 8 years ago

With existing Crystal (verified):

record Message, filename, contents
channel = Channel::Buffered(Message).new

filenames.each do |filename|
  spawn do
    contents = File.read(filename)
    channel.send(Message.new(filename, contents))
  end
end

filenames.size.times do
  message = channel.receive
  puts "#{ message.filename } contains #{ message.contents }"
end

stakach commented 8 years ago

@ysbaddaden Will the results be ordered?

i.e. If I want to do 3 things, how does one know if all 3 succeeded or failed and then process the results of those tasks?

kostya commented 8 years ago

pmap? if one failed you get exception (https://github.com/manastech/crystal/pull/1611)

filenames.pmap { |filename| File.read(filename) }

asterite commented 8 years ago

@stakach You could make Message's contents be a String | Exception (verified):

record Message, filename, contents
channel = Channel::Buffered(Message).new

filenames = %w(foo.cr goo.go goo.lala)

filenames.each do |filename|
  spawn do
    begin
      contents = File.read(filename)
      channel.send(Message.new(filename, contents))
    rescue ex
      channel.send(Message.new(filename, ex))
    end
  end
end

filenames.size.times do
  message = channel.receive
  filename = message.filename
  contents = message.contents
  case contents
  when String
    puts "#{filename} contains #{contents.inspect}"
  when Exception
    puts "#{filename} raised #{contents.inspect}"
  end
end

Or have the channel be of Message | Exception (but there you loose the filename), or maybe have another record Failure, filename, exception and have the channel be of Message | Failure.

The idea is that with spawn + channels you can do whatever you want. If you later need patterns on top of those primitives you can do them just fine because we can abstract over types, generics, etc, and we support union types.

sdogruyol commented 8 years ago

we can abstract over types, generics, etc, and we support union types.

@asterite is it zero cost abstraction?

asterite commented 8 years ago

@sdogruyol Depends on how you do it and how smart the compiler/LLVM is. If you use a wrapper struct than it's zero overhead. If you hide something behind a method and LLVM can inline it, it's zero overhead. If you wrap something with a class there's heap allocation overhead, unless the compiler some day does escape analysis and determines it can allocate that on the stack. And so on...

ozra commented 8 years ago

Nice discussion here, I like it.

@ysbaddaden - regarding the moving of coroutines to other threads - I'm not at all convinced that's a good thing (TM). There's a complexity increase in reasoning regarding possible thread global references, and it could be argued that the program has a design error to begin with. I/O and longer "side work" would be better of put on their own threads to begin with then. I have no problem with application "main logic" following a synchronousish flow and block slightly then and when, the parallel "outbreaks" should work on other threads. Sticking to a mainly synchronous flow is much easier to reason about, and a full scale parallel to the bone monster should be rather rare, and then other patterns akin to the Erlang side of the universe might be better incorporated (these low level things aside). Thoughts?

@waj - I'm sorry if I'm repeating myself and starting to sound like a nagging child, but what are the thoughts on platform breadth and event-lib chosen? I still feel libuv has an upper hand concerning that?

@technorama - I agree atomic "instructions" in Crystal is definitely a must, and since LVVM has them down for C++11 stdlib compatibility, we should have them in Crystal - (there is a dependency on libatomic currently, I haven't looked where/how it's used - dep-dep?), giving tighter integration for LLVM optimizations.

The fact that a certain atomic operation might not always be possible to generate on a certain target architecture should simply be "be aware of" information imo - I would prefer it to not compile instead of transparently replacing the ops with expensive mutexes etc. to emulate it. After all, one would use the atomic operations for performance, so when not available on a target, one would likely choose another implementation for that specifically, or drop support for it (trying a 64 bit atomic op on a 32 bit arch, etc.).

The go-style channels seems to me a slightly primitive form of working - granted they enable more complex abstractions on top - but the "go way" smells a bit of "if the hammer is the only tool, every problem looks like a nail" (I read somewhere, and found the phrase fitting). Intuitively though, it feels like they're a more performant way (memory overhead vs synchronization), but that's by just vibing it - no bench done here ;-)

The LS example I gave would have worked with multi thread parallel work (if the engine had it), since each index is orderly set specifically - the unique location is given before async work. Of course It's a very specific and simple example though. Whether an error occurred could also be determined synchronously without syncing, in the data afterwards - (assume correct working -> fast, spend time if shit went wrong). Of course if errors are expected and/or are show stoppers it would be better to set a flag to avoid unnecessary work being performed, and that would be very quick with an atomically set/read cond. var.

As a parentheses the above Reaper can take index/key as part of the collecting and take care of the whole ordered result array business when wanted, I avoided that to make the example clearer. It seems @kostya 's approach with pmap is very clean here.

I would really love to see several different common problems written in the different styles - it would be highly enlightening, also thoughts on performance and memory usage! (preferably benchmarks) In the end, of course straight forward code staving away from error prone flows is prio one. If anyone has a good sample and others would chip in with a different approach it would really make my day! Practical examples says more than a thousand flame wars :-)

ysbaddaden commented 8 years ago

Please note that I state my personal opinion here. Not waj or asterite or anybody from the core team.

@ozra How to solve my Bcrypt stops the world for 1 second problem? Granted, GC already stops the world, but for a tiny fraction of that time.

I can spawn a single coroutine that would compute Bcrypt digests and send it back through channels. But with a single event loop on a single thread? It would still stop the world, just preventing to have many coroutines that would do the same in a row.

Maybe Thread.new could create a new event loop. Sounds cool and easy. Even libuv can do it. I could spawn a thread instead of a coroutine and have it compute digests. No more stops the world! But how to send back the digest? Through channels? They'd have to work across different event loops... which sounds hard, because unnatural. Not even mentioning that we'd have 2 concurrency models, one is parallelism and the other one is concurrency only, but concurrent in their specific parallel world (I'm getting confused...). That's a lot to keep in mind when designing software. I guess I should just have a Bcrypt server and to communicate through an UNIX socket.

Now, maybe the event loop could run across many threads. That sounds complicated to put in place (and looks impossible with libuv), especially with efficient locking. I could then spawn my Bcrypt coroutine and communicate through channels, exactly like with the event loop running on a single thread: there is no difference for the developer, and a single concurrency and parallelism model to reason about. Except that computing a Bcrypt digest won't stop the world now (as long as there are at least 2 threads, of course), just slow it down.

ozra commented 8 years ago

@ysbaddaden - of course. Yes, for a computation taking one sec, I think I'd definitely would go with messaging, whether it's to a worker in another thread, or - I'd prefer - in it's own process. Using nanomsg, ZMQ or plain socket. My comment was on the idea that it runs in another thread, and is messaged. libuv multi-loops allows communicating through async message events "sent" into each other's loops, well: one way of doing it.

As soon as the cost of messaging is much lower than the work, I like the concept of external processes. Makes it possible to upgrade code of parts of the application, without stopping the main-logic, etc. But surely tastes vary. :) It might be because of the time I had with node that I liked delegating jobs out to daemons, seeing that node is just one thread.

ozra commented 8 years ago

Revisiting a bit.

Whether or not direct Thread control becomes an "acceptable use" or even "open part" of the languages/libs, a measure of control over fiber-thread-affinity (not thread-core-affinity, that's a different level entirely) is important.

One thing is making sure two pieces of algorithms / code are ensured to run in different threads, because they're very nature is making sure they work in parallel to maximize throughput. So some construct for spawning code and saying "in different-thread from some-fiber".

But then, for the second "level" of control, it might also be good to be able to set aside a core within a program for specific threads -> fibers, where one instead wants to maintain as high responsivity on action as possible.

These things should be possible to integrate in a fashionable manner without making it overly low-level smelling.

asterite commented 8 years ago

The GC and Runtime issue will be tackled by @waj and me (well, probably thought more by @waj and then we'll both implement it :-P), and we'll definitely see how other languages do this and see what fits best for Crystal. I can imagine implementing a GC of our own, because right now it's that or modify boehm, but in Crystal we can have a precise GC.

I'm closing this, a place like Google Groups is better for this kind of discussions.

crystal-lang / crystal