daokoder / dao

Dao Programming Language
http://daoscript.org

Failure isolation in tasklets #363

Closed Night-walker closed 9 years ago

Night-walker commented 9 years ago

An unhandled exception within a task should not propagate to other tasks. It should be isolated in the corresponding future value, ensuring that the program won't collapse like a house of cards when a single card is pulled out. That is, a future value should be viewed not as a guaranteed value, but as a possible one.

Failure isolation is necessary for a robust, failure-resistant concurrency implementation. That is what you can see in Erlang and Rust. Not in Go, true -- but there panics are not 'ordinary' exceptions, not in the sense of Dao exception handling: panicking in Go is a last resort, a red button for self-destruction when all the defenders have fallen and the foes are ramming the last door. Even though Dao adopted the revised defer-panic-recover concept, its use covers a lot of trivial cases which can often be resolved without bloodshed.

So, it's generally good if an exception is not allowed to break out of its task and sink the mighty Titanic because a negligent sailor forgot to close the valve. If you really want to end the miserable life of your program in a single motion, there is an old but still practiced approach of leaving a death note and committing sys.exit().

But isolating a failure doesn't mean it should be allowed to be ignored as if nothing happened. If the task failed and its future value was not examined during its lifetime, the destructor of this future value should throw a warning signalling to the user that a failure was overlooked.

The described change could be incorporated into the interface either by extending the Future::value() result with a possible Error (or list<Error>), or perhaps by defining its result as @T|none and adding a separate method for fetching the possible error(s). The latter variant allows one to benefit from @T|none => @T auto-casting, keeping the code simple when you are confident that no error can occur.
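
Purely for illustration, here is a minimal sketch of this idea in Go (Go is used only because it makes the comparison easy; Future, spawn, Value and Err are invented names, not an existing Dao or Go API). The failure stays inside the future, and the caller can ask either for the possible value or for the possible error:

package main

import (
    "errors"
    "fmt"
)

type Future struct {
    done chan struct{}
    val  int
    err  error
}

// spawn runs f in its own goroutine; a failure is captured in the future
// and never propagates to other goroutines.
func spawn(f func() (int, error)) *Future {
    fut := &Future{done: make(chan struct{})}
    go func() {
        defer close(fut.done)
        fut.val, fut.err = f()
    }()
    return fut
}

// Value waits for completion and returns the result plus a flag telling
// whether it is actually valid (the @T|none flavour).
func (f *Future) Value() (int, bool) { <-f.done; return f.val, f.err == nil }

// Err waits for completion and reports the possible failure (the separate
// error-fetching method).
func (f *Future) Err() error { <-f.done; return f.err }

func main() {
    ok := spawn(func() (int, error) { return 42, nil })
    bad := spawn(func() (int, error) { return 0, errors.New("task failed") })

    if v, valid := ok.Value(); valid {
        fmt.Println("result:", v)
    }
    // The failed task did not take anything else down; its error is just a value.
    fmt.Println("error of failed task:", bad.Err())
}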

That's all.

dumblob commented 9 years ago

Hm, this tricky question is not easy for me to decide immediately, as the proposed approach would require always checking the returned future value for errors, which is quite contrary to what exceptions in Dao are good for. I'll try to find some data foundation about which approach seems better. Or maybe a combination - "internally" catch the exception into the future value (while stopping the execution of the particular thread), and when reading the future value, the caught exception would be "rethrown", but this time in the collecting/main thread.

or perhaps by defining its result as @T|none and adding a separate method for fetching the possible error(s).

This would disallow returning @T|none from the routine/method itself (where none might be a legitimate/expected result), so I'd rather avoid this approach.

Night-walker commented 9 years ago

the proposed approach would require always checking the returned future value for errors

Threads aren't used often enough for it to be an inconvenience. And reliability usually matters more than a few saved characters to type. Nevertheless, I mentioned how to avoid extra type checks with two methods.

Or maybe a combination - "internally" catch the exception into the future value (while stopping the execution of the particular thread), and when reading the future value, the caught exception would be "rethrown", but this time in the collecting/main thread.

I have a feeling this is how it works now, as I can hardly imagine how else one thread could catch an exception from another one given that their flows of execution have diverged. And this means you always have to safeguard the Future::value() call, so it boils down to the exact same situation.

This would disallow returning @T|none from the routine/method itself (where none might be a legitimate/expected result), so I'd rather avoid this approach.

It does not forbid anything; you can always distinguish the result by having two methods.

dumblob commented 9 years ago

Actually, it seems to work yet a bit differently. An exception from a thread is issued immediately. The thread is suspended and nothing causes the rest of the program to fail. Once you query the .value() from the suspended thread, here be dragons - the whole VM immediately stops.

So, I'd like to have the exception caught to the future value (i.e. not issued immediately), and rethrown when .value() is called.

load os

routine r() {
  defer(any as e) { io.writeln('ERR_IN', e) }
  os.sleep(0.5)
  io.writeln('r')
}
routine r2() { std.error('r2') }

v1 = r()!!
v2 = r2()!!

std.exec {
  v1.value()
  defer(any as e) { io.writeln('ERR_OUT', e) }
  #v2.value()  # <-- try uncommenting this and dragons appear
}

io.writeln('end')

Night-walker commented 9 years ago

That's more or less how I suspected it to work. It is obvious that one has to safeguard potentially unsafe .value() calls, as otherwise the whole program may die. Getting exceptions as a normal return from the future value instead makes the possibility of errors more apparent to the user, and allows handling them without an extra defer {...} or std.try(){}.

The downside, however, is the chance of ignoring the error when no result was required from the failed task. There are no means to force the programmer to check the future's value, and there certainly are cases when fail-fast is more appropriate and safe than confining exceptions within their contexts.

It depends on the program's infrastructure. If the latter consists of loosely coupled, relatively independent tasks (like actors), it's obvious that a single task's failure is not a reason to ruin the whole system. On the other hand, if the code solves a single task in parallel, where each step is required to be completed in order for the whole task to succeed, then suppressing local failures may be dangerous.

So it is eventually determined by the language's ideology and focus.

dumblob commented 9 years ago

So it is eventually determined by the language's ideology and focus.

Now we're building such language ideology ;)

I have learned over the past years that Dao prefers exceptions. I don't like the idea of having the exception automatically transformed into a return value - it's inconsistent with the whole rest of Dao. One can instead use std.try() to catch them in the thread and return whatever they like through the future's .value().

dumblob commented 9 years ago

Catching exceptions directly in the thread is also better practice, as it localizes which exceptions (and where exactly) one can actually expect.

Night-walker commented 9 years ago

I don't like the idea of having the exception automatically transformed into a return value - it's inconsistent with the whole rest of Dao.

Quite the contrary. There are many situations where @T|none has already replaced error raising, and I am considering increasing the scope of this approach. The resulting code is often simpler to use, most likely more efficient, and definitely more apparent (and therefore more controllable). It prompts the user to check for possible errors, reducing the chance of overlooking one.

Catching exceptions directly in the thread is also better practice, as it localizes which exceptions (and where exactly) one can actually expect.

It's not a solution anyway. You can't rely on such abstract things as practices, conventions and "no-one-would-do-that" when dealing with someone else's code.

Overall, it seems to me that the chance of overlooking a local failure is the lesser evil compared to the possibility of abnormal termination of all tasks -- they may be in an undetermined state during the failure, leading to incomplete/corrupted data being produced. Making errors more apparent by moving exception handling from external tools like defer {} to the future value's own interface should probably be safer.

dumblob commented 9 years ago

There are many situations where @T|none has already replaced error raising, and I am considering increasing the scope of this approach. The resulting code is often simpler to use, most likely more efficient, and definitely more apparent (and therefore more controllable). It prompts the user to check for possible errors, reducing the chance of overlooking one.

I absolutely agree. We already discussed it somewhere. I myself don't like exceptions and consider them more like Go's panicking (i.e. the very last resort solution). Go on with replacing them with @T|none where possible (from the semantic point of view).

compared to the possibility of abnormal termination of all tasks -- they may be in an undetermined state during the failure, leading to incomplete/corrupted data being produced.

Of course, that's how it's implemented currently, but not what I proposed. Consistency should be kept across threaded and non-threaded code, e.g. like:

routine r() => int { std.error('really exceptional locally unrecoverable state'); return 5 }
std.exec {
  defer (any as e) { io.writeln('ERR', e) }
  ret = r()
  io.writeln(ret)
}

routine r() => int { std.error('really exceptional locally unrecoverable state'); return 5 }
std.exec {
  defer (any as e) { io.writeln('ERR', e) }
  ret = r()!!
  io.writeln(ret.value())
}

Note that the only modification I made is the withdrawal of the value; everything else is the same. Of course, such a rewrite doesn't matter in practice. What matters is the idea behind it - i.e. one just has to be aware that the returned value has to be withdrawn with .value(), and nothing else.

Night-walker commented 9 years ago

In your code, if you don't place a defer block before fetching the future value, the whole program will terminate. If there are other concurrent tasks which interact with the outer world, things may end badly.

Converting an exception into a value won't be different if one assumes that nothing can go wrong, but it is more likely to attract the attention of the programmer to the fact that an error may take place.

dumblob commented 9 years ago

In your code, if you don't place a defer block before fetching the future value, the whole program will terminate. If there are other concurrent tasks which interact with the outer world, things may end badly.

Yes, but assuming there is no defer block in the examples, I still don't see why we should not terminate everything when threads are used and do terminate everything when they're not used. The risk is the same - in both cases we can have pretty much the same interaction with the outer world.

Converting an exception into a value won't be different if one assumes that nothing can go wrong, but it is more likely to attract the attention of the programmer to the fact that an error may take place.

Sure, but this decision (use a return value or an exception) should be left up to the programmer, as it's highly case-specific, as you described in https://github.com/daokoder/dao/issues/363#issuecomment-73403743 .

Night-walker commented 9 years ago

Yes, but assuming there is no defer block in the examples, I still don't see why we should not terminate everything when threads are used and do terminate everything when they're not used. The risk is the same - in both cases we can have pretty much the same interaction with the outer world.

There is no risk when there are no concurrent threads. When an exception is raised, the context in which it arises is in a determined state, and it's the only context currently executing. There are no tasks running in the background which may be affected by the unexpected termination. Conversely, the more tasks are active, the higher the risk that they will be killed at an "inappropriate" moment, because their state is undetermined at that point.

Sure, but this decision (use a return value or an exception) should be left up to the programmer, as it's highly case-specific, as you described in #363 (comment) .

It can't be left to the programmer because tasklets and future values are part of Dao core and the approach we choose will be the standard and idiomatic one.

dumblob commented 9 years ago

There is no risk when there are no concurrent threads.

Considering what happens in the outer world, we're talking about I/O (in Dao it's asynchronous by default), other processes and communication with them using whatever means, network communication (including e.g. databases) etc. All of this carries the same risk, with or without threads, when interrupted. So no determined state exists. All of them are undetermined per se.

It can't be left to the programmer because tasklets and future values are part of Dao core and the approach we choose will be the standard and idiomatic one.

I think it can be left up to the programmer - either way, the programmer can get the error from the return value and throw it again, or conversely catch the exception and return it as an error. Both are quite easy to achieve, but one of them is inconsistent (while silently allowing the returned error to go unchecked), and both change the place where the exceptional case is handled (from the place of the routine/method call to the .value() or .the_separate_method_for_fetching_errors() call). Neither of them guarantees anything.

Btw, what would you do in a case when .the_separate_method_for_fetching_errors() returned some error(s) and the programmer called .value() anyway? Returning none is nonsense (as none can be a legitimate value meaning something very different from "oops, an unexpected failure occurred"), and throwing an exception would again violate your precondition that no running threads except the main one get interrupted.

Night-walker commented 9 years ago

we're talking about I/O (in Dao it's asynchronous by default)

And where have you found async IO in Dao?) It's all synchronous: streams, sockets, all that stuff. Only os.process provides some non-blocking (but not asynchronous) IO.

All of this carries the same risk, with or without threads, when interrupted. So no determined state exists. All of them are undetermined per se.

Fortunately, you're wrong) Normally, you can determine at which points an exception can be raised in your code, and where it cannot possibly appear. You can't insert an item into a map, write to a file, send to a socket, etc. and raise an error in a single thread at the same time. It always happens sequentially within a single context. And you can't corrupt internal Dao data this way unless some underlying C code is not written correctly.

But if a thread can be terminated from the outside, you can't know at which point in the thread this may happen. I'm not sure DaoVM is able to give firm guarantees about the atomicity of operations, particularly when calling some black-box C functions. Terminating a thread could then easily break the consistency of internal data, which cannot happen when the thread itself raises an exception (again, assuming that the C code is fine).

So it's not the same. When a thread raises an exception, it's normal and recoverable. When a thread is terminated, it's abnormal and potentially dangerous.

Btw, what would you do in a case when .the_separate_method_for_fetching_errors() returned some error(s) and the programmer called .value() anyway? Returning none is nonsense (as none can be a legitimate value meaning something very different from "oops, an unexpected failure occurred"), and throwing an exception would again violate your precondition that no running threads except the main one get interrupted.

Return none, what else? The fact that it may be a normal return value can be checked with the dedicated method. In fact, why call .value() at all for tasks which don't return anything? Call the other method, say, .error() to wait for the task to end and see if it succeeded.

dumblob commented 9 years ago

Fortunately, you're wrong ... So it's not the same. When a thread raises an exception, it's normal and recoverable. When a thread is terminated, it's abnormal and potentially dangerous.

Well, I thought you were talking about the outer world, not the internal Dao world. You're right, I meant non-blocking I/O. Still, all the other examples hold. To me it doesn't matter whether the DB TCP connection died and some other locally opened file got corrupted at the same time due to an unhandled exception in a single-threaded program or a multi-threaded one. The only difference is what you've written - in the internal Dao world (to me this doesn't matter in the case of an unhandled exception, as the resulting effects seem to be the same), but not outside.

Now, considering only the internal Dao world, I'm still not confident that it's such a big issue in practice. E.g. I haven't heard that Go users have significant problems with panics across multiple threads leading to corrupted data. I suppose that caching and similar techniques used in the os interface are robust enough that this simply doesn't do much harm. Of course, I'd deserve a hard spank for saying this, because we mustn't rely on that. But remember, we're still talking about an exceptional case.

Return none, what else? The fact that it may be a normal return value can be checked with the dedicated method. In fact, why call .value() at all for tasks which don't return anything? Call the other method, say, .error() to wait for the task to end and see if it succeeded.

routine get_result_from_run_on_random_item(l: list<@T>, routine<@T=>@T>) => none|@T {...}
routine r() {
  some_really_long_work_which_may_raise_exception()
}
routine work_with_big_impact(x) {
  if (x == none) do_especially_dangerous_initialization()
  do_the_important_work()
}
myl = (list<any>){}  # an empty list
feat = get_result_from_run_on_random_item(myl, r)!!

# silently_return_error approach {{{
if (e = feat.error(); e == none) io.writeln('ERR whoops, I switched the == for !=', e)
else work_with_big_impact(feat.value())
# }}}

# recklessly_issue_exception approach {{{
# version1: don't check exceptions for work_with_big_impact()
std.exec {
  defer (any as e) { io.writeln('ERR whoops, I might have forgotten to use defer', e) }
  v = feat.value()
  defer (none) { work_with_big_impact(v) }
}
# version2: check also work_with_big_impact() for exceptions
std.exec {
  defer (any as e) { io.writeln('ERR whoops, I might have forgotten to use defer', e) }
  work_with_big_impact(feat.value())
}
# }}}

No, none is not an option. I'm more afraid of running work_with_big_impact() in good faith than of corrupting some data in threads (even in terms of security, corrupted data doesn't cause much harm). The silently_return_error approach solves the issue of not interrupting other threads accidentally, but opens the door to a zillion other errors by converting an exceptional case into a regular state the program can come to.

Night-walker commented 9 years ago

Well, I thought you were talking about the outer world, not the internal Dao world. You're right, I meant non-blocking I/O. Still, all the other examples hold. To me it doesn't matter whether the DB TCP connection died and some other locally opened file got corrupted at the same time due to an unhandled exception in a single-threaded program or a multi-threaded one. The only difference is what you've written - in the internal Dao world (to me this doesn't matter in the case of an unhandled exception, as the resulting effects seem to be the same), but not outside.

The difference is not only in effects on internal data. When you perform operations involving writing to a file, exchanging data over the network, etc., they, of course, may end in raising exceptions. But those exceptions are nevertheless raised at certain determined points between individual high-level operations. You can at least be sure that the lower-level code written in C (if it's correct) performs as intended without being interrupted somewhere in the middle of a wrapped C function -- a Dao exception cannot compromise that. But thread termination can. It can catch a thread in an arbitrary internal state, possibly breaking the atomicity of Dao-level operations. That is plainly impossible when you have just one thread which ends itself with an exception.

For instance, if you have a single thread calling net::Socket::send(), you're guaranteed that the data you passed to it will all be sent (unless an unrecoverable OS-level error/interruption occurs), regardless of what could have happened before the call and what may happen after it. But that is not true for a concurrent environment: the thread may be terminated within the loop which repeatedly calls C's send() until all the data is transferred, thus resulting in the data structure being corrupted.

No, none is not an option.

I don't see why. You'd have the same information with two methods as with a single one returning @T|Error. It's a viable option, even though not necessarily the best.

The silently_return_error approach solves the issue of not interrupting other threads accidentally, but opens the door to a zillion other errors by converting an exceptional case into a regular state the program can come to.

... and propagating errors to other threads solves the problem of the suppressed exceptional state, but opens the gates for countless other errors caused by the unexpected termination of all of the program's threads. Unique errors which cannot be produced in any way other than by terminating threads relentlessly.

dumblob commented 9 years ago

The difference is not only in effects on internal data ... until all the data is transferred, thus resulting in the data structure being corrupted.

I agree, but I am still biased due to one (fortunately the only one so far) extreme problem I caused by treating an exception as a regular program state. The impact was horrible, and anything would have been better than that (it executed code which performed many important operations on remote machines, instead of failing, reporting an error, crashing, or corrupting local data or even the whole local computer). That's why I'm not so much concerned about the "everything fails locally" approach.

This reminds me of the "holy war" over the proper way to handle malloc() (or similar) returning NULL. In GNU SW (including glibc), it's common to immediately crash and not care about anything (as it's the easiest way to keep everything KISS, reliable and maintainable). But there are other guys who say no and have many different (usually valid) arguments for that. Nevertheless, e.g. the GNU glibc is used on thousands of production, long-running and reliable systems.

In my opinion, we're now discussing something similar. I think we should wait for @daokoder, as he's able to choose the solution which will better suit the DaoVM implementation.

Night-walker commented 9 years ago

This reminds me of the "holy war" over the proper way to handle malloc() (or similar) returning NULL. In GNU SW (including glibc), it's common to immediately crash and not care about anything (as it's the easiest way to keep everything KISS, reliable and maintainable). But there are other guys who say no and have many different (usually valid) arguments for that. Nevertheless, e.g. the GNU glibc is used on thousands of production, long-running and reliable systems.

A NULL return from malloc() is extremely unlikely and indicates a severe problem. A task failure in a multithreaded system should not necessarily be treated in a similar fashion. For instance, a web server won't terminate itself if it receives a malformed packet from a single client. You wouldn't say that this makes it unreliable, would you? But there are no "right" and "wrong" approaches here, of course.

dumblob commented 9 years ago

A NULL return from malloc() is extremely unlikely and indicates a severe problem.

This actually depends (only) on the OS, and e.g. in times when memory was scarce, optimistic allocators such as the ones Linux (and other modern platforms as well) uses were not the majority. So it might or might not be a severe problem.

For instance, a web server won't terminate itself if it receives a malformed packet from a single client.

According to my understanding of exceptions, if it's a packet, then it's nothing exceptional (even if its content is random and useless for the web server). It would be exceptional if the packet itself weren't a packet (e.g. due to some buffer overflow or whatever). Then it's exceptional, i.e. a serious issue.

Also, daemons always have a different structure than other programs - e.g. I'd expect all of them to have a "master" catch for any exception, at least to release resources properly, close descriptors, end connections in a friendly way, etc.

But there are no "right" and "wrong" approaches here, of course.

Exactly, and that's why I want to wait for @daokoder.

Night-walker commented 9 years ago

This actually depends (only) on the OS, and e.g. in times when memory was scarce, optimistic allocators such as the ones Linux (and other modern platforms as well) uses were not the majority. So it might or might not be a severe problem.

Regardless of the OS, the inability to allocate heap memory is likely a fatal problem. Unless you know where else to go to get some more memory )

According to my understanding of exceptions, if it's a packet, then it's nothing exceptional (even if its content is random and useless for the web server). It would be exceptional if the packet itself weren't a packet (e.g. due to some buffer overflow or whatever). Then it's exceptional, i.e. a serious issue.

OK, a packet with incorrect payload length causing buffer overflow. Serious enough? )

Night-walker commented 9 years ago

I wonder if we can adopt both approaches ) Say, the current fail-fast .value(...) => @T and an additional fault-tolerant .result(...) => @T|Error.
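
A rough Go sketch of how the two accessors could coexist on one future value (again purely illustrative, with invented names; this is not the Dao API): Value() rethrows the stored failure in the calling thread (fail-fast), while Result() hands it back as an ordinary value (fault-tolerant):

package main

import (
    "errors"
    "fmt"
)

type Future struct {
    done chan struct{}
    val  int
    err  error
}

func spawn(f func() (int, error)) *Future {
    fut := &Future{done: make(chan struct{})}
    go func() {
        defer close(fut.done)
        fut.val, fut.err = f() // the failure is stored, not propagated
    }()
    return fut
}

// Value is the fail-fast accessor: it re-raises the failure in the caller.
func (f *Future) Value() int {
    <-f.done
    if f.err != nil {
        panic(f.err)
    }
    return f.val
}

// Result is the fault-tolerant accessor: the failure comes back as a value.
func (f *Future) Result() (int, error) {
    <-f.done
    return f.val, f.err
}

func main() {
    fut := spawn(func() (int, error) { return 0, errors.New("task failed") })

    if _, err := fut.Result(); err != nil { // tolerant path
        fmt.Println("handled locally:", err)
    }

    defer func() { fmt.Println("caught:", recover()) }() // fail-fast path
    fut.Value()                                          // re-raises the failure here
}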

dumblob commented 9 years ago

Sounds viable - so I'm pro. Btw, are all the closing parentheses at the ends of paragraphs smileys?

Night-walker commented 9 years ago

Btw, are all the closing parentheses at the ends of paragraphs smileys?

Ehh... yeah :) Shortened versions. They don't get substituted into smiley images.

daokoder commented 9 years ago

There are many situations where @T|none has already replaced error raising, and I am considering increasing the scope of this approach.

I think the methods where X|none is used are those where the errors are minor and the users are expected to handle them (almost) every time such methods are used. The scenarios for tasklets should be quite different.

Conversely, the more tasks are active, the higher the risk that they will be killed at an "inappropriate" moment, because their state is undetermined at that point.

This is not the case. I haven't gone through all the comments, but I got the impression that the proposal is founded on the belief that a failed tasklet will bring all other tasklets down. But this is not the case, or at least it shouldn't be. In fact, those tasklets are expected to execute until the joining points. A simple example:

load os

routine Ok()
{
    for( i = 1 : 10 ){
        io.writeln( i ) 
        os.sleep(0.5)
    }   
}

routine Wrong()
{
    std.error( "error" )
}

Ok()!!
Wrong()!!

I think enforcing such behavior is better than risking the possibility (which is inevitable for the proposed solution) that a tasklet might fail silently without being handled.

Night-walker commented 9 years ago

This is not the case. I haven't gone through all the comments, but I got the impression that the proposal is founded on the belief that a failed tasklet will bring all other tasklets down. But this is not the case, or at least it shouldn't be. In fact, those tasklets are expected to execute until the joining points.

Right, I had not considered such a simple approach. But how do you deal with deadlocks? In a producer-consumer scenario, what becomes of the consumer if the producer dies? Erlang, for instance, has the notion of linked (dependent) processes, which supposedly resolves such cases. I do remember some kind of "deadlock detected"-like message though...
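
To make the scenario concrete, here is a tiny Go illustration (Go behaves similarly enough here to serve as a comparison; the setup is made up): the producer fails before it ever sends, so the consumer would block forever -- the timeout exists only to make the hang visible:

package main

import (
    "fmt"
    "time"
)

func main() {
    ch := make(chan int)

    // Producer: dies before sending anything, so nobody will ever write to ch.
    go func() {
        defer func() { fmt.Println("producer recovered:", recover()) }()
        panic("producer died")
    }()

    // Consumer: a plain <-ch would hang indefinitely without any indication.
    select {
    case v := <-ch:
        fmt.Println("received:", v)
    case <-time.After(time.Second):
        fmt.Println("consumer is stuck: the producer is gone and the channel was never closed")
    }
}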

daokoder commented 9 years ago

But how do you deal with deadlocks?

Wait and exit after all tasklets end up in waiting states? This cannot eliminate the risk you mentioned, but it can reduce it. And it is actually the current behavior.

Night-walker commented 9 years ago

Wait and exit after all tasklets end up in waiting states?

Emm, not very reliable... If there are independent (e.g., background) active threads, deadlocked tasks may hang indefinitely without any kind of indication that something went wrong. One won't even be able to release the deadlocked tasks if the need for this was not anticipated beforehand.

There should be some kind of a mechanism to help in such cases, at least to some extent. Something to tie tasks and/or means of communication/synchronization together.

Night-walker commented 9 years ago

The idea I have in mind revolves around channels. A channel is naturally a producer-consumer medium, so why not build means of task regulation right into it? If tasks register their roles at the channel they communicate through, many deadlock issues are resolved automatically.

So, suppose there are channel methods

register(self: Channel<@T>, role: enum<producer,consumer>)
unregister(self: Channel<@T>)

register() assigns a certain role to the task calling the method. unregister() withdraws this role. If the task fails with an exception, it automatically unregisters itself.

If a channel is empty and there are no registered producers, receiving from it will end the same way as if it were closed. Accordingly, sending into a full channel with no consumers should not happen. Sort of a smart channel :)

If one actor (producer or consumer) is launched before the other, the task responsible for spawning actors may temporarily take the role of a fake producer/consumer. It may sound a bit complicated, but it's a very straightforward and safe approach which ensures that (most) deadlocks caused either by an unexpected exception or erroneous logic simply won't happen.
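
For illustration, a rough Go sketch of the producer side of this role model (all names -- RoleChannel, RegisterProducer and so on -- are invented for the sketch, not the proposed Dao interface): the wrapper counts registered producers and closes the underlying channel once the last one unregisters, which also happens when a producer dies, so consumers stop waiting instead of deadlocking:

package main

import (
    "fmt"
    "sync"
)

type RoleChannel struct {
    ch        chan int
    mu        sync.Mutex
    producers int
}

func NewRoleChannel(capacity int) *RoleChannel {
    return &RoleChannel{ch: make(chan int, capacity)}
}

func (c *RoleChannel) RegisterProducer() {
    c.mu.Lock()
    c.producers++
    c.mu.Unlock()
}

// UnregisterProducer also runs on failure (via defer), so a dead producer
// cannot leave the consumers waiting forever.
func (c *RoleChannel) UnregisterProducer() {
    c.mu.Lock()
    c.producers--
    if c.producers == 0 {
        close(c.ch) // no producers left: receiving now behaves as on a closed channel
    }
    c.mu.Unlock()
}

func (c *RoleChannel) Send(v int)           { c.ch <- v }
func (c *RoleChannel) Receive() (int, bool) { v, ok := <-c.ch; return v, ok }

func main() {
    ch := NewRoleChannel(4)
    ch.RegisterProducer()
    go func() {
        defer ch.UnregisterProducer()    // runs even if the producer fails
        defer func() { _ = recover() }() // the failure itself stays isolated
        ch.Send(1)
        panic("producer failed halfway")
    }()

    for {
        v, ok := ch.Receive()
        if !ok {
            fmt.Println("no producers left, stopping") // instead of hanging forever
            return
        }
        fmt.Println("got", v)
    }
}

The same counting could be applied to registered consumers for the sending direction.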

dumblob commented 9 years ago

Well, if I understand the (un)register() approach correctly, these methods would need to be called in the producer/consumer routines, right? That means that, at the other complementary place, one would always need to know whether the routine called register() inside its body or not. Otherwise, the behavior of receiving from the channel will be unknown. Not to mention calling the register()/unregister() pair more than once in the routine.

Emm, not very reliable... If there are independent (e.g., background) active threads, deadlocked tasks may hang indefinitely without any kind of indication that something went wrong. One won't even be able to release the deadlocked tasks if the need for this was not anticipated beforehand.

I'm quite certain that this is exactly the problem in any multi-threaded application. Actually, in this case, I'd rather provide some last-resort solution for releasing deadlocked tasks (which comes in handy in case they're independent) than try to make them pseudo-safe in advance while sacrificing clarity and simplicity.

Night-walker commented 9 years ago

That means that, at the other complementary place, one would always need to know whether the routine called register() inside its body or not. Otherwise, the behavior of receiving from the channel will be unknown. Not to mention calling the register()/unregister() pair more than once in the routine.

Then register() can be a code section routine which simply has certain scope. No need for unregister(), no call pairing, etc.

I'm quite certain that this is exactly the problem in any multi-threaded application. Actually, in this case, I'd rather provide some last-resort solution for releasing deadlocked tasks (which comes in handy in case they're independent) than try to make them pseudo-safe in advance while sacrificing clarity and simplicity.

You can't release deadlocked tasks if you don't even know they're hanging. Also, releasing implies interacting with synchronization means (e.g., channels) used in those tasks.

BTW, neither simplicity nor clarity will suffer because of a few routine calls. Rather the contrary: one will clearly see what each task is supposed to do.

dumblob commented 9 years ago

Then register() can be a code section routine which simply has certain scope. No need for unregister(), no call pairing, etc.

How should it behave in the following case(s)?

load mt
ch = mt.Channel<int>()
routine r(c: mt.Channel) {
  for (i = 0:rand(50)) {
    c.register($producer) { c.send(5) }
  }
  for (i = 0:rand(50)) c.send(7)
  c.cap(0)  # close channel
}

# choose one of the following two possible use cases
# 1.
r(ch)!!
while (x = ch.receive(); x.status != $finished) ;

# 2.
#ch.register($consumer) {
#  r(ch)!!
#  while (x = ch.receive(); x.status != $finished) ;
#}

To me, it's either not clear how this will behave, or maybe it is clear but the behavior is awkward.

You can't release deadlocked tasks if you don't even know they're hanging. Also, releasing implies interacting with synchronization means (e.g., channels) used in those tasks.

I meant something really last-resort - similar to the watchdog concept, but with the ability to properly release resources etc. I have, though, no clear idea how it should be implemented and interfaced to the user.

daokoder commented 9 years ago

I meant something really last-resort - similar to the watchdog concept, but with the ability to properly release resources etc.

I'd also prefer something like this. What I am considering is the ability to raise some kind of deadlock exceptions in the tasklets that are detected to be in deadlock states.

Night-walker commented 9 years ago

How should it behave in the following case(s)?

First, it doesn't make sense to call register() within a loop. It should be called once per task. Then, closing the channel essentially becomes redundant.

Now, the cases. 1 -- should not send/receive since the consumers are missing. 2 -- OK; the spawned task does not inherit the parent task's role.

Night-walker commented 9 years ago

What I am considering is the ability to raise some kind of deadlock exceptions in the tasklets that are detected to be in deadlock states.

The problem is, again, that you can't easily detect deadlocked tasks without knowing the dependencies between them. I'm trying to find a way to make these dependencies transparent to the VM. Channels currently seem to be the best bet.

daokoder commented 9 years ago

The problem is, again, that you can't easily detect deadlocked tasks without knowing the dependencies between them.

Actually, you can detect them, because a deadlock will make all tasklets wait indefinitely.

Night-walker commented 9 years ago

Actually, you can detect them, because a deadlock will make all tasklets wait indefinitely.

It may not make all tasklets wait. Just some of them. And it would still be a deadlock.

daokoder commented 9 years ago

It may not make all tasklets wait. Just some of them. And it would still be a deadlock.

Right, I was referring to deadlock detection at the end of a program, but that may not cover programs that are supposed to run constantly, like web servers etc.

Night-walker commented 9 years ago

Continuing with my line of thought. It could potentially be as simple as

@producer
routine r1(chan: mt::Channel<@T>, ...){
   ... # send to chan
}

@consumer
routine r2(chan: mt::Channel<@T>, ...){
   ... # receive from chan
}

routine run(){
   var chan = mt::Channel<int>(100)

   r1(chan)!!
   r2(chan)
}

dumblob commented 9 years ago

Continuing with my line of thought. It could potentially be as simple as

This looks better, but what about bidirectional channels? I wonder why we can't actually make the channels themselves smart enough to catch exceptions and behave accordingly (prepare the environment/resources for a deadlock detection engine/"watchdog")?

In Dao, we have only two built-in "high-level" ways for tasks/threads to communicate (pass a certain value) - futures and channels. So we could make both smart enough to prepare the environment for the deadlock engine. Or am I missing something?

Night-walker commented 9 years ago

This looks better, but what about bidirectional channels?

I doubt using the same channel for sending and receiving in the same task is a good idea. It's simpler to just use a couple of them. Rust provides only half-duplex channels, for example.

I wonder why we can't actually make the channels themselves smart enough to catch exceptions and behave accordingly (prepare the environment/resources for a deadlock detection engine/"watchdog")?

Emm... Not sure how you imagine this.

In Dao, we have only two built-in "high-level" ways for tasks/threads to communicate (pass a certain value) - futures and channels. So we could make both smart enough to prepare the environment for the deadlock engine. Or am I missing something?

Deadlocks with future values are different from those which can occur with channels. In particular, a deadlock doesn't happen if a task awaited via its future value fails. Overall, trying to resolve deadlocks with future values is less important (and less feasible).

Night-walker commented 9 years ago

Overall, I believe that trying to detect deadlocks post factum, by observing tasks and guessing what's happening, is hardly the answer. Formalizing the connections between tasks so that the state of things can be deduced from the given constraints -- that seems promising.

dumblob commented 9 years ago

Formalizing the connections between tasks so that the state of things can be deduced from the given constraints -- that seems promising.

Yes, under the assumption that the state of things is precisely given at compile time (which would forever - think of NeoDao - disallow dynamic creation of channels, mutexes etc.).

Now, what about the following idea, which is pure run-time detection, but is actually pre factum. Let's define a deadlock as threads waiting indefinitely for other threads while both sides use some synchronization principle for the waiting. So, the following is not a deadlock:

load mt
c = mt.Channel<int>()
routine r() {
  while (true) ;
  c.send(5)
}
r()!!
c.receive()  # undetectable deadlock

A concept of a simple (but a bit slow) deadlock detection follows:

  1. there is a global list of pairs (thread, thread) and its mutex
  2. before each channel operation (send/receive), a pair (current_thread, complementary_channel_endpoint_thread) is added to the global list, and a routine checking for cycles is run on the global list; adding a pair and checking for cycles (beware of the complementary operation) is performed between pthread_mutex_lock() and pthread_mutex_unlock().
  3. perform the send/receive operation
  4. pthread_mutex_lock(), remove the added pair from the global list, pthread_mutex_unlock()

If there exists a special graph (edge-)representation structure which allows better than O(n) (Brent's algorithm) time complexity for cycle detection, please use it instead of an array.
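
A minimal Go sketch of steps 2-4 (purely illustrative -- the real check would live inside the VM, and tasklet identities here are just plain ints): a global wait-for edge list guarded by one mutex, with a DFS cycle check before the blocking operation and edge removal afterwards. The single global critical section is also where the overhead concern below comes from:

package main

import (
    "fmt"
    "sync"
)

type edge struct{ from, to int } // "tasklet `from` waits for tasklet `to`"

var (
    mu    sync.Mutex
    waits []edge
)

// hasCycle does a DFS over the current wait-for edges, looking for a path
// that leads back to `start`.
func hasCycle(start int) bool {
    adj := map[int][]int{}
    for _, e := range waits {
        adj[e.from] = append(adj[e.from], e.to)
    }
    seen := map[int]bool{}
    var dfs func(n int) bool
    dfs = func(n int) bool {
        if n == start && len(seen) > 0 {
            return true // came back to where we started: a cycle
        }
        if seen[n] {
            return false
        }
        seen[n] = true
        for _, next := range adj[n] {
            if dfs(next) {
                return true
            }
        }
        return false
    }
    return dfs(start)
}

// beforeWait corresponds to step 2: register the edge and check for a deadlock.
func beforeWait(current, awaited int) error {
    mu.Lock()
    defer mu.Unlock()
    waits = append(waits, edge{current, awaited})
    if hasCycle(current) {
        return fmt.Errorf("deadlock: tasklets %d and %d wait on each other", current, awaited)
    }
    return nil
}

// afterWait corresponds to step 4: the operation finished, drop the edge again.
func afterWait(current, awaited int) {
    mu.Lock()
    defer mu.Unlock()
    for i, e := range waits {
        if e == (edge{current, awaited}) {
            waits = append(waits[:i], waits[i+1:]...)
            break
        }
    }
}

func main() {
    _ = beforeWait(1, 2)          // tasklet 1 starts waiting for tasklet 2
    fmt.Println(beforeWait(2, 1)) // tasklet 2 now waits for 1: the cycle is reported
    afterWait(2, 1)
    afterWait(1, 2)
}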

Night-walker commented 9 years ago

Too much overhead during data transmission/reception -- channels will become inefficient.

Night-walker commented 9 years ago

Yes, under the assumption that the state of things is precisely given at compile time (which would forever - think of NeoDao - disallow dynamic creation of channels, mutexes etc.).

I did not make such an assumption. What I described is also a run-time feature.

dumblob commented 9 years ago

Too much overhead during data transmission/reception -- channels will become inefficient.

But only for a larger number of threads. Actually, we might introduce an option for channel operations to not check for cycles. Such programs would never guarantee deadlock detection, but they would run faster with lots of threads. Btw, according to http://stackoverflow.com/questions/20246417/how-to-detect-if-adding-an-edge-to-a-directed-graph-results-in-a-cycle , with a large number of threads it's highly probable that the graph will be sparse (i.e. split into many small subgraphs), so the linear search/BFS should be very quick.

I did not make such an assumption. What I described is also a run-time feature.

Right. I misunderstood the formalization of the connections by imposing constraints on them.

Night-walker commented 9 years ago

But only for a larger number of threads. Actually, we might introduce an option for channel operations to not check for cycles. Such programs would never guarantee deadlock detection, but they would run faster with lots of threads. Btw, according to http://stackoverflow.com/questions/20246417/how-to-detect-if-adding-an-edge-to-a-directed-graph-results-in-a-cycle , with a large number of threads it's highly probable that the graph will be sparse (i.e. split into many small subgraphs), so the linear search/BFS should be very quick.

Given that this check is supposed to be performed in a global critical section, it may easily become a bottleneck.

dumblob commented 9 years ago

Given that this check is supposed to be performed in a global critical section, it may easily become a bottleneck.

If heavy communication between threads is needed, it's surely a bad design or some erroneous situation. But of course, it might degrade performance down to that of a single-threaded implementation.

Night-walker commented 9 years ago

Problematic cases exposed while considering the approach I proposed:

  • one task does not register itself as a participant in communications

I even revised the approach so that it does not affect transmission/reception at all (the channel is closed if the number of senders or receivers drops to zero) -- but as long as registration of participants is optional, the problem of implicit (non-registered) actors remains. Unless it's solved, this role model cannot be safely applied.

dumblob commented 9 years ago
  • one task does not register itself as a participant in communications

Actually, I thought from the very beginning that this would be enforced (otherwise the whole approach doesn't make sense to me) as the only possible channel use case.

Btw, if anybody is interested, I found two relevant articles dealing with runtime deadlock detection (and prevention) - in H. K. Pyla, S. Varadarajan: Deterministic Dynamic Deadlock Detection and Recovery, benchmarks showed that for a small number of threads (16), the approach I sketched above has a negligible performance hit (a larger number of threads wasn't measured); in E. Koskinen, M. Herlihy: Dreadlocks: Efficient Deadlock Detection, a transitive closure (containing only the relevant part of the global wait-for graph) for each particular spin-lock ("mutex") was used as part of a new spin-lock implementation, and the performance was also satisfactory.

There is a huge amount of research on this topic due to transactional memory, and there are also many emerging (and already existing - e.g. TSX) instructions etc. in hardware CPUs addressing these needs.

Night-walker commented 9 years ago

Actually, I thought from the very beginning that this would be enforced (otherwise the whole approach doesn't make sense to me) as the only possible channel use case.

It cannot be reliably checked at compile time, which makes the use of channels non-obvious and cumbersome. In the end, it may create more problems than it resolves.

Night-walker commented 9 years ago

I think this issue in its current form can be closed. I jumped to conclusions based on false assumptions; the way errors are currently handled in tasklets is arguably more robust. As for deadlocks, I also have the impression that leaving things in their current state is probably better. Anyway, a different issue can be opened for it in case some new ideas emerge.