proc_exit and traps do not stop thread executing blocking instructions

eloparco commented 1 year ago

With #1889, proc_exit and traps are properly propagated to all the threads in the process. Once a thread traps or calls proc_exit, all the other threads are stopped.

This is not true if the thread is executing a blocking operation (e.g. sleep, atomic.wait). In that case, the blocking operation is not interrupted by the exception propagation process.

Partial discussion here: https://github.com/bytecodealliance/wasm-micro-runtime/pull/1869#discussion_r1066524792

eloparco commented 1 year ago

@wenyongh @xujuntwt95329 @yamt after the initial discussion in https://github.com/bytecodealliance/wasm-micro-runtime/pull/1869#discussion_r1066524792, any suggestion on the possible implementation of such a mechanism? Initially I was thinking of using sigsetjmp/siglongjmp (to resume from here), but then I would need to keep track of sigjmp_bufs for each thread and also I don't know how portable it is for other non-linux platforms. Otherwise, I'm not sure on what we can do in the signal handlers to stop the threads gracefully (ideally we'd like move to the next instruction after sleep/atomic_wait so that the exception flag can be read to stop the execution).

loganek commented 1 year ago

I guess another approach would be to make all operations non-blocking in the host, but I'm not sure how much effort would that require (e.g. for WASI calls). Also, it might be problematic for user-defined native methods, so I think signals might be the way to go.

// edit We just had a discussion with @eloparco about the approach, he'll update the ticket with the notes.

eloparco commented 1 year ago

Two approaches:

make all operations non-blocking in the runtime [-] additional work every time a new blocking function is introduced [-] it doesn't work with user-defined functions [+] it integrates well with current exception handling mechanism progress:
- [x] #1929 (atomic.wait)
- [x] #1951 (poll_oneoff)
through signal handling: when propagating the exception to other threads, a signal is sent to those threads to stop their execution and return to the runtime; it requires the usage of setjmp and longjmp, similar to what is done with HW bound checks. [-] portability problem because of setjmp/longjmp (e.g. not available on Windows) [+] no additional work required if new blocking functions are introduced [+] works well with user-defined functions progress:
- [x] ~~#1946~~
- [x] ~~#1948~~
- [ ] ~~TODO: resource deallocation after exception~~
- [x] #2516

yamt commented 1 year ago

it wouldn't work with user-defined functions

in either approaches, i guess user-defined functions need to be adapted anyway. eg. resource cleanup on termination.

maybe pthread cancellation is another alternative to consider for posix-like platforms?

eloparco commented 1 year ago

in either approaches, i guess user-defined functions need to be adapted anyway. eg. resource cleanup on termination.

Why is that? If we use signals, the thread running the user-defined function would be stopped.

maybe pthread cancellation is another alternative to consider for posix-like platforms?

But then we wouldn't return to the runtime after stopping a (possibly blocking) instruction. It may even be fine for spawn threads, but what if it's the main thread?

loganek commented 1 year ago

I wonder what's the expected behavior of proc_exit? From what I see in the implementations, it just kills the process and doesn't allow for any resource cleanup (unlike C exit, which does a bit more before exiting, e.g. calling atexit callbacks). If that's the case, I don't think we have to worry about resource cleanup?

yamt commented 1 year ago

in either approaches, i guess user-defined functions need to be adapted anyway. eg. resource cleanup on termination.

Why is that? If we use signals, the thread running the user-defined function would be stopped.

for example, if a user-defined function uses malloc(), it somehow needs to catch the signal to free() it.

maybe pthread cancellation is another alternative to consider for posix-like platforms?

But then we wouldn't return to the runtime after stopping a (possibly blocking) instruction. It may even be fine for spawn threads, but what if it's the main thread?

sure. probably you're right it's even trickier than signals to use for this purpose.

eloparco commented 1 year ago

for example, if a user-defined function uses malloc(), it somehow needs to catch the signal to free() it.

how would you address that? not sure if there's something we can do once the user-defined function is interrupted

yamt commented 1 year ago

for example, if a user-defined function uses malloc(), it somehow needs to catch the signal to free() it.

how would you address that? not sure if there's something we can do once the user-defined function is interrupted

for example, we can make the user-defined functions deal with interruption by themselves.

loganek commented 1 year ago

for example, if a user-defined function uses malloc(), it somehow needs to catch the signal to free() it.

how would you address that? not sure if there's something we can do once the user-defined function is interrupted

for example, we can make the user-defined functions deal with interruption by themselves.

I wonder if we actually have to provide that capability; whereas C allows registering callback for exit() (using atexit), I don't think there's equivalent for traps in either posix or C. If really needed, we can provide atexit-like functionality for both traps and proc_exit, for native user-defined functions, but I don't know how critical it is to have it right now.

yamt commented 1 year ago

for example, if a user-defined function uses malloc(), it somehow needs to catch the signal to free() it.

how would you address that? not sure if there's something we can do once the user-defined function is interrupted

for example, we can make the user-defined functions deal with interruption by themselves.

I wonder if we actually have to provide that capability; whereas C allows registering callback for exit() (using atexit), I don't think there's equivalent for traps in either posix or C. If really needed, we can provide atexit-like functionality for both traps and proc_exit, for native user-defined functions, but I don't know how critical it is to have it right now.

for embedders like iwasm command, maybe it's enough to call host exit() and let the host os terminate threads.

on the other hand, wamr, as a library, should provide a way to clean up associated resources. termination of wasm instances doesn't necessarily involve termination of the host process.

loganek commented 1 year ago

for example, if a user-defined function uses malloc(), it somehow needs to catch the signal to free() it.

how would you address that? not sure if there's something we can do once the user-defined function is interrupted

for example, we can make the user-defined functions deal with interruption by themselves.

I wonder if we actually have to provide that capability; whereas C allows registering callback for exit() (using atexit), I don't think there's equivalent for traps in either posix or C. If really needed, we can provide atexit-like functionality for both traps and proc_exit, for native user-defined functions, but I don't know how critical it is to have it right now.

for embedders like iwasm command, maybe it's enough to call host exit() and let the host os terminate threads.

on the other hand, wamr, as a library, should provide a way to clean up associated resources. termination of wasm instances doesn't necessarily involve termination of the host process.

I didn't mean to call exit() and close the host. What I meant is that maybe we don't have to worry about user-defined functions, and we can use signals and interruptions. Embedders, if needed, can handle resource cleanup within the process after the runtime finalizes.

yamt commented 1 year ago

for example, if a user-defined function uses malloc(), it somehow needs to catch the signal to free() it.

how would you address that? not sure if there's something we can do once the user-defined function is interrupted

for example, we can make the user-defined functions deal with interruption by themselves.

I wonder if we actually have to provide that capability; whereas C allows registering callback for exit() (using atexit), I don't think there's equivalent for traps in either posix or C. If really needed, we can provide atexit-like functionality for both traps and proc_exit, for native user-defined functions, but I don't know how critical it is to have it right now.

for embedders like iwasm command, maybe it's enough to call host exit() and let the host os terminate threads. on the other hand, wamr, as a library, should provide a way to clean up associated resources. termination of wasm instances doesn't necessarily involve termination of the host process.

I didn't mean to call exit() and close the host. What I meant is that maybe we don't have to worry about user-defined functions, and we can use signals and interruptions. Embedders, if needed, can handle resource cleanup within the process after the runtime finalizes.

it's certainly possible to code host functions that way.

however, it isn't how we code host functions right now. (including the ones within wamr tree like wasi) so, most of host functions need to be adapted anyway. also, my honest impression is that the approach is too difficult for average programmers around me.

in other words, i prefer "if a host function doesn't implement async termination properly, it can't be async terminated (eg. possibly block forever)" over "if a host function doesn't implement async termination properly, it might cause problems like resource leak on async termination."

loganek commented 1 year ago

In that case I guess this could be a compilation/runtime flag. We worked in parallel on making the wait instruction interruptable: https://github.com/bytecodealliance/wasm-micro-runtime/pull/1929/files

yamt commented 1 year ago

In that case I guess this could be a compilation/runtime flag. We worked in parallel on making the wait instruction interruptable: https://github.com/bytecodealliance/wasm-micro-runtime/pull/1929/files

i guess per host function flag makes more sense than a compilation/runtime flag.

loganek commented 1 year ago

i guess per host function flag makes more sense than a compilation/runtime flag.

What do you mean by host function flag?

yamt commented 1 year ago

i guess per host function flag makes more sense than a compilation/runtime flag.

What do you mean by host function flag?

eg. add a member to WASMFunctionImport to specify if it's safe to terminate the function that way.

loganek commented 1 year ago

From https://github.com/bytecodealliance/wasm-micro-runtime/pull/1869#discussion_r1067624306:

for another runtime i made all i/o non-blocking to handle thread termination requests. but i don't think it's an appropriate approach for WAMR.

I wonder if that's something we could actually implement that at least for the poll_oneof() syscall - that function is being used for things like sleep in wasi-libc so even though not all blocking operations are covered, we can already satisfy a lot of usecases for platforms where signal-based implementation is not available. @yamt @wenyongh I'd like to know your thoughts and concerns about that.

yamt commented 1 year ago

From #1869 (comment):

for another runtime i made all i/o non-blocking to handle thread termination requests. but i don't think it's an appropriate approach for WAMR.

I wonder if that's something we could actually implement that at least for the poll_oneof() syscall - that function is being used for things like sleep in wasi-libc so even though not all blocking operations are covered, we can already satisfy a lot of usecases for platforms where signal-based implementation is not available. @yamt @wenyongh I'd like to know your thoughts and concerns about that.

do you mean to make these sleep-like functionalities wake up periodically internally so that it can check the extra termination conditions?

i suppose it's the only choice for platforms where signal-like functionalities are not available. (that's why i did it that way for another runtime, where one of its main target doesn't have signals.)

loganek commented 1 year ago

do you mean to make these sleep-like functionalities wake up periodically internally so that it can check the extra termination conditions?

Yes, I was a bit confused when you said it's not approperiate approach for WAMR, but looks like it was just a misunderstanding, and we're on the same page. We'll implement that in addition to signals then.

wenyongh commented 1 year ago

From #1869 (comment):

for another runtime i made all i/o non-blocking to handle thread termination requests. but i don't think it's an appropriate approach for WAMR.

I wonder if that's something we could actually implement that at least for the poll_oneof() syscall - that function is being used for things like sleep in wasi-libc so even though not all blocking operations are covered, we can already satisfy a lot of usecases for platforms where signal-based implementation is not available. @yamt @wenyongh I'd like to know your thoughts and concerns about that.

do you mean to make these sleep-like functionalities wake up periodically internally so that it can check the extra termination conditions?

i suppose it's the only choice for platforms where signal-like functionalities are not available. (that's why i did it that way for another runtime, where one of its main target doesn't have signals.)

Sounds reasonable, for libc-wasi and internal code, we might change the implementation.

But for user developed native library, it makes me a little confused: should runtime interrupt the thread running into native API by signal-like functionality, will it be not so friendly to developer? Should the developer be responsible for his behaviors? Or we just make this an option, well document it, and let developer choose whether to enable it or not?

yamt commented 1 year ago

do you mean to make these sleep-like functionalities wake up periodically internally so that it can check the extra termination conditions?

Yes, I was a bit confused when you said it's not approperiate approach for WAMR, but looks like it was just a misunderstanding, and we're on the same page. We'll implement that in addition to signals then.

i said so about making "all i/o non-blocking". but what you are talking about is something less extreme, right?

yamt commented 1 year ago

From #1869 (comment):

for another runtime i made all i/o non-blocking to handle thread termination requests. but i don't think it's an appropriate approach for WAMR.

I wonder if that's something we could actually implement that at least for the poll_oneof() syscall - that function is being used for things like sleep in wasi-libc so even though not all blocking operations are covered, we can already satisfy a lot of usecases for platforms where signal-based implementation is not available. @yamt @wenyongh I'd like to know your thoughts and concerns about that.

do you mean to make these sleep-like functionalities wake up periodically internally so that it can check the extra termination conditions? i suppose it's the only choice for platforms where signal-like functionalities are not available. (that's why i did it that way for another runtime, where one of its main target doesn't have signals.)

Sounds reasonable, for libc-wasi and internal code, we might change the implementation.

But for user developed native library, it makes me a little confused: should runtime interrupt the thread running into native API by signal-like functionality, will it be not so friendly to developer? Should the developer be responsible for his behaviors? Or we just make this an option, well document it, and let developer choose whether to enable it or not?

it's reasonable to make it an option.

a developer can make his functions signal-termination-safe and mark it so in the corresponding NativeSymbols. the runtime can terminate them with a signal where it's available.
besides that, a developer can implement graceful termination mechanism like periodic checks in his functions. we can provide an api for this. (eg. should_terminate())
otherwise, a function can't be terminated. (the default behavior)

loganek commented 1 year ago

but what you are talking about is something less extreme, right?

Yes, I'm only talking about the poll_oneof function (at least for now)

Or we just make this an option, well document it, and let developer choose whether to enable it or not?

Yeah, I think we discussed it a few comments above; make it optional sounds reasonable to me.

yamt commented 1 year ago

is anyone still working on this?

eloparco commented 1 year ago

is anyone still working on this?

Not currently, on my side at least. We started the effort on this branch bytecodealliance:dev/interrupt_block_insn. So, at the moment, the new tests from the proposal wouldn't pass.

yamt commented 1 year ago

is anyone still working on this?

Not currently, on my side at least. We started the effort on this branch bytecodealliance:dev/interrupt_block_insn. So, at the moment, the new tests from the proposal wouldn't pass.

ok. thank you for an update. are you (or someone) still willing/planning to work on this?

eloparco commented 1 year ago

are you (or someone) still willing/planning to work on this?

Unfortunately, not in the short term, I don't have any time scheduled to work on it. Feel free to pick it up if you want.

bytecodealliance / wasm-micro-runtime

proc_exit and traps do not stop thread executing blocking instructions #1910