Pa_StopStream() should not use a finite timeout to join threads

Currently, some, but not all, PortAudio Host APIs will use a finite timeout when joining the audio streaming thread in Pa_StopStream(). Examples include:

MME:

https://github.com/PortAudio/portaudio/blob/18a606e1f928852bfc29639d9539ae74d37b5dee/src/hostapi/wmme/pa_win_wmme.c#L3453-L3460

DirectSound:

https://github.com/PortAudio/portaudio/blob/18a606e1f928852bfc29639d9539ae74d37b5dee/src/hostapi/dsound/pa_win_ds.c#L3092

In contrast, WASAPI uses an unbounded wait:

https://github.com/PortAudio/portaudio/blob/18a606e1f928852bfc29639d9539ae74d37b5dee/src/hostapi/wasapi/pa_win_wasapi.c#L4517-L4520

And so does WDM-KS:

https://github.com/PortAudio/portaudio/blob/18a606e1f928852bfc29639d9539ae74d37b5dee/src/hostapi/wdmks/pa_win_wdmks.c#L6289-L6291

(I have not looked into non-Windows host APIs.)

I believe it is incorrect to use a finite timeout, and this can lead to race conditions and undefined behavior. MME and DirectSound should be changed to wait forever.

The reason why I'm advocating this is because there is no telling how long the streaming thread may be blocked for. If the streaming thread suddenly resumes execution after Pa_StopStream() gave up, then the resources that the streaming thread is using may have been freed already - in fact, the entire PortAudio DLL may have been freed, as everything has been cleaned up as far as the caller code is concerned. This leads to hilarity in the form of undefined behavior.

Now, is it likely that the streaming thread will be blocked for that long? No, but "unlikely" is not the same as "impossible", even in the absence of any deadlock/bugs. A non-real-time OS is allowed to suspend the execution of a thread indefinitely. For example, this could happen if the machine is overloaded, if the relevant memory pages have been swapped out and are taking some time to swap back in, if the process is suspended, or even if the computer is going to standby/sleep. In all these cases, the bounded timeout runs the risk of triggering undefined behavior. In my opinion this is not reasonable.

I presume that the reason why this timeout was introduced is to protect against the case where the streaming thread may be deadlocked. I don't think this makes sense because this is not something that is supposed to occur in a valid program. If the streaming thread deadlocks, then things are already going off the rails and there is no point in trying to "cover up" the problem - in fact, this is doing the developer a disservice because it only serves to hide bugs. This is precisely what happened in dechamps/FlexASIO#235, where PortAudio DirectSound ended up "hiding" a deadlock issue in user code, but PortAudio WASAPI made it obvious by deadlocking the entire app (as it should!). The behavior of waiting with a timeout and returning an error would be reasonable if the failure mode was recoverable, but strictly speaking it isn't, because (1) the code cannot know if the streaming thread is truly deadlocked and (2) even if it is, the streaming thread will linger around forever and leak various resources anyway.

For these reasons, I believe the correct behavior is to wait forever.

The counter argument is that if the stream freezes (as has been reported in multiple recent issues and PRs) then an infinite timeout results in your application locking up and your user having no way to save their work. (I recall we merged patches for bounded timeouts in JACK and/or ALSA recently).

My position is that bounded timeouts should be used by all host APIs under all conditions. If a stream fails to stop (or close due to a timeout then it should be put into an permanently failed state so that the worst case is that it stops working and (temporarily) leaks resources, but never causes a complete lock-up of the client.

The counter argument is that if the stream freezes (as has been reported in multiple recent issues and PRs) then an infinite timeout results in your application locking up and your user having no way to save their work.

Yes, but these timeouts are not the right fix for this.

Using a timeout when joining a thread is not a proper fix. It's papering over the problem and creating new ones in the process (specifically, the thread could suddenly resume execution at any point, and if it does it is likely to trigger undefined behavior because the state it was using is gone - in fact, the entire PortAudio DLL could be gone).

The correct fix is to actually figure out why things are getting stuck and fix the root cause. But it's unlikely anyone will get the opportunity to do that if PortAudio keeps hiding underlying issues like this.

My position is that bounded timeouts should be used by all host APIs under all conditions.

If an Host API has to use a timeout (which, again, I don't think it should unless all other avenues have been exhausted), then the only safe way to do it is to apply the timeout inside the streaming thread on the particular system audio API that's blocking it (e.g. when the audio streaming thread is waiting for some event to be signaled). That I'd be fine with, because that can usually be done safely without introducing potential undefined behavior, contrary to the "well I don't know what's going on with that thread, so YOLO ¯\(ツ)/¯" approach.

Ironically, from a cursory look at the code, the DirectSound Host API already seems to do that, because the only blocking call made from the streaming thread is using a timeout:

https://github.com/PortAudio/portaudio/blob/18a606e1f928852bfc29639d9539ae74d37b5dee/src/hostapi/dsound/pa_win_ds.c#L2782-L2784

Which means that even if DirectSound itself is stuck (i.e. the playback/capture cursors don't advance), the streaming thread will still be able to notice that it's being requested to stop and cleanly exit. Which means the timeout on the thread join truly is pointless - it's just opening potential for undefined behavior for no benefit.

Same story for MME it seems:

https://github.com/PortAudio/portaudio/blob/18a606e1f928852bfc29639d9539ae74d37b5dee/src/hostapi/wmme/pa_win_wmme.c#L2848

So, for MME and DirectSound, the streaming thread is already designed in such a way that it cannot get stuck. The timeout on the thread join in Pa_StopStream() is therefore unnecessary and only serves to hide problems (e.g. the user callback getting stuck, or an actual PortAudio bug, none of which are things PortAudio should try to hide), while at the same time opening potential for undefined behavior if the stuck thread suddenly decides to resume.

I recall we merged patches for bounded timeouts in JACK and/or ALSA recently

I'm not familiar with JACK or ALSA code, but if these timeouts are implemented correctly (e.g. streaming thread code timing out waiting for some audio event and then cleanly cleaning up if it's exceeded), then I'd be fine with it (with the caveat that I don't think we should hide deadlocks until we have at least tried to root cause them). Using a timeout on a thread join and then just moving on if it's exceeded is not a correctly implemented timeout.

The correct fix is to actually figure out why things are getting stuck and fix the root cause.

Yes, that would be the ideal. We should track down the root cause. And a hung thread is easier to detect when you have a finite timeout and report an error.

Using a timeout on a thread join and then just moving on if it's exceeded is not a correctly implemented timeout.

I agree. Timeouts should not be ignored. When a timeout is detected it should result in an error paTimedOut being returned to the application. That is generally doable for Pa_* functions. But it's hard when the timeout occurs in the callback thread.

@dechamps You're jumping to a lot of conclusions here. Could you please at least try to assume good faith and see where I'm coming from?

Here is the WMME PR that I was talking about:

https://github.com/PortAudio/portaudio/pull/889

If a stream freezes due to operating system or driver behavior there is no root cause analysis possible. The best that can be done is log an error and implement a best effort error recovery. In my opinion, a timeout is better than a hung application.

I agree that if there is a PortAudio bug, it should be fixed. I am not for a moment suggesting that PortAudio should paper over bugs in its own code. (How it handles a user callback that never returns is a separate question, with its own discussion). But I am assuming that PortAudio has no bugs relevant to this issue and the hangs are caused by the OS ceasing to wake PortAudio -- this is a real thing that has been observed.

the thread could suddenly resume execution at any point, and if it does it is likely to trigger undefined behavior because the state it was using is gone - in fact, the entire PortAudio DLL could be gone).

I agree that timeouts need to be handled sensitively. If you read my response again you will see that I suggested putting the stream into an error state -- that would include never deleting associated storage.

Timeouts should not be ignored. When a timeout is detected it should result in an error paTimedOut being returned to the application.

I agree that timeouts should not be ignored. They should be handled in some "correct" manner. Entering a safe "failed" (or "error" or "detached" or "zombie") state is one option.

I don't think it's always as simple as returning an error though. As a rule, any function which drives a state change to quiescent state (equivalent of a destructor in C++) should always succeed to effect the state transition (in C++ this is the "destructors don't throw exceptions" rule). That would include Pa_StopStream and Pa_CloseStream. If you violate this rule you are essentially defining a code path where the stream can never be stopped or closed. Maybe you prefer that to my preferred persistent "failed" state?

As a rule, any function which drives a state change to quiescent state should always succeed to effect the state transition

I like that rule. But if stopping the stream involves join()ing a thread and the thread is hung then what can you do? You cannot really "succeed".

I think PA should clean up as best it can, set the state to STOPPED and then return paTimedOut. Then the app knows that something is hosed and can warn the user to save their work and maybe restart the app or the device.

Are you suggesting that Pa_StopStream() should return paNoError when a join() times out?

Maybe you prefer that to my preferred persistent "failed" state?

Is that the state where you do not delete the structures that might be in use by the hung threads? I think that is fine for the reasons you and Etienne described. It is better to leak a little memory than to lock up and the app and require a hard kill.

@dechamps You're jumping to a lot of conclusions here. Could you please at least try to assume good faith and see where I'm coming from?

Here is the WMME PR that I was talking about:

889

I have no problem with the code in that PR. It looks correct to me.

I think we are talking past each other Ross.

To be clear:

I am fine with an Host API using timeouts while waiting for audio events to occur (e.g. waiting for buffer space to become available), if it turns out to be necessary (looks like in the case of MME it is). That can be done without triggering undefined behavior and proper cleanup can still be done even if the timeout is hit.

I am NOT fine with an Host API using timeouts to join its streaming thread, because that cannot be done without potentially triggering undefined behavior.

Ross, the reason why I reacted the way I did is because you answered my original post, which is titled (emphasis mine):

Pa_StopStream() should not use a finite timeout to join threads

And has the following as its first sentence:

Currently, some, but not all, PortAudio Host APIs will use a finite timeout when joining the audio streaming thread in Pa_StopStream(). Examples include:

Followed by code examples that show timeouts when waiting for the streaming thread to exit in some Host APIs.

You answered that with:

The counter argument is […] My position is that bounded timeouts should be used by all host APIs under all conditions.

Which I naturally interpreted as "All Host APIs should use bounded timeouts when joining the streaming thread", because that's the interpretation that seemed to make sense in the context of what I wrote.

I vehemently disagree with that, and so I wrote a thorough rebuttal.

But now, in your last comment you linked to a PR (#889) that is purely concerned with timeouts waiting for events within the streaming thread (which, again, is fine by me) - not the timeout used to join the streaming thread. This leaves me confused and makes me wonder if I misinterpreted your position.

So, do we actually disagree on this or not?

Just so that things are perfectly clear, my fixes for this issue are in #920. You'll note that I'm only touching the timeout used to join the MME and DirectSound streaming threads - I am NOT touching the timeouts used in the streaming thread to wait for buffers to become available, nor do I have any desire to. Do you object to #920 or not?

I agree that timeouts should not be ignored. They should be handled in some "correct" manner. Entering a safe "failed" (or "error" or "detached" or "zombie") state is one option.

I don't think we need to have "detached" or "zombie" states.

There are two kinds of timeouts we are discussing here: thread join timeouts, and timeouts inside the streaming thread waiting for some event to occur.

My position, which hasn't changed, is that thread join timeouts should not exist because they cannot be implemented safely.

I am fine with timeouts waiting for events in the streaming thread, because PortAudio can safely recover from those: indeed, if these occur, then a call to Pa_StopStream() will still cleanly close all handles and will be able to join the stream thread, without any deadlocks, undefined behavior nor any resource leak - just like the happy case.

MME and DS currently use both kinds of timeouts. My proposal is to get rid of the former but keep the latter.

Is that the state where you do not delete the structures that might be in use by the hung threads?

That cannot be done safely in the general case. The application may decide to unload the PortAudio DLL right after Pa_StopStream(), Pa_CloseStream() and Pa_Terminate(). If the streaming thread is still running by that point, all bets are off - the streaming thread is literally running code that is not there anymore, and hilarity will ensue. That is why it is so important that Pa_StopStream() does not return before the streaming thread is joined.

Again, I do not think PortAudio should try to cover up "hung threads". We should prevent threads from getting hung in the first place. MME and DS already have safe timeouts inside the streaming thread to prevent it from getting hung, so all that's left to do is getting rid of the unsafe timeouts on thread join, which are completely unnecessary. I am honestly still confused as to whether Ross agrees with this position or not.

PortAudio / portaudio

Pa_StopStream() should not use a finite timeout to join threads #919

889