DirectSound full duplex produces a single glitch shortly after stream starts

dechamps commented 1 year ago

The DirectSound Host API is glitchy when used in full duplex mode (i.e. when both input and output devices are used in a single PortAudio stream) when using reasonable parameters (default frames per buffer, default high suggested latency).

Half duplex is fine.

Example output from paloopback -r48000 -s1024 --inputLatency 240 --outputLatency 240 -w:

************ Mode = Two Streams (Half Duplex) ************
|-   requested  -|-  stream info latency  -|- measured ------------------------------
|-sRate-|-fr/buf-|- in    - out   - total -|- over/under/calls for in, out -|- frm/buf -|-latency-|- channel results -
| 48000 |   1024 |   21.33  240.00  261.33 |    0/   0/ 469,    0/   0/ 470 | 1024-1024 |  297.29 | OK - #errs = 0
| 48000 |   1024 |   21.33  240.00  261.33 |    0/   0/ 469,    0/   0/ 470 | 1024-1024 |  298.04 | OK - #errs = 0
| 48000 |   1024 |   21.33  240.00  261.33 |    0/   0/ 469,    0/   0/ 470 | 1024-1024 |  298.52 | OK - #errs = 0
| 48000 |   1024 |   21.33  240.00  261.33 |    0/   0/ 469,    0/   0/ 470 | 1024-1024 |  295.52 | OK - #errs = 0
| 48000 |   1024 |   21.33  240.00  261.33 |    0/   0/ 469,    0/   0/ 470 | 1024-1024 |  296.77 | OK - #errs = 0

************ Mode = One Stream (Full Duplex) ************
|-   requested  -|-  stream info latency  -|- measured ------------------------------
|-sRate-|-fr/buf-|- in    - out   - total -|- over/under/calls for in, out -|- frm/buf -|-latency-|- channel results -
| 48000 |   1024 |   21.33  261.33  282.67 |    0/   0/ 469,    0/   0/ 469 | 1024-1024 |  295.40 | channel 0 POP 0.460 at 15220, ".\paloopback_0.wav", channel 1 POP 0.453 at 15220, ".\paloopback_1.wav",  - #errs = 0
| 48000 |   1024 |   21.33  261.33  282.67 |    0/   0/ 469,    0/   0/ 469 | 1024-1024 |  299.40 | channel 0 POP 0.460 at 15410, ".\paloopback_2.wav", channel 1 POP 0.453 at 15410, ".\paloopback_3.wav",  - #errs = 1
| 48000 |   1024 |   21.33  261.33  282.67 |    0/   0/ 469,    0/   0/ 469 | 1024-1024 |  299.44 | channel 0 POP 0.460 at 15412, ".\paloopback_4.wav", channel 1 POP 0.453 at 15412, ".\paloopback_5.wav",  - #errs = 2
| 48000 |   1024 |   21.33  261.33  282.67 |    0/   0/ 469,    0/   0/ 469 | 1024-1024 |  298.60 | channel 0 POP 0.460 at 15373, ".\paloopback_6.wav", channel 1 POP 0.453 at 15373, ".\paloopback_7.wav",  - #errs = 3
| 48000 |   1024 |   21.33  261.33  282.67 |    0/   0/ 469,    0/   0/ 469 | 1024-1024 |  296.06 | channel 0 POP 0.460 at 15252, ".\paloopback_8.wav", channel 1 POP 0.453 at 15252, ".\paloopback_9.wav",  - #errs = 4

A quick look at the recorded test signal confirms the issue:

MME passes the same test with flying colors.

PortAudio version: latest master (4bccdfbb68a509e2a96d916046d911959f7d935c)
Windows 11 22H2 22621.1265
Loopback audio device: Virtual Audio Cable 4.67.0.12554

dechamps commented 1 year ago

Actually, closer investigation of the recordings reveal that there is only a single glitch in each stream, happening exactly 20 ms after the start of the test signal. After that single glitch, a 10-second stream appears to be glitch-free:

This likely explains why no-one is complaining about this - a single glitch at the beginning of the stream is arguably benign, although this is still a bug that should be fixed (if only so that we can get proper results from paloopback).

dechamps commented 1 year ago

I was wondering if this could be related to the use of DirectSoundFullDuplexCreate8, but this doesn't seem to be the case: the behaviour is exactly the same if I undef PAWIN_USE_DIRECTSOUNDFULLDUPLEXCREATE.

dechamps commented 1 year ago

If I record the audio going through the loopback device using Audacity at the same time as the paloopback test is running, the glitch appears in the separately recorded audio. This indicates that the glitch is produced in the PortAudio → device (playback) direction.

RossBencina commented 1 year ago

Hi Etienne, please update your issue report to specify Windows version and hardware setup.

Can you consistently reproduce this on multiple systems with differing audio hardware?

I have never observed such behavior myself, nor do I recall anyone report in over the long history of PortAudio (DirectSound was one of the first host API implementations). This makes me think that it is possibly a regression in recent Windows versions, or with specific drivers.

dechamps commented 1 year ago

please update your issue report to specify Windows version and hardware setup.

I already did, see bottom of my first post.

Can you consistently reproduce this on multiple systems with differing audio hardware?

I only tried Virtual Audio Cable so far. I guess I could try other VAC KS modes or some real hardware. The fact that both MME and DS half duplex are working fine would seem to point away from the hardware, though.

I have never observed such behavior myself, nor do I recall anyone report in over the long history of PortAudio

It's possible it was always there but that no-one noticed it - a glitch happening in the first 20 ms could be quite hard to notice in real usage, especially given that the first 20 ms of streamed audio would probably be silence in many scenarios. Also an half-duplex application would not trigger it. I only noticed this problem when I saw paloopback fail.

RossBencina commented 1 year ago

I'm not saying it's not a bug, just that from my point of view PortAudio is not the most likely culprit.

Please at least test with real hardware.

dechamps commented 1 year ago

I'm still investigating but so far I am getting very suspicious there is a problem in the interaction between the DS host API code and the AdaptingProcess() buffer processor code.

https://github.com/PortAudio/portaudio/blob/d07842c1b021a9a05d50da88272e3f0930ecfbe3/src/hostapi/dsound/pa_win_ds.c#L2617-L2626

The return value of PaUtil_EndBufferProcessing() is somewhat vaguely documented as the "number of frames processed". The DS host code implicitly assumes that this is the number of frames that were read from the host input buffer, and also the number of frames that were written to the host output buffer. The DS host code increments the input and output buffer offsets accordingly, so that on the next TimeSlice() call we will process the next portions of both buffers.

The buffer processor business logic that is used here, specifically AdaptingProcess(), ultimately returns as the "number of frames processed" the number of input frames that were consumed. With regard to the DS caller code, this works as long as the number of output frames that were written is always exactly equal to the number of input frames read.

The problem is, I believe I have found a scenario in which that isn't the case. Here's how things actually unfold in reality, assuming framesPerBuffer 1024 and 48 kHz sample rate:

On initialization, the buffer processor initializes framesInTempOutputBuffer to 1024: https://github.com/PortAudio/portaudio/blob/d07842c1b021a9a05d50da88272e3f0930ecfbe3/src/common/pa_process.c#L182-L186
Initially the DS cursors are at 0 and the DS host code polls repeatedly, waiting for cursors to move.
Eventually, the DS read position suddenly moves from 0 to 1500 frames. Meanwhile, the play cursor is at 960 frames.
The DS code sees that 960 frames can be processed on both sides, and moves to lock the buffers and call the buffer processor.
In CopyTempOutputBuffersToHostOutputBuffers() the buffer processor transfers 960 frames of silence to the host output buffer. framesInTempOutputBuffer is now 1024-960=64.
AdaptingProcess() does not read any input frames and does not call the stream callback because there are not enough frames available (960 < 1024).
AdaptingProcess() returns 0 "frames processed", since it did not process any input frames. (It did write some silence into the output host buffer without telling the host code, but that's benign - it just overwrote silence with silence at this point.)
Since PaUtil_EndBufferProcessing() returned 0, the DS code does not move the buffer offsets and everything stays as it is.
The DS code is back to polling. On a subsequent poll, the read position is still 1500 frames, but the play cursor is now at 1440 frames.
The DS code sees 1440 frames are available. It locks the buffers and calls the buffer processor again.
In CopyTempOutputBuffersToHostOutputBuffers() the buffer processor transfers the remaining 64 frames of silence to the beginning of the host output buffer (remember the DS buffer offsets haven't moved), overwriting some of the silence it already wrote in step 5. (This is starting to feel wrong, and things only go downhill from there.) framesInTempOutputBuffer is now 0, and hostOutputChannels[i].data pointers are incremented by 64 frames.
AdaptingProcess() sees that more than 1024 input frames are available. It calls the stream callback on the first 1024 input frames. framesInTempOutputBuffer is now 1024.
AdaptingProcess() calls CopyTempOutputBuffersToHostOutputBuffers() again. CopyTempOutputBuffersToHostOutputBuffers() copies the 1024 frames to the host output buffer after the 64 frames of silence it wrote in step 11 (this is because in step 11 the internal buffer processor hostOutputChannels[i].data pointers were incremented).
AdaptingProcess() returns the number of input frames processed, which is 1024.
The DS code moves the buffer offsets accordingly. It increments the input buffer offset by 1024, which is correct. It also increments the output buffer offset by 1024, which is incorrect because in reality the buffer processor wrote 64+1024=1088 frames into that buffer!
The DS code goes back to polling.
Because the DS write offset is wrong, the next buffer processor call will write the next frames at DS offset 1024, instead of resuming from the correct 1088 offset. This results in the last 64 frames of the previous buffer being trampled on. Hilarity ensues.

I believe this explains the symptoms. I believe the reason why this only happens at the end of the first user buffer likely has to do with the fact that this initial priming phase where framesInTempOutputBuffer is initialized to 1024 in step 1 is (presumably) the only scenario where output frames are written without any input frames being read. Because this has to do with initialization/priming, this would explain why there is only a single glitch and the rest of the stream is fine.

Next step is to come up with a fix. For that I need to understand how that code was originally intended to work. This buffer adaptation stuff is proving to be quite headache-inducing!

PortAudio / portaudio

DirectSound full duplex produces a single glitch shortly after stream starts #770