davidacm / NVDA-IBMTTS-Driver

This project is aimed at developing and maintaining the NVDA IBMTTS driver. IBMTTS is a synthesizer similar to Eloquence. Please send your ideas and contributions here!
GNU General Public License v2.0

Fixing timing of index events without causing crackling #96

Open mltony opened 1 year ago

mltony commented 1 year ago

This is my second attempt at fixing indexing without regressing sound quality. It supersedes my previous PR #94, and would fix #22 if accepted.

My investigation

I dug into the crackling issue, and here is my best understanding of why it happens.

  1. IBM TTS provides the driver a buffer that can hold 3300 samples.
  2. Eloquence outputs sound in chunks of that size.
  3. For the string "Rate 0" spoken at rate 0, the observed chunk sizes are 3300, 3300, 3300, 11. Notice the short last chunk.
  4. My first attempt (PR #94) sends those chunks to the NVDA player as they arrive. The short chunk is sent separately, without being combined with any other chunk.
  5. A Google search suggested that this is a flaw in the winMM API: it doesn't handle short chunks well, which causes the crackling. Why doesn't crackling happen in the current master version? After consuming the chunk of size 11, the current version doesn't send that short chunk to the player right away; it buffers it and eventually combines it with the following chunk, which happens to be just silence. The cost is that we never catch the precise moment when the 11-sample chunk finishes playing, which degrades index firing precision. Even though NVDA will inevitably switch from winMM to WASAPI, I thought it was still worth having a better understanding of the crackling, and maybe this will also help in some way with future WASAPI versions.

How this PR fixes indexing without introducing crackling.

Indexing is fixed in a similar way to my previous PR #94: by setting an onDone callback on the player.feed() call. If more than one index is sent together, the onDone callback triggers all of them, so no index event is lost.

At a high level, the crackling problem is fixed by delayed flushing of audio chunks. When we come across a chunk of size 11, as in my example above, we find ourselves in a situation where the previous chunk hasn't been flushed yet (in other words, player.feed() hasn't been called on it). This allows us to combine the chunk of size 11 with the previous chunk of size 3300 and flush both of them in a single call to player.feed().

More precisely, when we receive a new chunk of audio in the eciCallback() function, we check how much audio has already been buffered (in audioStream) and how much audio is coming in with the new chunk (variable lp). We only flush audioStream if both values are at least samples*2, or in other words if both are at least as big as the buffer we use to communicate with Eloquence. If this condition is satisfied, we flush the old contents of audioStream (which is long enough not to cause crackling), then truncate audioStream and buffer the incoming chunk in it. One more case where we end up flushing is when an index event has been recorded (i.e. len(indexes) > 0): in that case we have to flush to respect accurate index timing. This compound condition is written in _ibmeci.py lines 369...375, and I will refer to it as "the condition" below.

To illustrate this further, consider the "rate 0" example I mentioned above, where Eloquence generates the audio in four chunks of sizes 3300, 3300, 3300 and 11.
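A minimal sketch of the delayed-flush idea (the names `audioStream`, `lp`, `samples` and `indexes` come from the description above, but this mock is a simplification, not the actual `_ibmeci.py` code; the real condition lives at lines 369...375 there):

```python
import io

SAMPLES = 3300           # assumed ECI buffer size in samples, per the discussion
MIN_FLUSH = SAMPLES * 2  # 16-bit audio, so one full buffer is samples * 2 bytes

audio_stream = io.BytesIO()  # stands in for the driver's audioStream buffer
indexes = []                 # pending index events

def feed(data, on_done=None):
    # Placeholder for nvwave.WavePlayer.feed(); prints instead of playing.
    print(f"feed {len(data)} bytes")
    if on_done:
        on_done()

def on_new_chunk(chunk):
    """Sketch of the delayed-flush logic performed when a chunk arrives."""
    buffered = audio_stream.tell()
    # Flush the previously buffered audio only when both the buffered data and
    # the incoming chunk are at least one full ECI buffer long, or when index
    # events are pending and must fire at an accurate position.
    if (buffered >= MIN_FLUSH and len(chunk) >= MIN_FLUSH) or indexes:
        feed(audio_stream.getvalue())
        audio_stream.seek(0)
        audio_stream.truncate()
    audio_stream.write(chunk)
```

With the 3300/3300/3300/11-sample chunk sequence above (6600/6600/6600/22 bytes), the final 22-byte chunk fails the size check, so it stays buffered and gets flushed together with the preceding full chunk rather than being sent to the player on its own.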

Alternatives considered

The algorithm presented above is more complicated and harder to reason about. Another approach I considered was playing with the buffer size. However, no matter what the buffer size is, it will always be possible to find an example where the last chunk generated for a phrase is small, so that alone would never solve the crackling problem. The algorithm in this PR therefore seems to be the only solution that effectively prevents sending overly small chunks to the player while also respecting the timing of index events.
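The "no buffer size helps" argument can be checked with a couple of lines: for any buffer size B, a phrase of k*B + 1 samples always ends in a 1-sample chunk (a hypothetical illustration of the reasoning above, not code from the PR):

```python
# For any buffer size B, an utterance of k*B + 1 samples always ends in a
# 1-sample chunk, so tuning the buffer size alone cannot eliminate short
# final chunks.
def last_chunk(total_samples, buffer_size):
    remainder = total_samples % buffer_size
    return remainder if remainder else buffer_size

for b in (1024, 3300, 8192):
    print(b, last_chunk(3 * b + 1, b))  # every buffer size has a 1-sample case
```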

Testing performed

davidacm commented 1 year ago

Hi, this seems like a very good solution!

I was testing it, but this solution adds unwanted pauses; you can test it, for example, by spelling a word. It's also noticeable in other places where punctuation is present, and more noticeable on NVDA's alpha versions.

@Mohamed00

Mohamed00 commented 1 year ago

Yep, still noticeable for me here as well.

mltony commented 1 year ago

I've spent quite some time investigating the longer pauses when spelling. According to my measurements, my version has a 38% slowdown when spelling a word (695 ms vs 502 ms). So far I have found two reasons contributing to the slowdown:

  1. Late index firing accounts for a 10% slowdown. Apparently spelling a word in NVDA relies on CallbackCommand, which relies on indexing. Since in the current master version indexes fire at the start of the utterance, this gives the master version an unfair head start. Most likely nothing can be done about this, since making indexes fire at their precise time is the main goal of this PR.
  2. More frequent use of the player.feed() call. Apparently Eloquence outputs 4 commands for each spelled letter: waveform, index, waveform, index. The master version merges the two waveforms together, while in this PR I have to make two separate calls to player.feed(), and there is no way to work around that. According to my measurements, having two separate calls increases latency by another 11%. I performed this test with a short script manually invoking player.feed(), simulating the patterns of the baseline version and of this PR. In the actual synth code the delay might be even higher, since I rely on the onDone function, which executes in the thread that actually plays the audio, adding some more delay; but I estimate this extra delay to be no more than 1-2%, so the total delay from the more extensive use of player.feed() is no more than 13%. Most likely nothing can be done about this item either, since in order to fire indexes at the correct time we have to catch the moment in between the two waveforms, and with the current WavePlayer framework this can only be done by two separate calls to player.feed() using the onDone argument.

So far I have explained a 23% slowdown, but I am struggling to explain the remaining 15%. I am pretty sure my PR doesn't introduce any artificial pauses in the speech, so at this point my guess is that the remaining 15% is explained either by interference between 1 and 2, or by my measurements of 1 and 2 being imperfect, in which case they might contribute more than I measured.

I have another idea to explore that I would consider rather a dirty hack, so I'm wondering what you'd think of it. As I mentioned before, Eloquence outputs speech for spelling in two waveform buffers per word. I could drop the second buffer and thus reduce the total time it takes to spell a word. My preliminary measurement shows it would only make spelling about 10% faster, so it won't be anywhere close to the master version.

I will try to have another look at this next weekend, but at this point I am out of other good ideas.
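The index-firing pattern described in item 2 can be sketched as follows. `FakePlayer` is a stand-in for NVDA's WavePlayer (which this PR drives via `feed()`'s onDone callback, as described above); a real player would invoke onDone only once the chunk has actually played, so this mock just shows the ordering guarantee:

```python
# Sketch of firing index events via feed()'s onDone callback: each waveform is
# fed separately, and all indexes recorded for that waveform fire together
# once the feed completes, so no index event is lost.
fired = []

class FakePlayer:
    """Stand-in for nvwave.WavePlayer; runs onDone immediately instead of
    waiting for the chunk to finish playing."""
    def feed(self, data, onDone=None):
        if onDone is not None:
            onDone()

def make_on_done(pending_indexes):
    snapshot = list(pending_indexes)  # indexes sent together all fire together
    def on_done():
        fired.extend(snapshot)
    return on_done

player = FakePlayer()
# Pattern Eloquence emits when spelling a letter: waveform, index, waveform, index.
player.feed(b"\x00" * 6600, onDone=make_on_done([1]))
player.feed(b"\x00" * 6600, onDone=make_on_done([2]))
print(fired)  # [1, 2]
```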
davidacm commented 1 year ago

Hi, I would not like to add latency to the synth. In eSpeak the problem when spelling is not noticeable; I will analyze the reasons. I haven't had time to test lately, but I will definitely do it. I spent a lot of time in the past fixing issue #22; each time I succeeded, it caused other issues.

There is an alternative that I'm considering now: an extra setting to choose between the current approach and the accurate indexing method. Users who require precision in the indexes could activate the accurate approach, with a prior warning that unwanted long pauses may occur. Since not all users require index precision, this could be an alternative if a good fix cannot be found.

ultrasound1372 commented 1 year ago

If this introduces latency by buffering 3300 or more samples, then at 11025 Hz that would add a delay of nearly 300 ms. This is unacceptable if it actually happens this way, although I suppose that depends on how much faster than real time ECI actually is. I think we should stop trying to work around flaws in WinMM and look at the WASAPI alternative. Does feeding small chunks also cause crackling? What about the direct adaptation of passing a ctypes pointer to the buffer rather than storing it in a bytes object?
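The 300 ms figure follows directly from the numbers in the thread: one full 3300-sample buffer at the 11025 Hz sample rate is about 299 ms of audio.

```python
# Quick check of the latency estimate above: one full ECI buffer of 3300
# samples at 11025 Hz corresponds to roughly 300 ms of audio.
SAMPLES = 3300
RATE_HZ = 11025
delay_ms = SAMPLES / RATE_HZ * 1000
print(f"{delay_ms:.0f} ms")  # ~299 ms
```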

davidacm commented 1 year ago

The latency is not perceptible except in some cases; it would never reach 300 ms, as that would be very noticeable.

I don't think it's a good idea to focus only on WASAPI; that would mean forgetting all the users who for some reason cannot update to the latest version of NVDA.

ultrasound1372 commented 1 year ago

We'd leave the existing code as a backward compatibility layer, or maybe as an option; since the WASAPI build isn't even out of alpha yet, I agree that we can't just drop all WinMM support. I'm just saying we might require the WASAPI build for accurate indexing.

mltony commented 1 year ago

I think I found why this PR was causing extra pauses when spelling, and I've just pushed a commit fixing it. In the current master branch, the endStringEvent() function calls idlePlayer() with a delay of 300 ms. Apparently, for whatever reason, in the master branch we don't rely on idlePlayer() being called for the speech buffer to actually be played. However, I changed some of the code flow around processing events from the DLL in eciCallback(), which apparently makes us rely on idlePlayer() being called. Hence, this PR was previously adding a 300 ms pause to each letter while spelling. I don't know why this 300 ms pause was added in the first place. So far I have reduced this pause to less than 1 ms and haven't observed any side effects, but please let me know if there are any.

Without the 300 ms pause, spelling is definitely much faster than in the first version of this PR, but still a bit slower than the master branch. This can probably be attributed to the different usage of the NVDA Player API, which I explained in my previous comments. However, the rate of spelling now seems to be on par with eloquence_threshold, so I think this latest commit has reached the highest possible spelling rate. @davidacm, please let me know whether you find this spelling slowdown acceptable, or whether we should still move towards your idea of letting users choose between accurate indexing and fast spelling.
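The delayed-idle mechanism described here can be sketched with a timer (`end_string_event` and `idle_player` below are hypothetical stand-ins for the driver's endStringEvent() and idlePlayer(); only the 300 ms / sub-1 ms delay values come from the comment above):

```python
import threading

idled = threading.Event()

def idle_player():
    # Hypothetical stand-in for the driver's idlePlayer(), which flushes the
    # remaining buffered audio and idles the player at the end of an utterance.
    idled.set()

IDLE_DELAY_OLD = 0.3    # master branch: idlePlayer() scheduled 300 ms later
IDLE_DELAY_NEW = 0.001  # this PR: reduced to under 1 ms

def end_string_event(delay=IDLE_DELAY_NEW):
    # Hypothetical sketch of endStringEvent() scheduling the idle call. With
    # the PR's code flow depending on idlePlayer() to play the buffered tail,
    # the old 300 ms delay became a 300 ms pause per spelled letter.
    threading.Timer(delay, idle_player).start()
```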