your-diary commented 1 year ago

Motivation

Currently, MAX_SEGMENT_SIZE is hard-coded in Speech class. The problem is the default value 100 is too small in some situations (related comment: #21).

This PR adds -s/--max-segment-size flag to let users customize the value of Speech.MAX_SEGMENT_SIZE.

What's Changed

Removed Speech.MAX_SEGMENT_SIZE class field.
Instead, Speech class now has the max_segment_size instance field and the constructor receives it. (Because the signature of the constructor is now def __init__(self, text, lang, max_segment_size = 100):, Python API is backward compatible.)
Speech.splitText() static method now additionally receives max_segment_size (because it cannot access self).
Added -s/--max-segment-size <positive_int> flag to specify the value of Speech.max_segment_size. The default value is 100 (for backward compatibility).
Added check_positive() function which validates the argument of -s/--max-segment-size flag. Also added its unit tests.
Added a relatively long comment to SpeechSegment.play(). Logic isn't changed at all. This is because, in my environment (M1 mac), the existing unit tests didn't pass.

desbma commented 1 year ago

Before I consider merging this, you need to convince me why this is useful.

Currently, MAX_SEGMENT_SIZE is hard-coded in Speech class. The problem is the default value 100 is too small in some situations (related comment: #21).

This is quite vague. A low maximum segment size increases cache hit rate, and if I remember correctly, higher segment size can cause errors from the Google API.

What is your precise use case that requires a varying segment size?

your-diary commented 1 year ago

I'm making a Siri-like (or Alexa-like) AI assistant using OpenAI API (aka ChatGPT) and GoogleSpeech. In my implementation, each response to users is at most 200 characters. So, if it exceeds 100 characters, speech generated by GoogleSpeech sounds very strange because

It contains a gap every 100 characters.
Even if I somehow remove the gap, accent and intonation don't sound as natural as the speech with higher MAX_SEGMENT_SIZE. (It is because English word is pronounced differently depending on whether it is located at the beginning, middle or end of a sentence and, in the current implementation, 1st, 101st, 201st, ... words are thought to be at the beginning of sentences.)

And, in my use case, this doesn't matter almost at all, because every AI-generated response differs from the other responses:

A low maximum segment size increases cache hit rate

And finally,

higher segment size can cause errors from the Google API

, but as far as I tested with MAX_SEGMENT_SIZE == 200, error rate wasn't increased. Actually, I'm everyday using my fork with MAX_SEGMENT_SIZE == 200 in production environment and no problem has occurred.

Even if a larger MAX_SEGMENT_SIZE is sometimes problematic, it is a responsibility of users. Conservative users should set smaller MAX_SEGMENT_SIZE or use the default value (100) and advanced users can set larger MAX_SEGMENT_SIZE on a case-by-case basis.

In the current implementation, all sort of users with many kinds of use-cases/backgrounds are forced to use MAX_SEGMENT_SIZE == 100. This PR exposes an interface to customize it while preserving the current behavior (MAX_SEGMENT_SIZE == 100) as the default.

desbma commented 1 year ago

Actually, I'm everyday using my fork with MAX_SEGMENT_SIZE == 200 in production environment and no problem has occurred.

Interesting, I don't remember exactly why the value of 100 was chosen, but I did a quick test on Google Translate and the text is cut around length 120-180 depending on the punctuation.

Even if a larger MAX_SEGMENT_SIZE is sometimes problematic, it is a responsibility of users. Conservative users should set smaller MAX_SEGMENT_SIZE or use the default value (100) and advanced users can set larger MAX_SEGMENT_SIZE on a case-by-case basis.

Why don't we just set the hardcoded value to 200, since that seems to work well ?

your-diary commented 1 year ago

Why don't we just set the hardcoded value to 200, since that seems to work well ?

I think that is also a good fix for my current use-case (though someone or me of tomorrow wants to set 300 or 400 or ...).

(By the way, please note someone who wants to set MAX_SEGMENT_SIZE == 300 can directly edit the local __init__.py file but upgrading Python version and re-install google_speech for that version resets MAX_SEGMENT_SIZE to the default 100.)

desbma commented 1 year ago

The thing is, I'm not sure it makes sense to expose the segment size to the user.

The value is here because unlimited text will cause Google API errors. So we have to use a value that is as high a possible to avoid weird pauses, and as low as required to not get errors.

I don't see any good reason to delegate that choice to the user, that seems lazy to me, "we don't really know, so choose a value yourself and good luck".

I'd rather increase the segment size constant if we are sure it works.

Did you made tests with higher values to see if you get errors with segment size of 250, 300 ?

your-diary commented 1 year ago

While I still personally think exposing a customizable API makes sense (for, ah maybe..., advanced users), I do agree with

I don't see any good reason to delegate that choice to the user, that seems lazy to me, "we don't really know, so choose a value yourself and good luck".

to make this app user-friendly for every level of users.

Did you made tests with higher values to see if you get errors with segment size of 250, 300 ?

No. But I can perform simple statistical tests if you want me to and if you give me one or two weeks.

Edit:

I'd rather increase the segment size constant if we are sure it works.

One of my concern is it will break backward-compatibility.

desbma commented 1 year ago

But I can perform simple statistical tests if you want me to and if you give me one or two weeks.

That would be great yes.

One of my concern is it will break backward-compatibility.

If we bump the value as much as possible without generating API errors, this should not break anything. The command line and internal GoogleSpeech API remains and behaves the same.

your-diary commented 1 year ago

That would be great yes.

OK, I'll be back in one or two weeks.

your-diary commented 1 year ago

I performed simple statistical tests, using testing/try_variety_of_max_segment_size branch of my fork.

The Rust code I wrote for the tests can be found here.

How I Performed Tests

Set MAX_SEGMENT_SIZE to infinity (just for convenience).
Execute ./__init__.py ${SCRIPT}_${INDEX} where ${SCRIPT} is a fixed string with the length around 95 and ${INDEX} is a consecutive integer to disable the cache. The total length of ${SCRIPT}_${INDEX} is around 100.
Repeat the step 2 many times (n = 621) with 1000ms sleep between any two executions.
Repeat the step 2 and 3 for ${SCRIPT} of the length around 105, 115, 125, ... (i.e. the total length is 110, 110, 120, ...)

Results

Firstly, if the length of an input is greater than or equal to 200, the API always return 400 BAD REQUEST, meaning we shall not set MAX_SEGMENT_SIZE >= 200.

Secondly, both MAX_SEGMENT_SIZE = 100 (default) and MAX_SEGMENT_SIZE = 195 worked well. All of the API calls with 100 <= MAX_SEGMENT_SIZE < 200 returned 200 OK. The execution time slightly increased as MAX_SEGMENT_SIZE became larger though (see the images below).

Screenshot 2023-05-29 at 23-20-44 Untitled spreadsheet

Screenshot 2023-05-29 at 23-36-14 Untitled spreadsheet

Conclusion

I think we can safely set MAX_SEGMENT_SIZE = 195 (or more conservatively 190 or 150, etc.?).

And the original PR shall not be merged because the only valid values of MAX_SEGMENT_SIZE are ranged only over the very small range [100..199].

desbma commented 1 year ago

Wow thanks for the in depth analysis with graphs, I did not expect so much work.

All of the API calls with 100 <= MAX_SEGMENT_SIZE < 200 returned 200 OK.

I think we can safely set MAX_SEGMENT_SIZE = 195 (or more conservatively 190 or 150, etc.?).

Why not 199, since only >=200 seems to generate errors ?

your-diary commented 1 year ago

I refined my test and just confirmed MAX_SEGMENT_SIZE = 200 works though MAX_SEGMENT_SIZE >= 201 does not.

n = 711 this time. No error was returned for MAX_SEGMENT_SIZE = 200.

Screenshot 2023-05-30 at 09-58-42 Untitled spreadsheet

I also tested Unicode characters to confirm the API is Unicode-aware. This Japanese string of the length 200 works though appending a single character makes the API return 400 BAD REQUEST.

$ ./google_speech/__init__.py --max-segment-size 1000 -v debug -l ja '昔あるところに、月にもお日さまにも増して美しい一人娘をお持ちの王さまとお妃さまがおりました。娘はたいそうおてんばで、宮殿中の物をひっくり返しては大騒ぎをしていました。気まぐれでわがままなこの娘のことを昔あるところに、月にもお日さまにも増して美しい一人娘をお持ちの王さまとお妃さまがおりました。娘はたいそうおてんばで、宮殿中の物をひっくり返しては大騒ぎをしていました。気まぐれでわがままなこの娘のことは'

So the new conclusion is it seems we can safely set MAX_SEGMENT_SIZE = 200.

Why not 199, since only >=200 seems to generate errors ?

The reason why I said

I think we can safely set MAX_SEGMENT_SIZE = 195 (or more conservatively 190 or 150, etc.?).

is just to be conservative. As far as I tested, MAX_SEGMENT_SIZE = 200 works. However, I performed the tests in Japan, whose internet connection is of high quality, and I didn't performed the tests thoroughly (i.e. n is only 500 or so, and only Japanese and English was tested). So the error rate may increase depending on the countries users live in, etc. compared to when MAX_SEGMENT_SIZE = 100.

Again, as far as I tested, MAX_SEGMENT_SIZE = 200 works. So I don't think it is a bad idea to set MAX_SEGMENT_SIZE = 200. Being conservative is just one idea.

desbma commented 1 year ago

The 400 errors are caused by a server side check on Google servers, they are not impacted by the connection quality. Unless their servers have different logic for different countries, if 200 works in your tests, it should work in all conditions.

I will settle with 200 then: e46f68168d6cca91d6255d2ad1e63498f19dbb88.

Anyway thank you for the thorough analysis 👍

desbma commented 1 year ago

A friendly warning though: this project is half maintained, if you want to use Google TTS for anything serious, you may need to use a bigger project with more active maintenance and community, like gTTS.

your-diary commented 1 year ago

Thank you for discussing with me and for the quick fix!

desbma / GoogleSpeech

Added `-s/--max-segment-size` flag to change the value of `MAX_SEGMENT_SIZE`. #27

Motivation

What's Changed

How I Performed Tests

Results

Conclusion