wiresv commented 1 year ago

The audio playback of the text-to-speech voice is not smooth, and has unnatural pauses while speaking.

Is this solvable, or a limitation of the way the ChatGPT response is served back to the client?

C-Nedelcu commented 1 year ago

This is caused by the text-to-speech API. You send it a segment of text and that text is spoken. If you send two segments, they are read as separate sentences, with a pause in between.

Since ChatGPT doesn't send the full reply at once (you have surely noticed that the bot's responses are "streamed": you receive one word at a time, sometimes fast, sometimes slow), we have a choice:

either wait until the reply is fully received to start speaking. This is what I did at first, but it sometimes took ages to get a reply.
or, start speaking out loud as we go, while the bot's response is still being streamed.

Since I am now using the 2nd option, I have to decide how to segment the response so that the voice speaks actual sentences. So far, I am splitting the bot's responses into sentences at punctuation marks (commas, periods, exclamation marks, etc.). This significantly reduces the voice's initial delay: it triggers a lot faster than with option 1. However, there will still be blanks depending on the response contents.

Example: Question: "Is Brazil a country?" Answer: "Yes, (long blank here because the next sentence is very long) Brazil is a country in South America that has a population of about 214 million people and contains the Amazon which is the world's most amazing and beautiful forest."

You will hear "Yes" immediately because it is followed by a comma. But then you have to wait until you receive the rest of the sentence in full before the sentence is sent to the text-to-speech queue.
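The buffering described above can be sketched as follows. This is a minimal illustration and not the extension's actual code: `createStreamSegmenter` and its `onSegment` callback are hypothetical names, and the delimiter set is assumed from the description (commas, periods, exclamation marks, and similar).

```javascript
// Hypothetical helper (not part of the extension): buffers streamed text and
// hands each completed segment to a callback as soon as punctuation arrives.
function createStreamSegmenter(onSegment) {
  var buffer = "";
  var delimiters = ",.!?;:"; // assumed delimiter set

  function emit() {
    var segment = buffer.trim();
    if (segment !== "") onSegment(segment);
    buffer = "";
  }

  return {
    // Feed the next streamed chunk (one or more characters).
    push: function (chunk) {
      for (var i = 0; i < chunk.length; i++) {
        buffer += chunk[i];
        if (delimiters.indexOf(chunk[i]) !== -1) emit();
      }
    },
    // Call once the reply is complete to emit any trailing text.
    flush: emit
  };
}

// Usage: collect segments in an array. In a browser, each segment could
// instead be passed to speechSynthesis.speak(new SpeechSynthesisUtterance(s)).
var spoken = [];
var segmenter = createStreamSegmenter(function (s) { spoken.push(s); });
["Yes", ",", " Brazil", " is", " a", " country", "."].forEach(function (t) {
  segmenter.push(t);
});
segmenter.flush();
```

With this input, "Yes," is emitted as soon as the comma arrives, while "Brazil is a country." has to wait for the final period, which is the blank described above.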

I am working on improving this by using particular stop words that can be used as sentence breaks.

I have determined that the following words could be used as sentence breaks: "so, and, then, therefore, or, because, but, that, which".

Using that technique, the response would be spoken a lot sooner.

Before:

Answer: "Yes, (long blank here because the next sentence is very long) Brazil is a country in South America that has a population of about 214 million people and contains the Amazon which is the world's most amazing and beautiful forest."

After:

Answer: "Yes, (short blank here until the word 'that' is received), Brazil is a country in South America (short blank here until the word 'and' is received) that has a population of about 214 million people (short blank here until the word 'which' is received) and contains the Amazon (short blank here until the word 'and' is received) which is the world's most amazing (short blank here until the end) and beautiful forest."
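A rough sketch of the stop-word idea, applied to a complete string for simplicity. The function name `splitAtStopWords` is hypothetical; the word list is the one proposed above. A break is placed before each stop word, so every segment after the first starts with one.

```javascript
// Stop words proposed in this comment as possible sentence breaks.
var STOP_WORDS = ["so", "and", "then", "therefore", "or", "because", "but", "that", "which"];

// Hypothetical helper: starts a new segment in front of every stop word.
function splitAtStopWords(text) {
  if (text.trim() === "") return [];
  var words = text.trim().split(/\s+/);
  var segments = [];
  var current = [];

  for (var i = 0; i < words.length; i++) {
    // Compare without case or attached punctuation ("And," matches "and").
    var bare = words[i].toLowerCase().replace(/[^a-z]/g, "");
    if (current.length > 0 && STOP_WORDS.indexOf(bare) !== -1) {
      segments.push(current.join(" "));
      current = [];
    }
    current.push(words[i]);
  }

  if (current.length > 0) segments.push(current.join(" "));
  return segments;
}
```

For the Brazil sentence this yields five segments, breaking before "that", "and", "which", and the second "and". Very short segments (for example around "or") would still be read as separate sentences, so some minimum length would probably be needed in practice.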

I could also somehow try to detect whether breaks are needed based on whether the voice is currently active or not.
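One way this could look, as a sketch only: the Web Speech API exposes boolean `speaking` and `pending` properties on `window.speechSynthesis`, so the splitter could relax its rules while the voice is idle. The helper name `takeSpeakableText` and the policy of falling back to the last word boundary are assumptions, not something the extension does.

```javascript
// Hypothetical helper: decides how much of the buffered text to speak now.
// `synth` is expected to look like window.speechSynthesis, i.e. to have
// boolean `speaking` and `pending` properties.
function takeSpeakableText(buffer, synth) {
  var idle = !synth.speaking && !synth.pending;

  // While the voice is busy, wait for real punctuation. When it is idle,
  // also accept the last word boundary so that speech can resume sooner.
  var pattern = idle ? /^[\s\S]*[,.!?;:\s]/ : /^[\s\S]*[,.!?;:]/;
  var match = buffer.match(pattern);

  if (!match) return { speak: "", rest: buffer };
  return { speak: match[0].trim(), rest: buffer.slice(match[0].length) };
}
```

The trade-off is that text cut at a word boundary is still read as its own sentence, so this only helps if idle breaks are rare.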

This will make the bot's voice feel more natural and start speaking sooner; the slow start is obviously a problem at the moment.

I have also tried splitting sentences into fixed segments of 5 words each: the voice triggers very quickly, but it sounds extremely bad. Each group of 5 words will be read as a sentence. So it sounds like this:

"Yes, (short blank here), Brazil is a country in. (full stop) South America that has a (full stop) population of about 214 million (full stop) people and contains the Amazon (full stop) which is the world's most (full stop) amazing and beautiful forest."

That's obviously not an option.

bartman081523 commented 1 year ago

@C-Nedelcu I noticed this problem too. Could a workaround be to send only completed sentences to the TTS API? Also, thank you. Your extension is a great help.

EDIT: Can you change the punctuation that gets sent to the TTS API so that a comma is internally replaced with " - ", so the voice is not raised in pitch? I don't know if " - " is the right tool; maybe some other punctuation that doesn't raise the TTS voice in pitch?

dev-eduka commented 1 year ago

Unfortunately, the TTS API doesn't offer these options.

bartman081523 commented 1 year ago

Thank you for that info. I remembered the pitch wrong and was writing from memory. I am sorry.

On another note: I often have the case that a sentence begins with one or two words followed by a comma. Could you implement a setting that adjusts the minimum length, so that if the character count of a sentence is below a certain value, it is not split at the punctuation into two separate packets before being sent to the TTS API?

Example: with skip_punctuation=0 characters you get "Hello, [Pause] how are you?"; with skip_punctuation=20 characters it should (imo) be "Hello, how are you?" (without pause).

Alternatively, the setting could adjust how many sentence pieces (split by punctuation) get sent to the TTS API at once: with combine_sentence_pieces=1 you get "Hello, [Pause] how are you?"; with combine_sentence_pieces=2 it should (imo) be "Hello, how are you?" (without pause).
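A sketch of what the character-count variant could look like, assuming the text has already been split into pieces at punctuation. `mergeShortPieces` and `minChars` are hypothetical names (`minChars` playing the role of the proposed `skip_punctuation` setting); neither exists in the extension.

```javascript
// Hypothetical helper: joins consecutive pieces until the joined text is at
// least `minChars` long, so short openers like "Hello," are not spoken alone.
function mergeShortPieces(pieces, minChars) {
  var merged = [];
  var pending = "";

  for (var i = 0; i < pieces.length; i++) {
    pending = pending === "" ? pieces[i] : pending + " " + pieces[i];
    if (pending.length >= minChars) {
      merged.push(pending);
      pending = "";
    }
  }

  // Whatever is left at the end is emitted even if it is still short.
  if (pending !== "") merged.push(pending);
  return merged;
}
```

With `minChars` set to 0 the pieces pass through unchanged; with 20, "Hello," and "how are you?" come out as a single piece. On a live stream, the final short remainder could only be emitted once the reply is known to be complete.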

bartman081523 commented 1 year ago

ChatGPT gave me this revised version. I don't speak JavaScript fluently, but can you have a look?

content.js Line 116

function CN_SplitIntoSentences(text) {
  var sentences = [];
  var delimiters = /[.!?;:]/; // Match sentence delimiters
  var currentSentence = "";

  for (var i = 0; i < text.length; i++) {
    //
    var currentChar = text[i];

    // Add character to current sentence
    currentSentence += currentChar;

    // At a delimiter, emit the buffered text only once it is longer than 75
    // characters; shorter pieces stay in the buffer and merge with what follows
    if (delimiters.test(currentChar)) {
      if (currentSentence.trim().length > 75) {
        sentences.push(currentSentence.trim());
        currentSentence = "";
      }
    }
  }

  return sentences;
}

Edit: I tested briefly, and this works. If the sentence is shorter than 75 characters, it will not get split.

rotemdan commented 1 year ago

As a temporary solution, in the CN_SplitIntoSentences() method I changed:

if (currentChar == ',' || currentChar == ':' || currentChar == '.' || currentChar == '!' || currentChar == '?' || currentChar == ';') {
  ...
}

To:

if (currentChar == ':' || currentChar == '.' || currentChar == '!' || currentChar == '?' || currentChar == ';') {
  ...
}

I simply removed the comma (,) as a sentence separator, and the speech flows much more naturally. Otherwise the choppiness makes the speech difficult to follow.

The problem with splitting the text over any period character (.), however, is that abbreviations like "Mr. " and "Sgt." would be detected as sentence boundaries and cause pauses. The same happens with decimal numbers such as 3.14.

A more thorough solution would be to use a specialized library that performs sentence segmentation. The best one I know of from personal experience, which also supports many languages (it detects the language automatically), is cldr-segmentation.js:

var supp = cldrSegmentation.suppressions.en;
cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.", supp);
// => ["I like Mrs. Murphy. ", "She's nice."]

C-Nedelcu commented 1 year ago

Thanks for the suggestions.

I still feel that splitting the sentences at commas allows for faster bot responses (i.e., you don't have to wait until the full sentence is received to start speaking), which means more human-like interactions with fewer pauses.

The library you are proposing could be interesting; I was actually looking for something like this initially. However, I see two possible issues: (1) not being able to use commas to split sentences, and (2) I am not 100% sure what big improvements this would bring to the project, aside from the obvious Dr. / Mr. / Ms. issue, and these aren't very common occurrences with ChatGPT. Could you refresh my imagination and provide some examples where this could be a big help? I am open to looking into it for a future release.

More simply, I could consider adding an option that allows people to choose whether they want commas to be sentence separators.

rotemdan commented 1 year ago

@C-Nedelcu

When commas were used to split sentences, the text-to-speech function was unusable for me, especially with Microsoft Edge cloud voices, which send a new request to a server every time a new fragment is read, so the delay can be significant. The reading sounded very non-fluid and it was hard to follow. I've since changed the code on my own local copy of the extension not to split on commas and it is much more usable now.

The issue with using . to separate sentences isn't just with abbreviations like Mr., e.g. and i.e. (both of which are very common in English), but also with decimal numbers and code. Since I often use ChatGPT with math problems and code, having the speech break every time a number like 3.14, a date like 03.05.2020, or a piece of code like someObject.someProperty is read makes it less usable in these types of cases.

I'm not sure that using a library like the one I suggested will fully solve the problem with code, though, since Microsoft's cloud TTS text processing is able to read code and identify these cases even better than a local library does (it will read the example as "some object dot some property"). The library will work for many languages though, and will deal with abbreviations, decimal numbers, dates etc.

Several other approaches:

Modify the code not to split the text when the characters on both sides are digits. Regular expressions can be used to identify numeric sequences of this kind.
Add an option not to split on commas.
Add an option to instead split the text by paragraphs or line breaks. The delay would be greater, but it would be more usable for code fragments and similar complex text (as good as the TTS engine parses it).
Split into sentences using a library, but add an option that would further split the results into phrases (using characters like commas).

(this is just a partial list)

(Edit: added alternative where the split sentences are further split using characters like commas).
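The first approach in the list (not splitting when both neighbours are digits) could be sketched as below. This uses a character loop in the style of the existing `CN_SplitIntoSentences()` rather than a regular expression, and `splitOutsideNumbers` is a hypothetical name. It works on a complete string and emits the trailing remainder, which the streaming code may want to handle differently.

```javascript
function isDigit(ch) {
  return ch >= "0" && ch <= "9";
}

// Hypothetical helper: splits at . ! ? ; : unless the delimiter sits
// directly between two digits (3.14, 03.05.2020, 12:30).
function splitOutsideNumbers(text) {
  var sentences = [];
  var current = "";

  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    current += ch;

    var isDelimiter = ".!?;:".indexOf(ch) !== -1;
    var betweenDigits = i > 0 && i + 1 < text.length &&
      isDigit(text[i - 1]) && isDigit(text[i + 1]);

    if (isDelimiter && !betweenDigits) {
      if (current.trim() !== "") sentences.push(current.trim());
      current = "";
    }
  }

  if (current.trim() !== "") sentences.push(current.trim());
  return sentences;
}
```

This keeps decimal numbers and numeric dates intact but does nothing for abbreviations or for code such as someObject.someProperty.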

jt-github commented 1 year ago

As a quick-fix, that comma-option would be a good start. [X] Pause at commas - starts speaking faster but may add unwanted pauses depending on selected language/browser/etc.

I'm in the same boat as @rotemdan with the comma-pauses breaking the illusion completely. That said, I am also using Edge and I appreciate the clue about the cloud-based voice generation being another possible source of delay. I tried it out on Chrome and, even when using Google's superior voices (vs. Microsoft's local low-quality voices), the delay caused by the comma is far less noticeable... until the speech generation catches up and is waiting for the next comma-based chunk.

Someday, it would be nice if OpenAI allowed configuring ChatGPT to respond instantly rather than in the pseudo-typed, human-like slow mode it uses now, so that you wouldn't have to use chunking at all.

jt-github commented 1 year ago

Oh, another super cheap solution to the dot (.) problem: only chunk when a punctuation mark is followed by a space, like ". " or ", ". This way, mid-sentence commas and end-of-sentence periods will act as delimiters, but not ones that are sandwiched in the middle of numbers, dates, or code objects. That won't help with words like Mr., Mrs., etc., but you could use another library to replace such abbreviations with their full-word versions (Mister, Missus, etc.) before chunking up the text.
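A sketch of this suggestion, under a few assumptions: the function names are hypothetical, the end of the text is treated like a following space, and a tiny hand-written lookup table stands in for the abbreviation library mentioned above.

```javascript
// Illustrative lookup table only; a real list would be much longer.
var ABBREVIATIONS = { "Mr.": "Mister", "Mrs.": "Missus", "Dr.": "Doctor" };

function expandAbbreviations(text) {
  return text.replace(/\b(?:Mr|Mrs|Dr)\./g, function (m) {
    return ABBREVIATIONS[m];
  });
}

// Hypothetical helper: a punctuation mark ends a chunk only when it is
// followed by whitespace or by the end of the text.
function splitOnPunctuationBeforeSpace(text) {
  var prepared = expandAbbreviations(text);
  var pieces = [];
  var current = "";

  for (var i = 0; i < prepared.length; i++) {
    var ch = prepared[i];
    current += ch;

    var next = prepared[i + 1]; // undefined at the end of the text
    var endsPiece = ",.!?;:".indexOf(ch) !== -1 &&
      (next === undefined || /\s/.test(next));

    if (endsPiece) {
      if (current.trim() !== "") pieces.push(current.trim());
      current = "";
    }
  }

  if (current.trim() !== "") pieces.push(current.trim());
  return pieces;
}
```

On a live stream the "end of the text" case is ambiguous, since the next character may simply not have arrived yet, so a streaming version would have to wait for one more character before deciding.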

C-Nedelcu commented 1 year ago

I will definitely be working on this in the next update.

C-Nedelcu commented 1 year ago

An option has been added in v2.0 to avoid using punctuation marks such as commas, semicolons, etc. as sentence delimiters. I hope it helps!

C-Nedelcu / talk-to-chatgpt

Voice playback is not smooth #12