Glitchy audio after 5 to 10 minutes

roschler commented 4 years ago

I have noticed in my scene that after about 5 to 10 minutes of generating audio, the audio gets "glitchy". In other words, after having my Sumerian Host characters (3 of them) talk for about 5 to 10 minutes straight, the audio starts to develop really bad ticks and pops, to the point that it is unlistenable (think of a really old, scratchy vinyl record being played). This usually ends up being an audio buffer handling problem somewhere. It happens every time and the symptoms are always the same: a few minutes of perfect audio, then a single tick or pop here or there, and once the ticks and pops start, they increase rapidly until the audio is unlistenable and they occur with every audio buffer being played.

I know this might be a Chrome thing, but just in case, is there a way to "patch" into the audio generation code of the Sumerian host library and play the audio with something else, in my case Howler.js, while maintaining the lip sync between the audio and the character/host animation?

c-morten commented 4 years ago

Hi @roschler, I'm happy to look into this. Can you provide some steps for how to reproduce this issue? If you had a transcript of host speech that would cause this to occur that would be really helpful. Also would be good to get some information about your browser and version and which rendering engine you're using. Have you tried using a different browser like Firefox to see if the issue still occurs?

For the audio implementation, we are using Web Audio. For the three.js build we create an Audio object and either a THREE.PositionalAudio or THREE.Audio object, then connect the Audio object to the three.js Audio object using Audio.setMediaElementSource. In Babylon.js the rendering engine handles the creation of the web audio object; we just pass the url to the constructor of BABYLON.Sound.

If you wanted to circumvent the host audio I can think of two possibilities. The first option is a little hacky but quicker to implement. You could call the setVolume method of the TextToSpeechFeature to set it to 0. Then you could listen for the TextToSpeechFeature's play event, which will supply a Speech object as an argument to your listener function. When you catch the event you could immediately pause speech and use the Speech object's audio property to get a handle to the Web Audio object, which will point you to the url the audio was loaded from. Use that to create your Howler.js audio and then play your resulting audio once it's ready at the same time as resuming speech on the host.

The second option would be to pull the repository and create your own custom build that overloads the speech implementation. TextToSpeechFeature._synthesizeAudio is where you'd need to create your custom audio. You may also need to overload play/pause/resume/stop of the Speech class depending on how Howler.js audio works. I can provide more details on the second option if you do want to try that.
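The first option above could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's documented API: the names `host.listenTo`, `TextToSpeechFeature.EVENTS.play`, `pause`, `resume`, and `speech.audio.src` are guesses based on the description in this comment, and `createSound` is a hypothetical factory supplied by the caller so that the external player (Howler.js here) stays out of the helper itself.

```javascript
// Sketch only. Every host-side name used here (listenTo, EVENTS.play,
// setVolume, pause, resume, speech.audio.src) is an assumption taken from the
// description above and should be checked against the library source.
function routeSpeechThroughExternalPlayer(host, createSound) {
  const tts = host.TextToSpeechFeature;

  // Silence the built-in audio; visemes and animation keep running.
  tts.setVolume(0);

  host.listenTo(tts.EVENTS.play, (speech) => {
    // Hold the host until the external audio has loaded.
    tts.pause();

    // Assumed location of the URL the speech audio was loaded from.
    const url = speech.audio.src;

    createSound(url, (sound) => {
      // Start both as close together as possible to keep lip sync.
      sound.play();
      tts.resume();
    });
  });
}

// Hypothetical Howler.js factory (browser only, not executed here):
// const createSound = (url, onReady) => {
//   const sound = new Howl({
//     src: [url],
//     format: ['mp3'], // useful when the URL has no file extension
//     onload: () => onReady(sound),
//   });
// };
// routeSpeechThroughExternalPlayer(host, createSound);
```

If resuming speech fires the play event again in the real library, the listener would need a guard to avoid pausing a second time; that behavior is not confirmed here.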

roschler commented 4 years ago

@c-morten The text doesn't matter. I've run a lot of tests. To test for yourself, just grab 5 to 10 minutes worth of text off the web, anywhere, and just keep generating TTS with the host TextToSpeechFeature facility with it. It's a quantity thing.

A big thanks for the audio internals details. Hopefully it doesn't come to that (at least for now). Eventually I'll want to replace the audio anyway, but hopefully not for a long time. I'd like to apply volume and sound effects to the voices eventually and I don't think there's a way to do that with the current library. Note, it would be nice if there was an easy way to swap out control of the audio, so that once the audio needs to be started in sync with the viseme stream to effect lip sync TTS, the audio side of things could be handed off to a consumer-provided callback.

For now, I'm going to try the same test on my other stations. Hopefully it's an Ubuntu 14.04 audio driver issue and nothing else. That's an old Linux build.

roschler commented 4 years ago

@c-morten Does Sumerian Hosts use WAV or MP3 generated audio when creating TTS through Polly via the TextToSpeechFeature._synthesizeAudio call? I found this Stack Overflow post that mentions crackling audio when using WAV formatted audio and suggests switching to MP3:

https://stackoverflow.com/questions/6955957/html5-audio-crackle-in-chrome

c-morten commented 4 years ago

The audio format is specified in the options you pass in when adding the TextToSpeechFeature, or when you play speech. If you don't define it we default to MP3, so you most likely were not getting WAV audio.

roschler commented 4 years ago

Ok, thanks. I was hoping it was WAV. Looks like I'm going to have to fork and dig deeper. It happens on all stations.

c-morten commented 4 years ago

I have not yet been able to reproduce this; I have run test audio for over 30 minutes straight on all 3 builds with no issues yet. Can I get more information on your test scenario:

Which build are you using?
Is the browser and tab that's playing the audio active for the entire time leading up to when the issue occurs?
When it starts happening, do you notice any memory spikes in the console?
Do you encounter the same issue when using Firefox instead of Chrome?

roschler commented 4 years ago

I'll try the tests you suggest.

One major point. I have 3 hosts on the screen and each of them speaks for 1/3 of the time (Luke, Cristine, & Alien). In other words, the dialogue is split across them like 3 actors in a scene, each taking their turn. I mention this in case it turns out to be a memory issue or similar that only presents itself with frequent starting/stopping of TTS between multiple hosts. If you are using a single host and just spooling dialogue continuously through it, then this could be a key difference.

On Wed, Sep 9, 2020 at 8:29 PM c-morten notifications@github.com wrote:


roschler commented 4 years ago

@c-morten Just tested on Firefox. It happens with Firefox too, with the same pattern.

Where do I look to give you the correct answer to "what build are you using?"?

Regarding memory spikes, do you mean in the main system monitor or in the Chrome Task Manager (i.e. Chrome's internal system monitor)?

Here's a note, not related to the audio crackling. Just a general comment about Sumerian Hosts audio on Firefox compared to Chrome. On Chrome, before the crackling occurs, the audio is smooth. On Firefox, the audio seems to get "clipped" at the start and the end of the waveform. If you have ever worked with music gear, it feels like a noise gate with the volume threshold set too high, so when the audio starts there's an abrupt jump from no sound to some sound instead of a gentle, smooth easing in like sound normally does.

roschler commented 4 years ago

@c-morten I watched memory/CPU/GPU in both the main System Monitor (Windows 8) and the Chrome Task Manager. Memory did not jump around much, but I did see something strange. When the audio was smooth, the CPU% was around 46% and then dropped to about 5% when the host animation/audio playing stopped. However, when the audio started crackling, especially heavily, the CPU was around 79% or worse. Also, after the scene stopped, the CPU stayed at that same high consumption level instead of dropping precipitously like it usually does after a scene stops. It's as if something in the browser is stuck doing something and won't stop.

This is wild speculation, but if for some reason some audio rendering process got stuck, then further attempts to play audio could easily cause crackling since the audio buffers would not be delivered properly with gaps between their delivery. This would get worse with each attempt if each attempt added another stuck audio process on the "stack".

roschler commented 4 years ago

@c-morten I found a tutorial on debugging web audio problems using Chrome DevTools, especially in regards to crackling:

https://web.dev/profiling-web-audio-apps-in-chrome/

Here are some screenshots showing the performance metrics before and after crackling has begun. I have drawn boxes around the stats that are most notable (to me):

[Screenshots: VIEW: tracing, SECTOR: AudioOutputDevice. PHASES: before crackling has begun; during crackling.]

NOTE: For the wasapi_render_thread, I didn't see any glaring differences, but when I look at the average durations the load appears to be about 25% greater during crackling compared to before crackling.

[Screenshots: VIEW: tracing, SECTOR: wasapi_render_thread. PHASES: before crackling has begun; during crackling.]

NOTE: For the WebAudio Tools screenshots, look at the status line at the bottom of the screen.

[Screenshots: VIEW: WebAudio Tools. PHASES: idle (baseline, before any audio rendering has begun; all values in the status line are zero); active (rendering scene and audio, but before crackling has begun); during crackling (the scene is rendering and crackling has made the audio unlistenable); idle, after crackling has begun (the scene is no longer rendering).]

As you can see, the audio rendering system is completely damaged. I tried the trash can icon to execute an explicit garbage collection operation, and it did not help at all, no change. Note, the tutorial I linked to above also has tips on how to restructure audio rendering code to try and correct problems that might be causing the audio rendering difficulties. Let me know if you need anything else.

c-morten commented 4 years ago

Thanks for the link, I'll try debugging this way. In regards to figuring out which build you are on, are you using host.three.js or host.babylon.js? These would either be referenced in a script tag in your html file or you would have installed amazon-sumerian-hosts via npm and imported one of those.

roschler commented 4 years ago

@c-morten

Here's the package.json reference for amazon-sumerian-hosts:

  "devDependencies": {
    "amazon-sumerian-hosts": "^1.3.1"
  }

I am using host.three.js.

roschler commented 4 years ago

Any updates? I still have this problem and it happens consistently.

c-morten commented 4 years ago

I have not had much luck reproducing this yet. It's not happening for me even within 30 minutes, so it's difficult to know how long I need to let things run before calling it quits. Since it is happening consistently for you, there are a few things I would want to test that you might try; it would be good to know your results:

Can you reproduce this using three.js traditional audio rather than positional audio? To do this, do not define the attachTo property of the options object you pass when creating the TextToSpeechFeature. If this option is not defined it will default to creating a three.js Audio object rather than a PositionalAudio object.
A little more involved, but can you reproduce this using the host.babylon.js build rather than host.three.js? Trying to determine if this is specific to the rendering engine audio system since hosts hook into the audio system of the rendering engine being used.
Last resort, I would try generating audio files for the dialog you are passing to the host system using the AWS Polly console. Then create an application that uses three.js without the host package and play that audio in sequence using the three.js audio system. Does this reproduce the issue?

roschler commented 4 years ago

"Can you reproduce this using three.js traditional audio rather than positional audio? To do this, do not define the attachTo property of the options object you pass when creating the TextToSpeechFeature. If this option is not defined it will default to creating a three.js Audio object rather than a PositionalAudio object."

Thanks. I'll give that a try. I don't have time to do the host.babylon.js test right now because that would be a massive refactor. But I'll try disabling positional audio as you suggest.

BTW, I found this interesting post that describes problems with :

https://bugs.chromium.org/p/chromium/issues/detail?id=175363

I'm not sure if this is relevant, but this and other posts I found describe problems with the use of ScriptProcessorNode that can cause crackling audio.

c-morten commented 4 years ago

I'm taking a wild guess here, but I'm thinking there may just be too much audio stored if you are continually playing dialog for long periods of time. We don't have any system in place for managing the storage of audio you are creating, but maybe you could set up a test to confirm whether or not this is actually the case. You will need to access internal host variables to get to the place where the host audio is stored. Assuming you have a HostObject variable named host, we store the speech audio that gets generated in the following location: host._features.TextToSpeechFeature._speechCache. Try setting up a keyboard event to set this variable to {}, then execute that keyboard event once you hear the audio crackling. Monitor the memory to try to determine when the next garbage collection happens after executing that event. Does the next piece of audio that plays after garbage collection happens play back normally?
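The cache-clearing test described above might be wired up along these lines. This is a minimal sketch: `host._features.TextToSpeechFeature._speechCache` is the internal field named in this comment (not public API), while the helper name, the `target` parameter, and the default key are hypothetical choices made for illustration.

```javascript
// Sketch only. `_features` and `_speechCache` are internal fields named in
// the comment above; they are not public API and may change between releases.
function installSpeechCacheReset(host, target, key = 'c') {
  const handler = (event) => {
    if (event.key !== key) return;

    // Drop all references to cached speech so it becomes collectable.
    host._features.TextToSpeechFeature._speechCache = {};
    console.log('Speech cache cleared; watch memory for the next GC.');
  };

  target.addEventListener('keydown', handler);

  // Return a function that removes the listener again.
  return () => target.removeEventListener('keydown', handler);
}

// Hypothetical browser usage: press "c" once the crackling starts.
// const removeReset = installSpeechCacheReset(host, window, 'c');
```

Passing the event target in keeps the helper independent of `window`, so it can be attached to a canvas or any other element that receives key events.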

c-morten commented 4 years ago

I was just scanning through the three.js audio documentation and I noticed there’s a mistake in our three.html example file, I’m wondering if it may be causing your issue. How closely are you following the example code? In our createHost method we’re creating a separate THREE.AudioListener instance for each host. However the three.js docs state that there’s only meant to be one listener per scene. If you are also using multiple listeners, try using just one instead.
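One way to guarantee a single listener per scene is to create it lazily and hand the same instance to every host. The sketch below assumes only the three.js calls `new THREE.AudioListener()` and `camera.add(...)`; the helper name is hypothetical, and how the listener is passed into each host's setup depends on the example code in use.

```javascript
// Sketch only. `THREE` and `camera` are passed in so the helper has no hidden
// dependencies; how each host consumes the listener is left to the caller.
function makeListenerProvider(THREE, camera) {
  let listener = null;

  return function getListener() {
    if (listener === null) {
      // Created once, on first request, and attached to the camera.
      listener = new THREE.AudioListener();
      camera.add(listener);
    }
    return listener;
  };
}

// Hypothetical usage during scene setup:
// const getListener = makeListenerProvider(THREE, camera);
// Then, in the per-host setup, use getListener() wherever the example code
// currently constructs its own THREE.AudioListener.
```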

roschler commented 4 years ago

To set up the hosts I'm using the code from the examples. I just checked my code and indeed three audioListener objects are being added to the camera object (odd place to add a listener object, don't you think?). I'm going to move that code out of the per-host setup code to the scene initialization stage and do that operation only once. I'll tell you how it goes tomorrow.

roschler commented 4 years ago

@c-morten The audioListener idea was helpful but I don't think it solved the original problem. I say this because now that I only create one audioListener object instead of 3, the glitchy audio still occurs, it just takes 3 times longer to start degrading. This is a big help but I would still like to get rid of the problem completely. When I get the chance I'll try your cache clean-up idea.

Side note. How can I get a list of the emotes? I looked at the emote.glb file but that's in a format that is not readable by a standard editor. When I try and open it I see non-ASCII characters. I see the animations in the gestures.json file that exists for each character, but not the emotes? Does the "Alien" character only have the one "angry" emote?

c-morten commented 4 years ago

Hi @roschler. The .glb format is viewable in DCC applications like Blender. You can also import .glb files into glTF viewers like https://gltf-viewer.donmccurdy.com/ and https://sandbox.babylonjs.com/ to preview the names of the animations they contain. Currently the "Alien" character only has the "angry" emote. That character has a more limited animation set because it was used as a test to prove that we could use the PointOfInterestFeature on characters whose rigs have varying proportions and joint orientations/names.

aws-samples / amazon-sumerian-hosts

Glitchy audio after 5 to 10 minutes #20