Sharrnah / whispering-ui

Native UI for the Whispering Tiger project - https://github.com/Sharrnah/whispering (live transcription / translation)
https://whispering-tiger.github.io/
MIT License

connecting to ws #4

Closed · martjay closed this issue 1 year ago

martjay commented 1 year ago

When I first opened the software, it asked me to download the .zip archive, but after the download finished it said the file's hash value could not be verified. I closed the software and extracted the .zip archive manually. When I opened the software again, it popped up a window asking me to update to the latest version and download the latest archive. I closed that window and then created the configuration file. After I went in and configured it for the first time, it kept downloading things, but when the download progress reached about 40% the software crashed. When I opened it again, it always showed "connecting to ws". Is there a way to solve this?
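For context, a hash check like the one mentioned here typically just compares the downloaded file against an expected digest. This is a generic sketch, not this project's actual code, and where the expected hash comes from is an assumption:

```python
# Generic sketch of verifying a downloaded archive against an expected SHA-256 hash.
# The expected value would normally come from the release metadata (assumed here).
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path: str, expected_hash: str) -> bool:
    # A mismatch (or a missing expected value) produces the kind of
    # "hash value could not be verified" message described above.
    return sha256_of(path) == expected_hash
```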

https://user-images.githubusercontent.com/44982189/227405847-d809dea9-9654-420b-8833-ce8c15e33a70.mp4

martjay commented 1 year ago

It seems that the main problem appears here: after I quit the software, other processes used by it do not exit with it, so it keeps showing "connecting to ws" all the time. I hope you can fix this bug.

martjay commented 1 year ago

Can you add subtitle file translation?

martjay commented 1 year ago

The download stopped before the model was finished; the software got stuck when the download reached 1.2GB. I don't know how to solve it. 555555

martjay commented 1 year ago

I think you should change the order in which the software loads the models. The features use different models, a single model already consumes a lot of video memory, and loading multiple models at the same time exceeds the hardware's video memory. I also had serious problems with the download process: it is still downloading, and I have tried many times, but it always failed in the middle of the download. Could it first check whether the model selected in the configuration file exists before it starts loading? For example, if the selected model does not exist, do not load any models yet; download the missing model first and then start loading them one by one. During the download the software started to stutter many times, its responses became very slow, and in the end the download failed halfway through with an error.
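A minimal sketch of the check being suggested here (not the project's code; the helpers `download_fn` and `load_fn` are placeholders):

```python
# Sketch of the suggested flow: resolve every required model first, download
# anything that is missing, and only then load the models one by one.
from pathlib import Path

def ensure_models(required_models, download_fn, load_fn, cache_dir="~/.cache/whispering"):
    cache = Path(cache_dir).expanduser()
    # Phase 1: download everything that is missing, before any model is loaded.
    for name in required_models:
        if not (cache / name).exists():
            download_fn(name, cache / name)
    # Phase 2: load sequentially, so downloads never compete with loading for resources.
    return [load_fn(cache / name) for name in required_models]
```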

555555

martjay commented 1 year ago

Or, if I only want text to speech, could I load just that model without loading Whisper? I don't think this tool is suitable for game translation as it stands, because it consumes too much video memory and the game needs that memory to run. Could it be made into a local translation tool, a local text-to-speech tool, and a local subtitle recognition tool? The video below shows the same result as before: the download fails, then loading fails, and the model stutters badly while loading. This needs optimizing.

https://user-images.githubusercontent.com/44982189/227452064-03522508-94ca-40f3-b0a2-3d22db46cbf3.mp4

martjay commented 1 year ago

555555

Sharrnah commented 1 year ago

Yes, it can consume a lot of memory. But it all depends on which model sizes and precisions you select.

It's hard to know that beforehand, but I might try to add some estimates before actually starting.

But I think the checksum failure has nothing to do with that. Which model did you try to load? (faster-whisper? Which model size? Which precision?) I am wondering about the predicted size in your screenshot, because only the large models are above 2GB.

Yes, it can happen that Python gets stuck even when the UI is closed, even though the UI tries its best to kill it on close. But I have only very rarely had that happen myself. I might have an idea, though, for how I could kill it when it is already running and the UI tries to start it again.
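One way such a cleanup could look; a sketch only, assuming the psutil package, with a placeholder command-line marker rather than the project's real process name:

```python
# Before starting a new backend, look for a leftover backend process and end it.
import psutil

def kill_stale_backend(marker="audioWhisper"):  # marker is a placeholder, not the real name
    me = psutil.Process().pid
    for proc in psutil.process_iter(["cmdline"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if marker in cmdline and proc.pid != me:
                proc.terminate()        # ask the process to exit
                proc.wait(timeout=5)    # give it a moment before assuming it is gone
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.TimeoutExpired):
            continue
```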

About your suggestion to only load the TTS: yes, that could be an option I haven't added yet. Currently you can only disable the TTS and the text translation completely so they are not even loaded, and for the text translation that is only possible by editing settings.yaml. Can I ask what your exact use case would be? Only to type text and have it sent via TTS and maybe OSC?

martjay commented 1 year ago

The resident process occurs when the download fails: if you close the software and open it again, it will always show "connecting to ws ...". This is probably the biggest cause of the error. When the software exits normally, I don't seem to see the resident-process situation.

Sharrnah commented 1 year ago

I see. So it only gets stuck on the error when downloading the whisper model fails.

I checked the code and it appears it fails to download the regular Whisper model. Since that is hosted directly by OpenAI, I can't do much about that download error (except maybe hosting it myself, like I already do with the faster-whisper model).

Is there a specific reason why you try to load the original Whisper model instead of faster-whisper? I am hosting the faster-whisper models on my own S3 bucket, and it uses a different downloading library. Maybe that gives you a better download.

But I will add exception handling around the original Whisper model loading code to catch that and, for now, quit the app, until I have thought about better error handling for this case.
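A minimal sketch of that kind of guard, assuming the openai-whisper package; the model name and the exit behaviour are illustrative, not the project's actual code:

```python
import sys
import whisper  # openai-whisper

def load_whisper_or_exit(model_name="medium"):
    try:
        # whisper.load_model downloads the model from OpenAI on first use
        return whisper.load_model(model_name)
    except Exception as err:  # e.g. a failed or corrupted download
        print(f"Could not download/load Whisper model '{model_name}': {err}", file=sys.stderr)
        sys.exit(1)  # quit cleanly instead of leaving the UI stuck on "connecting to ws"
```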

I also really recommend using faster-whisper, since it gives pretty much the same translation quality with much smaller model sizes and memory usage, and at the same time much faster speeds.

martjay commented 1 year ago

The download failure and stutter described above happened like this: it had finished downloading the first model, but started to stutter while downloading the second model. I don't know what the second model it downloaded was.

555555

Sharrnah commented 1 year ago

Sorry, but I am not sure what you mean by "second model". By default the app only loads a single Whisper model, even in realtime mode.

The loading order currently is like this:

  1. Download and initialize the text translation model.
  2. Download and initialize the language classification model.
  3. (Optional) Download the faster-whisper model when it is used.
  4. Download and initialize the Voice Activity Detection model if it is enabled.
  5. Start the websocket thread, where the TTS is loaded and initialized.
  6. Start Whisper in a separate thread, which either downloads and initializes the original Whisper model when it is used, or just loads the previously downloaded faster-whisper model.

Since everything except the last two steps is done in sequence and is blocking, it should not try to download too many models at the same time.
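A rough illustration of that order (placeholder function names, not the project's actual code); steps 1-4 run sequentially and block, while the websocket/TTS and Whisper parts get their own threads:

```python
import threading

def start_backend(settings):
    load_text_translation_model(settings)             # 1. blocking
    load_language_classification_model(settings)      # 2. blocking
    if settings.get("faster_whisper"):
        download_faster_whisper_model(settings)       # 3. blocking, optional
    if settings.get("vad_enabled"):
        load_vad_model(settings)                      # 4. blocking, optional
    threading.Thread(target=run_websocket_and_tts, args=(settings,), daemon=True).start()  # 5.
    threading.Thread(target=run_whisper, args=(settings,), daemon=True).start()            # 6.
```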

I really think it might just be a downloading issue, either on the server side or the client side. When all the other models download without issues, I would guess a server-side issue.

--

Would you kindly enable the "Faster Whisper" checkbox and see if this works better? It should be the default anyway since the last two updates, and it downloads from the same servers as the other A.I. models (except the Text 2 Speech).

martjay commented 1 year ago

It is very likely that the large CT2 model caused this problem.

Sharrnah commented 1 year ago

Not sure. According to your error and your screenshot, you are not using a CT2 model; that is only used when Faster Whisper is enabled.

martjay commented 1 year ago

I downloaded and loaded the medium CT2 model; no errors occurred and no stuttering. Although I did not see it enabled on the configuration page, it seems to load that model automatically.

Maybe it's because I'm running out of video memory. My maximum video memory is 8GB. It would be nice if it could automatically release unused models when there is not enough video memory.

Another suggestion: would you consider adding captioning and subtitle translation, as well as live video captioning? I think Whisper can do a lot of meaningful things. I have seen someone else implement this feature, but they use whisper.cpp.

https://github.com/tigros/Whisperer https://github.com/rerender2021/language-shadow

martjay commented 1 year ago

https://github.com/rerender2021/echo

Sharrnah commented 1 year ago

It could be that you created the settings profile before one of the updates that enabled faster-whisper by default. In that case it keeps the previous setting of the profile.

The mentioned error is not a video memory error. It seems it just fails to download the model from the OpenAI server.

Also, faster-whisper is much better suited to graphics cards with lower VRAM. I can run the medium faster-whisper model in only around 1.7GB of VRAM when using float16 precision, or even only around 1GB when using int8_float16 precision.
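For illustration, loading faster-whisper at reduced precision looks roughly like this, assuming the faster-whisper Python package (model size, device, and file name are just examples):

```python
from faster_whisper import WhisperModel

# int8_float16 stores weights in int8 and computes in float16, which is what keeps
# the medium model around the ~1GB VRAM figure mentioned above.
model = WhisperModel("medium", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("speech.wav", task="translate")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```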


From what I can see of the rerender2021/echo project, Whispering Tiger can already do that. You can use one of the existing HTML-based websocket clients: https://github.com/Sharrnah/whispering/blob/main/documentation/websocket-clients.md

and just have that show the realtime translation of a video while routing the PC audio to the application using Voicemeeter (or probably EarTrumpet; I haven't tested that yet, but it was recommended by another user).

About captioning and subtitle translation, I will think about it. It would require reading audio and video files, or, if you mean just the subtitle files, reading those and translating them using the text translator. Definitely possible, but not currently high on my priority list, since the main reason for creating Whispering Tiger was to have live translation of other players (or even of videos without direct access to the video file, like when a YouTube video is played in VRChat).

martjay commented 1 year ago

Can you make it support saving .wav files?

Sharrnah commented 1 year ago

Sorry, I don't really understand. Do you mean:

* Saving Text to Speech output to a wav file?

* Transcribing / translating a wav file using the Whisper A.I. (Speech to Text)?

The first one (saving Text to Speech as wav) is already possible. It's just not (yet) exposed to the native UI.

The second one is not (yet) implemented, but it should not be so difficult. :) I will add that to the todo list.

martjay commented 1 year ago

Saving Text to Speech as Wav, yes!

Sharrnah commented 1 year ago

Okay. Until I implement that in the native UI, you can open the websocket client HTML in your browser of choice.

You can find that HTML in the folder websocket_clients/websocket-remote/index.html

Once it has connected to the backend, it will show some options. Just enter some text into the right textbox and press the button I marked in this screenshot:

grafik

martjay commented 1 year ago

I open the HTML and it says "Connection is closed... retrying", and then it just goes blank. I can't see anything.

Sharrnah commented 1 year ago

You might need to tell it the websocket IP and port, in case you changed it.

For example, you can add ?ws_server=ws://127.0.0.1:5001 if you changed the port to 5001.

So the full URL would look something like this: file:///E:/AI/Whispering-Tiger/websocket_clients/websocket-remote/index.html?ws_server=ws://127.0.0.1:5001

It uses the same protocol as the native UI, so if the native UI works, the HTML client should work as well.
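If you want to rule out the backend itself, a quick connectivity check could look like this; a sketch assuming the websockets Python package, with the port taken from the example above:

```python
import asyncio
import websockets

async def check_ws(url="ws://127.0.0.1:5001"):
    try:
        # Try the websocket handshake with a short timeout.
        async with websockets.connect(url, open_timeout=5):
            print(f"Connected to {url}, the backend websocket is reachable.")
    except Exception as err:
        print(f"Could not reach {url}: {err} (is the backend running and the port correct?)")

asyncio.run(check_ws())
```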

martjay commented 1 year ago

Same problem, I can't use it.

Sharrnah commented 1 year ago

That's strange.

If that still does not work, I have no real idea at the moment, except waiting for me to implement that feature in the native UI.

Sharrnah commented 1 year ago

I added the export .wav function to the native UI.

https://github.com/Sharrnah/whispering-ui/releases/latest

grafik

martjay commented 1 year ago

Thank you, brother, for tirelessly continuing to improve this software.

Another idea: Is it possible to adjust the speaking speed of text 2 speech?

I have two questions:

  1. When starting, it always asks me to download something. 777777

  2. It always shows "loading Whisper". 888888

Sharrnah commented 1 year ago

To point 1:

Did you update, or are you always pressing "No"? If you don't want to update, the new UI version has an option to disable the update check at startup. image

But I would recommend updating it. There can always be updates that need changes in the UI and in Python at the same time.

About point 2: it seems it shows an error about starting with CUDA support.

About your idea for slower Text 2 Speech: that is already possible. You can change the speed or pitch globally in Advanced -> Settings: grafik

or change it inside the text using SSML. grafik

You can read about all the SSML tags supported by the TTS engine here: https://github.com/snakers4/silero-models/wiki/SSML
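For example, an SSML string with prosody tags (based on the Silero SSML wiki linked above) might look like this; how the text gets sent to the backend is not shown here:

```python
# Illustrative SSML payload only; rate/pitch values follow the Silero SSML documentation.
ssml_text = (
    "<speak>"
    "<prosody rate='slow'>This sentence is spoken more slowly.</prosody>"
    "<break time='500ms'/>"
    "<prosody rate='fast' pitch='high'>And this one is faster and higher.</prosody>"
    "</speak>"
)
print(ssml_text)
```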

martjay commented 1 year ago

Thank you, I learned a lot today. I also want to download the latest version, but my computer does not seem to be able to download it. How can I update to the latest version? Is there any other way? Even with a network proxy turned on, it still cannot be downloaded. My computer has CUDA 12.0 installed and all features seem to be available, but it gives a hint that the error exceeds the graphics memory; there is a similar message in the log. I also think the speaker list could be optimized further, for example by adding the speakers' names and previous/next speaker controls.

777777

Sharrnah commented 1 year ago

Sorry, I couldn't read the full error because of the loading message in front of it.

If the error says something like "exceeds the graphics memory", you can do these things:

grafik

About the download issue, I think this might be a firewall issue. So either check your firewall to see if it is blocking some connections, or try the manual way:

This will then tell the UI app that it already has the newest version, and it won't ask for an update again until there really is a new update available.


I am not sure what you mean by the "speaker list". The application does not have any speaker detection (yet).

martjay commented 1 year ago

Text to speech: I think the speaker list could be optimized further, for example by adding the speakers' names and previous/next speaker buttons.

I'm sorry if I didn't make myself clear. It's hard to know how the voices sound when there are just a lot of numbered options, but it is much easier to choose if you have a name or can audition them.

Sharrnah commented 1 year ago

Thank you for the suggestion. I am only taking the names as they come from Silero TTS, and since they only gave numbers to the English voices, those are shown as numbers in the list.

I don't really intend to go through over 100 voices and give them names. :) But it might be a good suggestion for the Silero project: https://github.com/snakers4/silero-models

I guess by "previous and next speakers" you mean buttons to quickly go to the next or previous speaker? You think that could speed up testing so you find a fitting voice faster? That could be something I might be able to add to the UI.

martjay commented 1 year ago

Yes, that is what I mean.

Sharrnah commented 1 year ago

I will close this issue now. I will keep your next/previous voice idea in mind, but I also want to maybe add another TTS in the future, so I will have to see where it goes.

Feel free to open a new issue if you have any more questions.