Speech to text recognition for voice messages

Frooxius commented 3 months ago

Is your feature request related to a problem? Please describe.

It's not always convenient to listen to voice messages on Resonite. They can be long, they can't be skimmed easily, and the user may be in a loud environment where they can't hear them.

Describe the solution you'd like

Run speech-to-text recognition on received voice messages. This could either run automatically once a message has been received or be triggered manually for each message.

We would likely use Whisper for this functionality, given that it's a very robust model, is free to use, and can be run on our own hardware.

The actual recognition, however, would likely run on a server, since the model uses a large amount of VRAM and GPU compute, which wouldn't work well when the user is already running Resonite.

As such, this would likely be a Patreon-only feature, since it would incur additional costs on our end. This would be in line with how some other platforms do this though (e.g. Telegram).

Describe alternatives you've considered

N/A - but check additional context

Additional Context

I'm not actually 100% sure whether this is something people would be interested in or that we should work on. I do think it would be beneficial, but I'd like to see how much interest there would be in having this feature.

If you'd like to see this, please let us know!

Requesters

No response

shiftyscales commented 3 months ago

Speech to text was also requested in #50 more broadly. I've seen accessibility tooling on-platform (e.g. in Creator Jams) that makes use of speech to text, optionally combined with a translation service to output the speech as text in multiple languages.

There is definitely interest in, and use of, speech-to-text more broadly across the application. I've not heard of a request specifically for use in voice messages, but it is a logical conclusion that some form of transcription could be a beneficial accessibility feature for messaging.

I do have ethical concerns around OpenAI, in particular their stance on the scraping and use of copyrighted materials in the training of their models.

Likewise, the lack of transparency around where their dataset is sourced furthers my belief that their datasets, and thus the technology built upon them, are not ethical.

"trained on 680,000 hours of multilingual and multitask supervised data collected from the web"

Frooxius commented 3 months ago

From what I can see, #50 is requesting the reverse - text to speech.

The translation would also be useful, but I feel that one is a bit harder, since it needs to be more real-time, which means processing significantly more data, and it might not be something we can afford to provide at a cheap price.

I do have concerns about the ethics of OpenAI too, but I feel these concerns are more relevant to their large language models and image diffusion models, since those are creating new works based on the works of other people and artists. Those models themselves are closed and not freely accessible either.

Whisper, on the other hand, is not recreating any works. It's used to analyze audio and transcribe it into words, and the code as well as the models themselves are available for free under a permissive MIT license.

Yellow-Dog-Man / Resonite-Issues