Whisper non-utf-8 characters for Japanese

deepgram / deepgram-js-sdk

Official JavaScript SDK for Deepgram's automated speech recognition APIs.

https://developers.deepgram.com

MIT License


Whisper non-utf-8 characters for Japanese #133

Closed: ShantanuNair closed this issue 1 year ago

ShantanuNair commented 1 year ago

What is the current behavior?

I am testing Japanese transcriptions via the Whisper model. I notice that the API responds with non-UTF-8 characters as part of the transcript.

Steps to reproduce

Run a Japanese transcription with the Whisper model. The exact endpoint I POSTed to was:

https://api.deepgram.com/v1/listen?model=whisper&punctuate=true&diarize=true&smart_format=true&language=ja

Notice outputs like:

 ��るだけ目にし

Expected behavior

A response with correctly encoded UTF-8 characters and no invalid characters.
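
For anyone wanting to check their own responses against this expectation, a minimal sketch of scanning a transcript string for U+FFFD (the character rendered as � in the output quoted above) might look like the following. The helper name `findReplacementChars` is hypothetical and not part of the Deepgram SDK, and the sample string only mirrors the snippet from this report.

```javascript
// Hypothetical helper (not part of any SDK): returns the string indices at
// which U+FFFD, the Unicode replacement character, appears in a transcript.
function findReplacementChars(transcript) {
  const positions = [];
  for (let i = 0; i < transcript.length; i++) {
    if (transcript[i] === "\uFFFD") {
      positions.push(i);
    }
  }
  return positions;
}

// Sample modeled on the output quoted in this issue.
const sample = " \uFFFD\uFFFDるだけ目にし";
console.log(findReplacementChars(sample)); // [ 1, 2 ]
```

An empty array would indicate that the transcript contains no replacement characters.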

Please tell us about your environment

Noticed this in Node v18. Tested via Postman as well.

ShantanuNair commented 1 year ago

@lukeocodes Any chance a relevant team could look at this?

lukeocodes commented 1 year ago

@ShantanuNair sorry this slipped through the gaps. I'll ask internally about this one!

jjmaldonis commented 1 year ago

@lukeocodes this may be related to this other issue: https://github.com/orgs/deepgram/discussions/196

Our team released a fix for it 4 days ago, so @ShantanuNair if you haven't retried this in a few days then it might be worth a quick re-run.

ShantanuNair commented 1 year ago

@jjmaldonis @lukeocodes Great, I'll give it another try soon and update this issue.

ShantanuNair commented 1 year ago

I can confirm this is fixed — I searched through.

Fun tip: Mojibake (Japanese: 文字化け; IPA: [mod͡ʑibake], "character transformation") is the garbled text that is the result of text being decoded using an unintended character encoding.

I'm not sure what this � symbol is commonly called, though. It is neither "tofu" nor "mojibake".
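
For reference, the � symbol is U+FFFD, the Unicode REPLACEMENT CHARACTER, which decoders emit in place of bytes they cannot interpret as valid UTF-8. One common way it ends up in Japanese text is decoding byte chunks independently when a chunk boundary falls inside a multi-byte character. This is not necessarily what caused this particular issue, since the thread does not say what the fix was. The sketch below illustrates the effect with the standard `TextEncoder` and `TextDecoder` globals available in Node; the function names are hypothetical.

```javascript
// Hypothetical helper: decodes every chunk on its own, so a character whose
// bytes are split across two chunks cannot be reassembled.
function decodeChunksNaively(chunks) {
  return chunks.map((chunk) => new TextDecoder("utf-8").decode(chunk)).join("");
}

// Hypothetical helper: reuses one decoder in streaming mode, so incomplete
// trailing bytes are buffered until the next chunk arrives.
function decodeChunksStreaming(chunks) {
  const decoder = new TextDecoder("utf-8");
  let text = "";
  for (const chunk of chunks) {
    text += decoder.decode(chunk, { stream: true });
  }
  return text + decoder.decode(); // flush any bytes still buffered
}

const original = "できるだけ";
const bytes = new TextEncoder().encode(original); // each kana here is 3 bytes
const chunks = [bytes.slice(0, 2), bytes.slice(2)]; // split inside the first kana

console.log(decodeChunksNaively(chunks).includes("\uFFFD")); // true
console.log(decodeChunksStreaming(chunks) === original); // true
```

The streaming variant only helps when the bytes themselves are intact; if replacement characters are already present in the received text, the original bytes cannot be recovered on the client side.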