ShantanuNair closed this issue 1 year ago
@lukeocodes Any chance a relevant team could look at this?
@ShantanuNair sorry this slipped through the gaps. I'll ask internally about this one!
@lukeocodes this may be related to this other issue: https://github.com/orgs/deepgram/discussions/196
Our team released a fix for it 4 days ago, so @ShantanuNair if you haven't retried this in a few days then it might be worth a quick re-run.
@jjmaldonis @lukeocodes Great, I'll give it another try soon and update this issue.
I can confirm this is fixed — I searched through.
Fun tip: Mojibake (Japanese: 文字化け; IPA: [mod͡ʑibake], "character transformation") is the garbled text that is the result of text being decoded using an unintended character encoding.
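To make the definition concrete, here is a minimal sketch of how mojibake arises: UTF-8 bytes for a Japanese string are decoded as ISO-8859-1 (Latin-1), which yields garbled text, and then the mistake is reversed. The helper names `decodeLatin1` and `encodeLatin1` are made up for this illustration, and Latin-1 is just one convenient "wrong" encoding; nothing here is taken from the API discussed in this issue.

```typescript
// Decode bytes as ISO-8859-1: each byte maps to the code point of the same value.
function decodeLatin1(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Inverse of decodeLatin1 for strings whose code units are all below 256.
function encodeLatin1(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i);
  }
  return bytes;
}

const original = "文字化け";

// Correct UTF-8 bytes (three bytes per character for this string).
const utf8Bytes = new TextEncoder().encode(original);

// Decoding those bytes with the wrong encoding produces mojibake.
const garbled = decodeLatin1(utf8Bytes);

// Because Latin-1 decoding is lossless, the original text can be recovered
// by re-encoding as Latin-1 and decoding as UTF-8.
const recovered = new TextDecoder("utf-8").decode(encodeLatin1(garbled));

console.log(garbled.length, recovered === original);
```

Note that mojibake of this kind is reversible because no bytes were lost, which is what distinguishes it from text that already contains substituted characters.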
Although I'm not sure what this � symbol is commonly called. It is neither "tofu" nor "mojibake".
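For what it's worth, that symbol is U+FFFD, which Unicode names REPLACEMENT CHARACTER. Decoders substitute it for byte sequences that are not valid in the expected encoding. The sketch below shows one way it can appear: a UTF-8 sequence that has been cut short. Whether that is what happened inside the API is an assumption on my part, not something established in this thread, and `decodeStrict` is a hypothetical helper name.

```typescript
const REPLACEMENT_CHAR = "\uFFFD";

// "文" is encoded in UTF-8 as the three bytes E6 96 87; keep only the first two.
const truncated = new TextEncoder().encode("文").slice(0, 2);

// The default (non-fatal) decoder substitutes U+FFFD for the malformed sequence.
const lenient = new TextDecoder("utf-8").decode(truncated);

// With { fatal: true } the decoder throws on malformed input instead,
// which makes invalid bytes easy to detect before they reach the user.
function decodeStrict(bytes: Uint8Array): string | null {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return null; // malformed UTF-8
  }
}

console.log(lenient.includes(REPLACEMENT_CHAR), decodeStrict(truncated));
```

Once U+FFFD has been substituted, the original bytes are gone, so the text cannot be repaired after the fact the way mojibake sometimes can.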
What is the current behavior?
I am testing Japanese transcription with the Whisper model. I notice that the API responds with invalid (non-UTF-8) characters as part of the transcript.
Steps to reproduce
Run a Japanese transcription with the Whisper model. The exact endpoint I POSTed to was:
https://api.deepgram.com/v1/listen?model=whisper&punctuate=true&diarize=true&smart_format=true&language=ja
Notice outputs like:
Expected behavior
A response with correctly encoded UTF-8 characters and no invalid characters.
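One way to check the expected behavior automatically is to scan the returned transcript for U+FFFD, the Unicode replacement character that decoders substitute for invalid bytes. This is a minimal sketch; `findReplacementChars` is a hypothetical helper, and the sample strings are invented for illustration rather than taken from a real response.

```typescript
// Return the index of every U+FFFD replacement character in the transcript.
function findReplacementChars(transcript: string): number[] {
  const positions: number[] = [];
  let index = transcript.indexOf("\uFFFD");
  while (index !== -1) {
    positions.push(index);
    index = transcript.indexOf("\uFFFD", index + 1);
  }
  return positions;
}

// Invented sample inputs: one clean, one containing replacement characters.
const cleanSample = "こんにちは";
const damagedSample = "こん\uFFFDにち\uFFFD";

console.log(findReplacementChars(cleanSample).length === 0);
console.log(findReplacementChars(damagedSample));
```

An empty result means the transcript contains no substituted characters; it does not prove the transcription itself is accurate.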
Please tell us about your environment
Noticed this in Node v18. Tested via Postman as well.
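For anyone retrying this from Node 18, the request could be sketched roughly as follows. The query parameters are the ones from the endpoint above. The `Token` authorization scheme, the JSON body with a `url` field, and the response path `results.channels[0].alternatives[0].transcript` are my assumptions about the API and should be checked against the official reference; `buildListenUrl` and `transcribe` are hypothetical names, and the API key and audio URL must be supplied by the caller.

```typescript
const LISTEN_ENDPOINT = "https://api.deepgram.com/v1/listen";

// Build the request URL with the same query parameters used in the report.
function buildListenUrl(): string {
  const params = new URLSearchParams({
    model: "whisper",
    punctuate: "true",
    diarize: "true",
    smart_format: "true",
    language: "ja",
  });
  return `${LISTEN_ENDPOINT}?${params.toString()}`;
}

// POST a hosted audio file for transcription and return the transcript text.
// Assumed request and response shapes; verify them before relying on this.
async function transcribe(apiKey: string, audioUrl: string): Promise<string> {
  const response = await fetch(buildListenUrl(), {
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: audioUrl }),
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const payload: any = await response.json();
  // Fall back to an empty string if the assumed path is missing.
  return payload?.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? "";
}
```

Node 18 ships a global `fetch`, so no extra HTTP dependency is needed for a quick re-run.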