Text on S3 Boxes

jlpouffier commented 3 months ago

The idea comes from the Voice Assistant Contest. Credit to user Lajos and his entry.

Features added

Spoken text is displayed on the box during the thinking phase

Response text is displayed on the box during the replying phase

This behavior is user-configurable via a switch called Display conversation in Home Assistant.

The value of the switch is restored, and it defaults to ON if no stored value is found (so it will be ON after updating for the first time).
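
A switch with this restore behavior could be declared roughly as follows. This is a minimal sketch, not the exact code of the firmware: the `id` is a hypothetical name, and it assumes ESPHome's template switch with `restore_mode: RESTORE_DEFAULT_ON`, which restores the saved state and falls back to ON when none exists.

```yaml
switch:
  - platform: template
    name: Display conversation
    id: display_conversation          # hypothetical id, chosen for this sketch
    optimistic: true                  # the switch simply keeps the state it is set to
    restore_mode: RESTORE_DEFAULT_ON  # restore the saved value; ON if none is found
```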

Specific changes to the firmware

Allowed characters

In ESPHome, we need to declare in advance which characters we plan to display. Because the firmware is supposed to work for all our supported languages, I looked for a proxy that would be a good approximation of every character we might display. I ended up extracting all unique characters used in the test files of the Home Assistant intents repository.

This is the relevant part of the firmware:

  # These unique characters have been extracted from every test file of every language available on https://github.com/home-assistant/intents (14 March 2024)
  allowed_characters: " !#%'()+,-./0123456789:;<>?@ABCDEFGHIJKLMNOPQRSTUVWYZ[]_abcdefghijklmnopqrstuvwxyz{|}°²³µ¿ÁÂÄÅÉÖÚßàáâãäåæçèéêëìíîðñòóôõöøùúûüýþāăąćčďĐđēėęěğĮįıļľŁłńňőřśšťũūůűųźŻżŽžơưșțΆΈΌΐΑΒΓΔΕΖΗΘΚΜΝΠΡΣΤΥΦάέήίαβγδεζηθικλμνξοπρςστυφχψωϊόύώАБВГДЕЖЗИКЛМНОПРСТУХЦЧШЪЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёђєіїјљњћאבגדהוזחטיכלםמןנסעפץצקרשת،ءآأإئابةتجحخدذرزسشصضطظعغفقكلمنهوىيٹپچڈکگںھہیےংকচতধনফবযরলশষসచయలഅആഇഈഉഎഓകഗങചജഞടഡണതദധനപഫബഭമയരറലളവശസഹൺൻർൽൾაბგდევზთილმნოპრსტუფქყშჩცძჭხạảấầẩậắặẹẽếềểệỉịọỏốồổỗộớờởợụủứừửữựỳ—、一上不个中为主乾了些亮人任低佔何作供依侧係個側偵充光入全关冇冷几切到制前動區卧厅厨及口另右吊后吗启吸呀咗哪唔問啟嗎嘅嘛器圍在场執場外多大始安定客室家密寵对將小少左已帘常幫幾库度庫廊廚廳开式後恆感態成我戲戶户房所扇手打执把拔换掉控插摄整斯新明是景暗更最會有未本模機檯櫃欄次正氏水沒没洗活派温測源溫漏潮激濕灯為無煙照熱燈燥物狀玄现現瓦用發的盞目着睡私空窗立笛管節簾籬紅線红罐置聚聲脚腦腳臥色节著行衣解設調請謝警设调走路車车运連遊運過道邊部都量鎖锁門閂閉開關门闭除隱離電震霧面音頂題顏颜風风食餅餵가간감갔강개거게겨결경고공과관그금급기길깥꺼껐꼽나난내네놀누는능니다닫담대더데도동됐되된됨둡드든등디때떤뜨라래러렇렌려로료른를리림링마많명몇모무문물뭐바밝방배변보부불블빨뽑사산상색서설성세센션소쇼수스습시신실싱아안않알았애야어얼업없었에여연열옆오온완외왼요운움워원위으은을음의이인일임입있작잠장재전절정제져조족종주줄중줘지직진짐쪽차창천최추출충치침커컴켜켰쿠크키탁탄태탬터텔통트튼티파팬퍼폰표퓨플핑한함해했행혀현화활후휴힘，？"

Because this solution is not perfect, the list is defined as a substitution so that users can still add any characters that were missed.
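
As a rough illustration of how the substitution can be consumed, the sketch below passes it to the `glyphs` option of a font. The character list is deliberately shortened, and the font family, `id`, and size are assumptions made for this example rather than values taken from the firmware.

```yaml
substitutions:
  # Shortened for illustration; the real value is the full list shown above.
  allowed_characters: " !?abcdefghijklmnopqrstuvwxyz"

font:
  - file:
      type: gfonts
      family: Figtree               # assumed font family for this sketch
    id: font_request                # hypothetical id
    size: 15
    glyphs: ${allowed_characters}   # only these characters can be rendered
```

Since the list lives in `substitutions`, a user who builds on this configuration should be able to redefine `allowed_characters` in their own `substitutions` block, copying the list and appending whatever is missing.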

Two-stage thinking phase

Interestingly, the thinking phase starts at the end of the VAD stage, in the middle of the STT phase. This way, the thinking phase also covers the time the STT engine takes to fully decode the spoken command.

This means that when the thinking phase starts, the spoken text is not yet known: the silence has just been detected, and the last chunk of audio is still being processed.

At first, I thought that this would be an issue, but I like it even better now.

At the end of the VAD stage, the thinking phase is displayed. The spoken text is not yet known, so three dots (...) are displayed instead. (Basically meaning: "I am still trying to figure out what you told me")
Once the STT phase is over, the screen is refreshed with the spoken text. (Basically meaning: "OK, now I understand what you told me. I am still figuring out what to do and how to reply to you")

The two stages are visible when the STT engine is slow.
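
The two stages map naturally onto two triggers of the ESPHome `voice_assistant` component, `on_stt_vad_end` and `on_stt_end`. The fragment below is a hedged sketch of that idea, not the firmware's actual code: the `text_request` sensor and the `draw_display` script are hypothetical names, the script body is a placeholder, and the other options `voice_assistant` needs (such as the microphone) are omitted.

```yaml
text_sensor:
  - platform: template
    id: text_request              # hypothetical sensor holding the spoken text

script:
  - id: draw_display              # hypothetical script that refreshes the screen
    then:
      - logger.log: "Redrawing display"   # placeholder for the real drawing logic

voice_assistant:
  # Other required options (microphone, etc.) are omitted in this sketch.
  on_stt_vad_end:
    # Stage 1: silence detected, transcription still running, so show "...".
    - text_sensor.template.publish:
        id: text_request
        state: "..."
    - script.execute: draw_display
  on_stt_end:
    # Stage 2: STT is done; `x` holds the recognized text.
    - text_sensor.template.publish:
        id: text_request
        state: !lambda return x;
    - script.execute: draw_display
```

With a fast STT engine the two triggers fire almost back to back, so the dots may never be noticeable.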