abb128 / LiveCaptions

Linux Desktop application that provides live captioning
GNU General Public License v3.0

Is API access for live translation support possible / planned? #70

Open dreamcat4 opened 1 year ago

dreamcat4 commented 1 year ago

So the way this might work:

The idea is to emulate / replicate / replace the functionality of YouTube's live caption translation mechanism.

I believe this goal is achievable, using the existing LiveCaptions application as a base. However, it would take some work and effort.

Therefore it most likely needs developer resources, funding, etc., as a sort of 'upgrade' or second round of product iteration: a new push of effort within the same project (it just extends / returns as a 'new season').

Does this sound like a worthwhile goal for the LiveCaptions application? Has there already been any discussion about adding a translation mechanism in the future?

abb128 commented 1 year ago

A primary principle of this application is to run completely offline and not share your data with any third party; ideally the flatpak doesn't even have the network permission. So the translation model would ideally need to run locally, which is something I just haven't looked into much.
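For illustration only: Argos Translate is one FOSS library that can run fully offline once its language packages are downloaded. A rough, untested sketch of its Python API (assumes the en→es package was already installed in a one-time step):

```python
# Rough sketch, untested: offline translation with Argos Translate.
# Assumes the en->es language package has already been installed via
# argostranslate.package (a one-time download; no network needed after).
import argostranslate.translate

translated = argostranslate.translate.translate(
    "Live captions could be translated locally.",  # source text
    "en",  # from-language code
    "es",  # to-language code
)
print(translated)
```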

One potential major issue preventing this from being as usable as YouTube's translated captions is that this application runs in real time and doesn't see future words, while translation can depend heavily on future words and context. It wouldn't be easy to read if the words were constantly shifting and changing as more words come in.
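One possible mitigation is to commit text to the translator only at sentence boundaries, trading latency for stability, so already-displayed translations never shift. A minimal sketch of that idea (the helper name is hypothetical, not an existing LiveCaptions API):

```python
# Minimal sketch: commit text for translation only at sentence boundaries,
# so translated output never shifts as the ASR revises earlier words.
SENTENCE_END = (".", "!", "?")

def split_stable(words):
    """Split a partial ASR hypothesis into (stable, pending) word lists.

    Everything up to and including the last sentence-final word is treated
    as stable; the rest may still be revised by the recognizer.
    """
    last = -1
    for i, w in enumerate(words):
        if w.endswith(SENTENCE_END):
            last = i
    return words[: last + 1], words[last + 1 :]

# Example: only the first sentence would be sent to the translator now.
stable, pending = split_stable("Hello there . how are".split())
print(stable)   # ['Hello', 'there', '.']
print(pending)  # ['how', 'are']
```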

dreamcat4 commented 1 year ago

1) Ideally locally, yes, and there is already an OSS project called Apertium that can run completely offline from local models.

The difficulty is that Apertium may prove hard to install, complex to configure and operate, or have certain other drawbacks in its usage. I know this myself, having tried to install Apertium (and failed miserably).

Therefore, permitting a backup or alternative via a generic API mechanism (not specific solely to Apertium) would be desirable. Let the user make the wisest and most informed choice based on their own specific needs, and of course support any other FOSS alternatives to Apertium too (not just online ones).

This is why I think the choice of translation software should be decoupled from this project. Ultimately we can encourage, facilitate, or strongly promote FOSS options, but to enforce them would be unduly draconian, and it falls outside the scope / jurisdiction of this project once the translation aspect is separated out behind truly open API access. A rough sketch of what such a pluggable interface could look like is just below.
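All class names in this sketch are hypothetical (nothing like this exists in LiveCaptions today); the Apertium backend assumes an installed language pair such as `eng-spa` and uses the standard `apertium` command-line pipe:

```python
# Rough sketch of a pluggable translation backend, so LiveCaptions itself
# would not be tied to any one translation engine.
import subprocess
from abc import ABC, abstractmethod

class TranslationBackend(ABC):
    @abstractmethod
    def translate(self, text: str) -> str: ...

class ApertiumBackend(TranslationBackend):
    """Local translation via the Apertium CLI (assumes the language pair,
    e.g. 'eng-spa', is installed system-wide)."""
    def __init__(self, pair: str):
        self.pair = pair

    def translate(self, text: str) -> str:
        result = subprocess.run(
            ["apertium", self.pair],
            input=text, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

class CommandBackend(TranslationBackend):
    """Generic escape hatch: pipe text through any user-supplied command,
    so the user can plug in whatever FOSS (or other) tool they prefer."""
    def __init__(self, argv: list[str]):
        self.argv = argv

    def translate(self, text: str) -> str:
        result = subprocess.run(
            self.argv, input=text, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
```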

2) Agreed, but this is where a few simple settings (2-3 of them) would come into play: a min/max range for word groupings, and/or a silence / pause delay in speech (to denote completion of a self-contained thought or phrase), plus some maximum allowed duration for lagging behind, since some delay is necessarily involved.

We could also add another option for 'how many rewrites': the maximum number of times the text may be cancelled or amended before moving on. (See the settings sketch just below.)
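For a concrete picture, here is a hypothetical settings block covering the options above; none of these names exist in LiveCaptions today, and the default values are guesses:

```python
# Hypothetical sketch of the handful of settings described above.
from dataclasses import dataclass

@dataclass
class TranslationSettings:
    min_words: int = 3            # smallest word grouping sent to the translator
    max_words: int = 20           # largest grouping before translating anyway
    silence_pause_ms: int = 700   # pause that marks a self-contained phrase
    max_lag_ms: int = 4000        # hard cap on how far captions may lag behind
    max_rewrites: int = 2         # times a segment may be amended before moving on
```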

So with a sensible set of defaults, then yes: it would probably be more laggy than YouTube, if YouTube actually looks ahead in its buffering. However, I'm not actually familiar with whether YouTube does that, or how far ahead, so I'm not sure about the actual lagginess.

But let's say, for argument's sake, the speaker speaks in concise, self-contained sentences or phrases. Then the typical expected lag would be about 1-2 sentences (or 1.1 sentences, if 0.1 of that time is needed to actually do the translation processing).

For me, this is perfectly tolerable compared to understanding 0% of the conversation.

Furthermore, let's say the translation is just not very good quality... this is still many times better than nothing, and again would not be that far from my existing YouTube auto-translate experience, which is often bad yet still enough to get by with some approximate, ballpark understanding.

3) New question: might we also get a final transcript at the end? Perhaps that feature could be implemented as its own thing, without needing much consideration here, other than just re-running the whole text at the 'end' (or manually), in a retroactive sense.

So perhaps (I am not sure at all...) two versions of the text could be kept, so that later on the original can become the translated version, rather than forgetting the buffered words or replacing / swapping them. A rough sketch of that idea is below.
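A rough sketch with hypothetical names (`Transcript` and `translate_fn` stand in for whatever backend is chosen; nothing like this exists yet):

```python
# Hypothetical sketch: remember both the original and translated text for
# each committed segment, so a full transcript can be produced at the end
# (or the whole original re-translated retroactively in one pass).
class Transcript:
    def __init__(self, translate_fn):
        self.translate_fn = translate_fn
        self.segments = []  # list of (original, translated) pairs

    def commit(self, original: str):
        self.segments.append((original, self.translate_fn(original)))

    def final_original(self) -> str:
        return " ".join(orig for orig, _ in self.segments)

    def final_translation(self) -> str:
        # Re-translate the full original text in one pass, giving the
        # translator complete context retroactively.
        return self.translate_fn(self.final_original())
```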

Hopefully this commentary gives a better sense of, or ability to picture, the work that might be involved to implement this feature at some satisfactory level of quality, such that it ends up being a usable feature. Of course there may very well be other important considerations too (which I have not considered / put forward here)!

BMomani commented 1 week ago

Thanks a lot for the tool, and thanks for bringing this translation idea up. One use case is to allow us to translate (😮) voices captured from games.