Open righteoustales opened 2 weeks ago
Do you get a callback with the final set of results, `isFinal: true`? If so, what's the full content of the results at that point? Is the "+5" the only element in the recognition results list? I haven't yet tried to reproduce this but I haven't seen this behaviour before.
> Do you get a callback with the final set of results `isFinal: true`?

No. But the listen was still active so I didn't expect to.

> If so what's the full content of the results at that point? Is the "+5" the only element in the recognition results list?

Yes.
> I haven't yet tried to reproduce this but I haven't seen this behaviour before.

Here's some more info that I think may be helpful. If I change my `SpeechListenOptions` to only specify `onDevice: true`, then I see exactly the behavior reported above. But if I change `onDevice` to false, I can wait longer than 2 mins and not see this behavior, and that is true even if I turn off the wifi on my laptop. I tested this several times this AM and was able to pause before the "+5" for at least 2 mins and still not see the string returned being reset to empty. I also did not see the listen time out for the duration of the 2+ mins that I paused before saying "+5". Switching back to `onDevice: true` causes the problem to reoccur every time.
How is onDevice different such that it might cause this? I couldn't find a definition for what setting it to true does in the API doc, but intuited that it would not call the cloud for the purpose of voice recognition. But, given it works to set this as onDevice: false and with no internet connectivity, perhaps I intuited wrongly?
Btw, I both downgraded to 6.5.1 and upgraded to 7.0.0 as part of my testing. All have the same behavior.
Thanks for the details. I'll try to reproduce and let you know what I find. If you have a chance to try stopping the recognition and finding out what the recognition result is when final is true that would be interesting.
You're mostly correct about the behaviour of `onDevice`; I should update the docs to provide more details. With `onDevice: true`, recognition MUST be done on device and will fail completely if the device cannot do that. When false, it is up to the device to decide; some or most recognition may happen on device, particularly with newer devices, but no guarantees are made.
If I issue `stop()`, the final result (i.e. when the `SpeechRecognitionResult`'s `finalResult == true`) is the "+5".
I'm also experiencing the exact same thing. I've been using this plugin for about a month now and only recently have started noticing this issue. Maybe something changed on Apple's side? I'm also on version 6.6.2 on iOS.
I am using something close to the example app but keep track of every time `isFinal` is true and keep a history of the whole sentence. For some reason, now when I pause and it clears, `isFinal` is false.
Seems to be an Apple issue: https://forums.developer.apple.com/forums/thread/762952 and https://forums.developer.apple.com/forums/thread/731761
It's unfortunate that the first issue above (../762952) reports that this issue does not occur on iOS 17 but started with 18. That is not what we are seeing. I submitted a comment over there informing them that I am seeing the same behavior on iOS 17.

@flutterocks What version of iOS are you running?
thanks @flutterocks, this is helpful. @righteoustales is this a problem that started for you relatively recently? The 731761 thread implies the issue only happens with on device recognition, is that what you are seeing?
If it's an iOS issue then I doubt I can do anything useful at the plugin level to help resolve it unless there's been an API change that I missed.
> The 731761 thread implies the issue only happens with on device recognition, is that what you are seeing?

That's what I documented above in this thread as well. I only recently started using this Flutter library so I can't speak to the history of it working or not. I noticed it was broken immediately after setting the on-device flag to true.
@righteoustales I've recently updated to iOS 18, which is probably why I am only experiencing this now, though I have `onDevice` set to `false`.

Perhaps this 'broken' experience happens when `onDevice` is `true` on older iOS versions, and as of iOS 18 it happens in either case. Or maybe for some reason iOS 18 will heavily favour on-device recognition regardless of the flag's value.
Regardless of what might be causing this, I agree @sowens-csd, there isn't much this package can do to resolve it. Though I will implement the suggestion from 731761, using the timestamp to help determine if the result is 'final', likely in combination with comparing against the previous result (to prevent accidentally marking as final if there is latency).
Some pseudo code of what I'm thinking:

```
likelyFinal = (prevResult.recognizedWords.length > currResult.recognizedWords.length)
    && ((currResult.timeStamp - prevResult.timeStamp) > X)
```

where I will experiment with X to find what works; it will likely have a value of ~1-2 seconds. I'll implement it this weekend on my end.
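The same heuristic can be sketched out, in Python purely for illustration (the names, signature, and the 1.5 s threshold are my assumptions, not anything from the plugin):

```python
def likely_final(prev_words: str, prev_ts: float,
                 curr_words: str, curr_ts: float,
                 gap_threshold: float = 1.5) -> bool:
    """Guess that the previous result was 'final' when the new result is
    shorter (iOS threw the earlier text away) and arrived after a pause
    longer than gap_threshold seconds."""
    text_shrank = len(prev_words) > len(curr_words)
    long_pause = (curr_ts - prev_ts) > gap_threshold
    return text_shrank and long_pause

# The failure case from this thread: "add 1+2+3+4", then "+5" after a pause.
print(likely_final("add 1+2+3+4", 10.0, "+5", 12.5))  # True
# Normal growth of a partial result should not trigger it.
print(likely_final("add 1", 10.0, "add 1+2", 10.4))   # False
```

The gap threshold is the tuning knob mentioned above; too small and slow recognition latency causes false positives, too large and genuine resets are missed.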
Given this seems to be impacting a lot of users, I could see value in having this directly inside speech_to_text, in addition to `finalResult`, but I'll leave that up to @sowens-csd given the bloat this would introduce.
I'm disappointed to hear that the workaround of setting the flag to false doesn't even work in iOS 18. @flutterocks, are you the person who reported it over on the Apple forum and to whom I replied? It's a different name there, but I'm sure we all have multiple names that we use spanning various forums over the years.
@righteoustales not me, no. I just found the threads from some googling to see if the issue was Flutter-specific or Apple's.
@flutterocks so your thinking in that work around you suggested is that Apple is essentially starting a new recognition? So the goal would be to deliver the previous final results in some way so that the user knows they should be stored and that a new set of results will start? It's an interesting idea. Naively I was hoping that Apple would fix their implementation, but that could of course take a while. One problem is that I've seen some fairly long delays to the final results and that the speech recognition engine will not infrequently reinterpret previous results based on new context, which could result in false positives from that test. Also it would have to be iOS specific since the other engines don't have the same failure mode.
I agree that the impact of the failure is fairly large, it would be good to be able to help mitigate it.
@sowens-csd @flutterocks I was going to point out something similar to the "One problem" comment above. It doesn't work to save what was there previously for comparison, as the recognition logic frequently reinterprets the text it first (and second, and third) delivered the more that you speak. For example, in my example above, if I had said:
"add 347.12 + 1"
You can watch in real time as the first number is first recognized as 300, then 347, and so on as the recognition logic is processing. Given that, it can become very difficult to use comparison to distinguish between a reinterpretation of everything said so far versus when it is simply throwing away all of the preceding text and starting fresh. Have either of you found a way to tell the difference between the two? Does looking at the segment timestamp as proposed above actually work? I don't think that comparing to the previous result is going to work.
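To see why plain comparison can't distinguish the two cases, here's a hedged Python sketch (`naive_dropped_prefix` is an invented helper for illustration, not plugin code): the same test fires on both ordinary reinterpretation and a genuine reset.

```python
def naive_dropped_prefix(prev: str, curr: str) -> bool:
    # Naive reset test: the new partial no longer extends the old one.
    return not curr.startswith(prev)

# Ordinary refinement ("300" becomes "347") trips the test...
print(naive_dropped_prefix("add 300", "add 347"))     # True (false positive)
# ...and so does a genuine reset, so the two are indistinguishable.
print(naive_dropped_prefix("add 1+2+3+4", "+5"))      # True (real reset)
# Only pure growth of the partial passes.
print(naive_dropped_prefix("add 347", "add 347.12"))  # False
```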
This feels like an Apple bug unless they manifest data with the results returned that can reliably be used to prevent the loss of previously spoken text.
Just downloaded Apple's SpokenWord demo mentioned in 762952 and I'm experiencing the same issue, so it's fully confirmed it has to do with Apple.
I inspected the results and here's some interesting findings:

- `speechRecognitionMetadata` is null except for when I take a few seconds pause, aka when I would expect `isFinal` to be true
- when `bestTranscription` clears, timestamp becomes 0 for the new partial transcription
- note: onDevice is true in this demo

There's a few options to explore:

- derive `isFinal` from the presence of `speechRecognitionMetadata` (though not sure if this behaviour is consistent on older iOS, or when not on-device)
- derive `isFinal` from succeeding transcriptions containing timestamp = 0

In any case these solutions would likely be temporary until / if Apple fixes their bug. I'll probably personally wait until iOS 18 officially releases next week to see if this still happens, but @righteoustales you're experiencing it on 17.6.1. Can you download the Apple sample and see if the same behaviour I described above happens?
Sure.

Without any change, it manifests the problem discussed in this thread. Broken.

With only the one-line change (setting `requiresOnDeviceRecognition = false`), it does not. Not broken.
Just to confirm, are you (@flutterocks) saying that the one-line change that I did above does not help at all on iOS 18? I.e. that it drops the text equally whether that flag is set to true or false? And, if so, have you also tried testing it with network connectivity/wifi completely disabled? Any difference then?
Correct, even with `requiresOnDeviceRecognition = false` I'm experiencing the text-dropping behaviour, and `isFinal` is always false. That is the case with wifi connected. I just tested without wifi and it's the same thing, which is expected given your scenario is on-device.
Thanks for confirming and trying that additional test.
Btw, I also updated https://developer.apple.com/forums/thread/731761 with my own comments/test experience.
That Apple forum update I did has not yet been approved for some reason. Slackers. LOL.
I also messed around with setting the task hint between unspecified, dictation, search, and confirmation. None of them help.
@righteoustales are you able to confirm if the following behaves the same for you? (specifically the last two points)
I inspected the results and here's some interesting findings:
- `isFinal` is always false
- `speechRecognitionMetadata` is null except for when I take a few seconds pause, aka when I would expect `isFinal` to be true
- when the `bestTranscription` clears, timestamp becomes 0 for the new partial transcription
All of the above 3 assertions are true for me as well.
Given how old those two forum questions are on the Apple developer forums, and the complete absence of any acknowledgment from Apple on either, I'm not feeling very hopeful that they will do anything on this. But I don't frequent their forums much. Any experience otherwise that is more hopeful than my conclusion here?
My current plan is to see how things look when iOS 18 is released and decide accordingly given that. I think I read that that release is imminent, maybe next week.
@righteoustales It's meant to release on the 16th, I believe. I too will wait for that and hope for the best...

Btw, do the `speechRecognitionMetadata` and timestamp behave the same for you regardless of what `requiresOnDeviceRecognition` is set to?
I'm sure they don't, given setting it to false actually works without dropping text.
Quick update after upgrading to iOS 18, since it was released today:

This API is now broken as described herein regardless of whether the `requiresOnDeviceRecognition` flag is set to true or false. Goodbye, friendly workaround.
I also updated the two Apple forum threads. Maybe a bit of activity there will flush them out of the woodwork to comment on it, but I doubt it.
Summary of where we are from my perspective:
I question whether a developer would ever want this throw-away behavior, but will say with considerable certainty that they would for sure not want it if their task hint was set to "dictation".
Given that, I'm wondering if it is worthwhile for this speech_to_text (flutter) feature to deal with this (what I'm calling a) bug by noticing the deletion/start-over and then mitigating it by (re)prepending the words thrown away. And, if not comfortable with doing that for all cases, then perhaps doing so if the developer indicates they want it (via taskhint or other).
Without something of this nature, the speech-to-text results seem pretty unusable, because as noted earlier in this thread the caller of this Flutter API:

1. does not have access to the metadata properties (e.g. the timestamp reset to 0, visible only via iOS API objects) that indicate the reset occurred
2. cannot simply compare the current to the previous results to see what was dropped, due to the ongoing changes that occur as recognition refines the words recognized
3. for those that don't enable partial results, will never see the words tossed, because they are deleted before listening is stopped
Thoughts?
@righteoustales I'd have to agree with everything you said. Sure seems like a bug to me, at least a pretty major breaking behaviour change if it's not a bug. Supporting some mitigation in the plugin seems like the right path forward. Should Apple fix this then I'd think the mitigation would revert to a no-op since hopefully the timestamp reset would stop happening. I'll try to put together a beta and hopefully some folks can give it a try.
> Should Apple fix this then I'd think the mitigation would revert to a no-op since hopefully the timestamp reset would stop happening.

Exactly. I like your thought process around keeping it transparent, such that it becomes a harmless no-op when/if they fix their mess.
@flutterocks I can't reproduce the timestamp == 0, or the metadata being non-nil only for the replace case, either in the plugin or in the iOS speech app you referenced. I can get it to reproduce the behaviour of replacing all the previous content. Where you say timestamp, do you mean `SpeechRecognitionMetadata.speechStartTimestamp`?

What I'm seeing is that `speechStartTimestamp` is always a non-zero value. This is on iOS 18 on an iPhone Xs.
```swift
let currentT = result.speechRecognitionMetadata?.speechStartTimestamp ?? -1
let bestT = result.bestTranscription.formattedString
print("\(currentT), \(bestT)")
```
If I add this to the recognitionTask code in the iOS sample speech app then say "Can you"..."hear this" I get this output:
```
-1.0, Can
-1.0, Can you
1.5, Can you
-1.0, Hear
-1.0, Hear this
5.82, Hear this
```
Example logging that I just did under iOS 18 running their SpokenWord sample app. The timestamps that you can see inside `bestTranscription` below show the timestamp starting at 0, growing, then being reset to zero at "hear", using your example text "Can you ... hear this":
"isfinal = " Optional(false) "speechRecognitionMetadata = " nil "bestTranscription = " Optional(<SFTranscription: 0x302907f30>, formattedString=Can, segments=( "<SFTranscriptionSegment: 0x30035f000>, substringRange={0, 3}, timestamp=0, duration=0.011, confidence=0, substring=Can, alternativeSubstrings=(\n), phoneSequence=, ipaPhoneSequence=, voiceAnalytics=(null)" ), speakingRate=0.000000, averagePauseDuration=0.000000) "isfinal = " Optional(false) "speechRecognitionMetadata = " nil "bestTranscription = " Optional(<SFTranscription: 0x3029f4ff0>, formattedString=Can you, segments=( "<SFTranscriptionSegment: 0x300363180>, substringRange={0, 3}, timestamp=0, duration=0.011, confidence=0, substring=Can, alternativeSubstrings=(\n), phoneSequence=, ipaPhoneSequence=, voiceAnalytics=(null)", "<SFTranscriptionSegment: 0x300362a60>, substringRange={4, 3}, timestamp=0.011, duration=0.011, confidence=0, substring=you, alternativeSubstrings=(\n), phoneSequence=, ipaPhoneSequence=, voiceAnalytics=(null)" ), speakingRate=0.000000, averagePauseDuration=0.000000) "isfinal = " Optional(false) "speechRecognitionMetadata = " Optional(<SFSpeechRecognitionMetadata: 0x302918000>, speakingRate=105.2631578947368, averagePauseDuration=0.63, speechStartTimestamp=0.87, speechDuration=1.14, voiceAnalytics=<SFVoiceAnalytics: 0x302919080>, jitter=<SFAcousticFeature: 0x30276a700>, featureValues=( "0.4987460970878601", "0.3749962747097015", "0.1248437315225601", 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ), frameDuration=0.060000, shimmer=<SFAcousticFeature: 0x30276a720>, featureValues=( "0.8899724914652929", "0.8899724914652929", 0, "0.3784862775788499", "0.3784862775788499", "0.3784862775788499", "0.3784862775788499", "0.2857474289565418", "0.2857474289565418", "0.8161851543713406", "0.8161851543713406", "0.5304377254147988", "0.7906075691773117", "0.2601698437625128", "0.2601698437625128", "0.6728613109400982", "0.4126914671775854", "0.4126914671775854", "0.4126914671775854" ), 
frameDuration=0.060000, pitch=<SFAcousticFeature: 0x30276a7a0>, featureValues=( "0.007840520702302456", "0.113343857228756", "0.08692201226949692", "0.007010921835899353", "0.0401165559887886", "-0.002225682372227311", "0.03897630423307419", "-0.01068341638892889", "0.03600436076521873", "0.03275604546070099", "-0.008439932018518448", "0.01869481801986694", "0.02345035597681999", "0.09223382174968719", "0.008906735107302666", "0.07839695364236832", "0.02610092982649803", "-0.04739493876695633", "-0.04630550742149353" ), frameDuration=0.060000, voicing=<SFAcousticFeature: 0x30276aa00>, featureValues=( "0.125035896897316", "0.1786487996578217", "0.1792247593402863", "0.1529182344675064", "0.1237559765577316", "0.05911365896463394", "0.006110473070293665", "0.006131512112915516", "0.008723135106265545", "0.007411795202642679", "0.006429999135434628", "0.006139676086604595", "0.005556123796850443", "0.002175267785787582", "0.002166431862860918", "0.003816721495240927", "0.003316614776849747", "0.004189540632069111", "0.004460837226361036" ), frameDuration=0.060000) "bestTranscription = " Optional(<SFTranscription: 0x302918c00>, formattedString=Can you, segments=( "<SFTranscriptionSegment: 0x30036c060>, substringRange={0, 3}, timestamp=0.87, duration=0.63, confidence=0.993, substring=Can, alternativeSubstrings=(\n), phoneSequence=k AA n, ipaPhoneSequence=k.\U02c8\U00e6.n, voiceAnalytics=<SFVoiceAnalytics: 0x302919830>, jitter=<SFAcousticFeature: 0x302768100>, featureValues=(\n \"0.4987460970878601\",\n \"0.3749962747097015\",\n \"0.1248437315225601\",\n 0,\n 0,\n 0,\n 0,\n 0,\n 0,\n 0,\n 0\n), frameDuration=0.060000, shimmer=<SFAcousticFeature: 0x30276a200>, featureValues=(\n \"0.8899724914652929\",\n \"0.8899724914652929\",\n 0,\n \"0.3784862775788499\",\n \"0.3784862775788499\",\n \"0.3784862775788499\",\n \"0.3784862775788499\",\n \"0.2857474289565418\",\n \"0.2857474289565418\",\n \"0.8161851543713406\",\n \"0.8161851543713406\"\n), frameDuration=0.060000, 
pitch=<SFAcousticFeature: 0x30276a120>, featureValues=(\n \"0.007840520702302456\",\n \"0.113343857228756\",\n \"0.08692201226949692\",\n \"0.007010921835899353\",\n \"0.0401165559887886\",\n \"-0.002225682372227311\",\n \"0.03897630423307419\",\n \"-0.01068341638892889\",\n \"0.03600436076521873\",\n \"0.03275604546070099\",\n \"-0.008439932018518448\"\n), frameDuration=0.060000, voicing=<SFAcousticFeature: 0x302769f40>, featureValues=(\n \"0.125035896897316\",\n \"0.1786487996578217\",\n \"0.1792247593402863\",\n \"0.1529182344675064\",\n \"0.1237559765577316\",\n \"0.05911365896463394\",\n \"0.006110473070293665\",\n \"0.006131512112915516\",\n \"0.008723135106265545\",\n \"0.007411795202642679\",\n \"0.006429999135434628\"\n), frameDuration=0.060000", "<SFTranscriptionSegment: 0x30036c120>, substringRange={4, 3}, timestamp=1.56, duration=0.4500000000000002, confidence=0.993, substring=you, alternativeSubstrings=(\n), phoneSequence=y OOH, ipaPhoneSequence=j.\U02c8u, voiceAnalytics=<SFVoiceAnalytics: 0x302918ba0>, jitter=<SFAcousticFeature: 0x302769d20>, featureValues=(\n 0,\n 0,\n 0,\n 0,\n 0,\n 0,\n 0\n), frameDuration=0.060000, shimmer=<SFAcousticFeature: 0x302769ca0>, featureValues=(\n \"0.7906075691773117\",\n \"0.2601698437625128\",\n \"0.2601698437625128\",\n \"0.6728613109400982\",\n \"0.4126914671775854\",\n \"0.4126914671775854\",\n \"0.4126914671775854\"\n), frameDuration=0.060000, pitch=<SFAcousticFeature: 0x302769c60>, featureValues=(\n \"0.02345035597681999\",\n \"0.09223382174968719\",\n \"0.008906735107302666\",\n \"0.07839695364236832\",\n \"0.02610092982649803\",\n \"-0.04739493876695633\",\n \"-0.04630550742149353\"\n), frameDuration=0.060000, voicing=<SFAcousticFeature: 0x302769be0>, featureValues=(\n \"0.005556123796850443\",\n \"0.002175267785787582\",\n \"0.002166431862860918\",\n \"0.003816721495240927\",\n \"0.003316614776849747\",\n \"0.004189540632069111\",\n \"0.004460837226361036\"\n), frameDuration=0.060000" ), 
speakingRate=105.263158, averagePauseDuration=0.630000) "isfinal = " Optional(false) "speechRecognitionMetadata = " nil "bestTranscription = " Optional(<SFTranscription: 0x30291e040>, formattedString=Hear, segments=( "<SFTranscriptionSegment: 0x300368060>, substringRange={0, 4}, timestamp=0, duration=0.011, confidence=0, substring=Hear, alternativeSubstrings=(\n), phoneSequence=, ipaPhoneSequence=, voiceAnalytics=(null)" ), speakingRate=0.000000, averagePauseDuration=0.000000) "isfinal = " Optional(false) "speechRecognitionMetadata = " nil "bestTranscription = " Optional(<SFTranscription: 0x302907e10>, formattedString=Hear this, segments=( "<SFTranscriptionSegment: 0x30035efa0>, substringRange={0, 4}, timestamp=0, duration=0.011, confidence=0, substring=Hear, alternativeSubstrings=(\n), phoneSequence=, ipaPhoneSequence=, voiceAnalytics=(null)", "<SFTranscriptionSegment: 0x30035f060>, substringRange={5, 4}, timestamp=0.011, duration=0.011, confidence=0, substring=this, alternativeSubstrings=(\n), phoneSequence=, ipaPhoneSequence=, voiceAnalytics=(null)" ), speakingRate=0.000000, averagePauseDuration=0.000000) "isfinal = " Optional(false) "speechRecognitionMetadata = " Optional(<SFSpeechRecognitionMetadata: 0x302918c00>, speakingRate=142.8571428571429, averagePauseDuration=0.33, speechStartTimestamp=4.56, speechDuration=0.84, voiceAnalytics=<SFVoiceAnalytics: 0x3029197d0>, jitter=<SFAcousticFeature: 0x302768180>, featureValues=( "0.3726694583892822", "0.3740634620189667", "0.2496880143880844", "0.1248437315225601", 0, 0, 0, 0, "0.1249997913837433", "0.2503129839897156", "0.5018765330314636", "0.6251475811004639", "0.748125433921814", "0.7475067973136902" ), frameDuration=0.060000, shimmer=<SFAcousticFeature: 0x302769b20>, featureValues=( "0.2268283237757743", "0.2268283237757743", "0.2268283237757743", "3.480282776022137", "3.253454452246362", "3.253454452246362", "3.253454452246362", 0, 0, "1.418240421350553", "1.418240421350553", "1.418240421350553", 
"2.308212912815846", "0.8899724914652929" ), frameDuration=0.060000, pitch=<SFAcousticFeature: 0x30276a600>, featureValues=( "-0.187894344329834", "0.02166384644806385", "0.05307484418153763", "0.01391985360532999", "0.1473968625068665", "-0.05098824948072433", "-0.03195502236485481", "-0.05199864506721497", "-0.01371250301599503", "0.0918235182762146", "0.02358327619731426", "0.003420562716200948", "-0.04381118342280388", "-0.03799358010292053" ), frameDuration=0.060000, voicing=<SFAcousticFeature: 0x30276a5c0>, featureValues=( "0.1275853961706161", "0.2178767323493958", "0.1581706404685974", "0.1227026283740997", "0.09937180578708649", "0.07455019652843475", "0.01624933443963528", "0.01390732545405626", "0.01574409566819668", "0.01473238132894039", "0.01467863284051418", "0.01151358522474766", "0.03370866924524307", "0.05931136012077332" ), frameDuration=0.060000) "bestTranscription = " Optional(<SFTranscription: 0x302919890>, formattedString=Hear this, segments=( "<SFTranscriptionSegment: 0x30036c120>, substringRange={0, 4}, timestamp=4.56, duration=0.33, confidence=0.994, substring=Hear, alternativeSubstrings=(\n), phoneSequence=h EE r, ipaPhoneSequence=h.\U02c8i.\U027b, voiceAnalytics=<SFVoiceAnalytics: 0x302919620>, jitter=<SFAcousticFeature: 0x302769ca0>, featureValues=(\n \"0.3726694583892822\",\n \"0.3740634620189667\",\n \"0.2496880143880844\",\n \"0.1248437315225601\",\n 0\n), frameDuration=0.060000, shimmer=<SFAcousticFeature: 0x302769d20>, featureValues=(\n \"0.2268283237757743\",\n \"0.2268283237757743\",\n \"0.2268283237757743\",\n \"3.480282776022137\",\n \"3.253454452246362\"\n), frameDuration=0.060000, pitch=<SFAcousticFeature: 0x302769c20>, featureValues=(\n \"-0.187894344329834\",\n \"0.02166384644806385\",\n \"0.05307484418153763\",\n \"0.01391985360532999\",\n \"0.1473968625068665\"\n), frameDuration=0.060000, voicing=<SFAcousticFeature: 0x302769b00>, featureValues=(\n \"0.1275853961706161\",\n \"0.2178767323493958\",\n 
\"0.1581706404685974\",\n \"0.1227026283740997\",\n \"0.09937180578708649\"\n), frameDuration=0.060000", "<SFTranscriptionSegment: 0x30036c060>, substringRange={5, 4}, timestamp=4.89, duration=0.51, confidence=0.992, substring=this, alternativeSubstrings=(\n), phoneSequence=dh IH s, ipaPhoneSequence=\U00f0.\U02c8\U026a.s, voiceAnalytics=<SFVoiceAnalytics: 0x302919710>, jitter=<SFAcousticFeature: 0x302769b40>, featureValues=(\n 0,\n 0,\n 0,\n \"0.1249997913837433\",\n \"0.2503129839897156\",\n \"0.5018765330314636\",\n \"0.6251475811004639\",\n \"0.748125433921814\",\n \"0.7475067973136902\"\n), frameDuration=0.060000, shimmer=<SFAcousticFeature: 0x302769a00>, featureValues=(\n \"3.253454452246362\",\n \"3.253454452246362\",\n 0,\n 0,\n \"1.418240421350553\",\n \"1.418240421350553\",\n \"1.418240421350553\",\n \"2.308212912815846\",\n \"0.8899724914652929\"\n), frameDuration=0.060000, pitch=<SFAcousticFeature: 0x302769fe0>, featureValues=(\n \"-0.05098824948072433\",\n \"-0.03195502236485481\",\n \"-0.05199864506721497\",\n \"-0.01371250301599503\",\n \"0.0918235182762146\",\n \"0.02358327619731426\",\n \"0.003420562716200948\",\n \"-0.04381118342280388\",\n \"-0.03799358010292053\"\n), frameDuration=0.060000, voicing=<SFAcousticFeature: 0x30276a0e0>, featureValues=(\n \"0.07455019652843475\",\n \"0.01624933443963528\",\n \"0.01390732545405626\",\n \"0.01574409566819668\",\n \"0.01473238132894039\",\n \"0.01467863284051418\",\n \"0.01151358522474766\",\n \"0.03370866924524307\",\n \"0.05931136012077332\"\n), frameDuration=0.060000" ), speakingRate=142.857143, averagePauseDuration=0.330000) "isfinal = " Optional(true) "speechRecognitionMetadata = " nil "bestTranscription = " Optional(<SFTranscription: 0x3029f4fc0>, formattedString=, segments=( "<SFTranscriptionSegment: 0x300361920>, substringRange={0, 0}, timestamp=0, duration=0, confidence=0, substring=, alternativeSubstrings=(\n), phoneSequence=, ipaPhoneSequence=, voiceAnalytics=(null)" ), speakingRate=0.000000, 
averagePauseDuration=0.000000) Received an error while accessing com.apple.speech.localspeechrecognition service: Error Domain=kAFAssistantErrorDomain Code=1101 "(null)"
Btw, more folks are reporting this issue over at https://developer.apple.com/forums/thread/762952?login=true&page=1#804516022. I've been updating it as well.
Thanks, that's really helpful @righteoustales. Looking at the `bestTranscription` provides more information. I now have this:
```swift
let hasMetadata = result.speechRecognitionMetadata != nil
let currentT = result.bestTranscription.segments.first?.timestamp ?? -1
let bestT = result.bestTranscription.formattedString
print("\(currentT), \(hasMetadata), \(bestT)")
```
And here's the output:
```
0.0, false, Can
0.0, false, Can you
1.95, true, Can you
0.0, false, Hear
0.0, false, Hear this
6.029999999999999, true, Hear this
0.0, false,
```
So I think the algorithm is that non-null metadata, or equivalently a non-zero first-segment timestamp, marks the reset point. The transcription at that point should be kept, and new incoming transcriptions after that point should be appended to it.
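A minimal sketch of that algorithm, in Python purely for illustration (the plugin's actual implementation is Swift; `TranscriptStitcher` and the reduced callback shape are my assumptions, not plugin API). Each recognition callback is reduced to the recognized text plus whether the result carried non-nil `speechRecognitionMetadata`:

```python
class TranscriptStitcher:
    """Bank each transcription at a reset point (metadata present) and
    report the banked text plus the live partial, so nothing is lost."""

    def __init__(self):
        self.kept = []     # transcriptions confirmed at reset points
        self.current = ""  # latest partial of the in-progress utterance

    def on_result(self, text: str, has_metadata: bool) -> str:
        if has_metadata:
            # Reset point: iOS is about to discard everything, so keep it.
            self.kept.append(text)
            self.current = ""
        else:
            self.current = text
        parts = self.kept + ([self.current] if self.current else [])
        return " ".join(parts)

# Replaying the logged "Can you ... hear this" sequence from above:
s = TranscriptStitcher()
for text, meta in [("Can", False), ("Can you", False), ("Can you", True),
                   ("Hear", False), ("Hear this", False), ("Hear this", True)]:
    combined = s.on_result(text, meta)
print(combined)  # "Can you Hear this"
```

Replayed against the logged sequence, the caller sees "Can you Hear this" instead of losing "Can you" at the reset.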
I have a working version using this algorithm; two issues came up as I was testing it. The first is that I'm currently using auto punctuation: to make this work it would have to be turned off, otherwise recognition adds punctuation after the first section, which ends up making the whole utterance "Can you? Hear this", and that probably isn't what's desired. Changing the punctuation setting would be a change from current behaviour; ideally I would make it user selectable. You can manually add punctuation by saying "question mark" etc.

The second issue is capitalization. Even with auto punctuation off, the beginning of the second transcription ("Hear this") is being capitalized. It still seems better than losing the first transcription completely, but it's not ideal. So with auto punctuation off I can now get the somewhat correct output "Can you Hear this" even with a lengthy pause. If you don't pause you get the expected phrase.
My bad, it turns out autoPunctuation is already controllable through the API. So this works best with it off.
Also, currently I'm adding a space between subsequent transcriptions. This does make words separate at least, but in the use case that started this thread the result would be "1+2+3+4 +5", which is still better than just "+5" but sub-optimal. The real fix is for Apple to start doing this correctly again, but until then I guess trade-offs are the order of the day.
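The spacing trade-off is easy to see in a tiny sketch (Python for illustration only; the space-join policy is as described above, and the variable names are invented):

```python
# After a mid-utterance reset, the banked text and the fresh partial are
# joined with a space; there is no way to know the user meant no space.
banked = ["add 1+2+3+4"]
live = "+5"
print(" ".join(banked + [live]))  # "add 1+2+3+4 +5" rather than "add 1+2+3+4+5"
```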
Version 7.1.0-beta.1 is available now on pub.dev if anyone wants to try it. Suggestions for improvement would be gratefully received. Thanks for all the help so far!
One possible change I've considered would be to hide this behaviour behind a feature flag but the iOS bug seems severe enough that I don't think anyone would want the original behaviour. I guess there is some possibility that this isn't a bug but that Apple has decided this is better behaviour? In that case the mitigation should be optional. Thoughts?
I suggest a flag to disable it, but keep the mitigation as the default behaviour.
I will test the beta later today, thank you for putting it together
One question is around backwards compatibility - will this change the results on older OS versions?
It should not change the result on previous versions, at least I hope not. If the behaviour of the metadata has changed significantly then I suppose it might.
I built my code using your beta. It is working exactly as you described above and, for my specific usage which is pretty domain-specific, it seems to solve the lost words issue.
I do considerable post-processing of the text returned to ensure valid grammar which is a mix of my app domain overall and also a state machine that does further validation depending upon what has been spoken already. So, neither the lack of punctuation nor the inconsistent capitalization matters in my case. But losing words was a non-starter.
That said, for the more general cases (like dictation of thoughts, where pausing is common), I think what you have done is about as good as it can get without Apple fixing their bug (hopefully they concur that it is a bug). On that, if you look at the issue @flutterocks shared above (https://forums.developer.apple.com/forums/thread/762952), there is evidence that Apple is taking notice, and also a request for others to file the bug via their Feedback website to increase attention. I'd recommend we all do that.
Also, @sowens-csd, thank you for your prompt attention to this issue. Very appreciated.
I just tried it on iOS 16.7.4 and it seems fine. So I think it is fairly backward compatible. If anyone has a chance to test it on other versions that would be helpful.
I've loaded in the beta version but it seems that the experience is the same broken one, there might be some caching going on so I'll test around some more.
Edit: still experiencing the broken behaviour
> I've loaded in the beta version but it seems that the experience is the same broken one, there might be some caching going on so I'll test around some more.
>
> Edit: still experiencing the broken behaviour
Interesting. I'm not seeing that. In my testing I haven't had any dropped words at all so far. I'm wondering what is different.

UPDATE: does `flutter pub deps` show the correct library version included, as specified in your pubspec.yaml?
Thank you for addressing this issue. I tested with version 7.1.0-beta.1 on iOS 17.6.1 and iOS 18.0, and I did not experience any word drops. It worked well in my environment.
@sowens-csd This popped up today on Stack Overflow. Sharing in case it is useful for comparison.
Context: Flutter 3.16.2 on iOS (iPhone 12 running 17.6.1) using speech_to_text (6.6.2).

With a listen call set with options as follows:

```dart
SpeechListenOptions options = SpeechListenOptions(
  listenMode: ListenMode.dictation,
  partialResults: true,
  onDevice: true,
);
await _speechToText.listen(onResult: _onSpeechResult, listenOptions: options);
```
I am seeing the buffer of words returned via `void _onSpeechResult(SpeechRecognitionResult result)` get reset (all words deleted) before the listen times out. This happens if there is a short pause between words spoken, not a long pause at all, maybe 2 seconds at most.
For example, if I speak "add 1+2+3+4 (brief pause)+5", the words returned up until the pause are "add 1+2+3+4", but after the pause the SpeechRecognitionResult is reset and returns "+5" only.
The listen is active throughout this (i.e. it didn't stop). I check `result.isFinal` and it is set to 'false' for each callback above as well.
Is this normal? Any idea how to prevent it or if preventing it isn't possible how to recognize when it is occurring so I can code around it?
Thanks in advance. -Gerald