aiaimimi0920 closed this issue 10 months ago.
Interesting. I am trying locally too to see if I can find anything relevant. In my short tests it worked ok.
You can try "english_test.wav" in the zip file I attached; "english_test.txt" contains the original text.
I'm thinking first I'll expose all params to scripting, and then I'll see what generates the issue. Could it be the duration_ms that is set by default to 5000 ms (5 seconds)?
Or maybe the problem is that it doesn't detect the end of the sentence correctly.
This part could also drop text, and maybe we should make it configurable:
// Drop tokens matching this heuristic; the 0.6 / -0.5 thresholds are hardcoded.
if (token.p > 0.6 && token.plog < -0.5) {
    continue;
}
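For example (just a sketch of what "configurable" could look like; drop_token_prob and drop_token_plog are made-up names, and I am assuming ctx is the whisper context and i the current segment index), the two hardcoded thresholds could become variables we expose:
// Hypothetical: expose the hardcoded 0.6 / -0.5 thresholds as configurable values.
float drop_token_prob = 0.6f;
float drop_token_plog = -0.5f;
for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
    const whisper_token_data token = whisper_full_get_token_data(ctx, i, j);
    // Same filter as above, but with the thresholds read from configuration.
    if (token.p > drop_token_prob && token.plog < drop_token_plog) {
        continue;
    }
    // ... otherwise keep the token's text ...
}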
The problem occurs in GDScript after all, here:
var cur_text = transcribed_msg["text"]
# Find the last "]" and keep only what comes after it.
var token_index = cur_text.rfind("]")
if token_index != -1:
    cur_text = cur_text.substr(token_index + 1)
With this example:
[{ "is_partial": false, "text": " As a 5-year-old girl, Linozu was deeply affected by her mother\'s death.[_TT_388] At age 18, instead of following the traditional path of marriage like the majority of girls," }]
What happens here is that it looks for a ] and finds it, but since this message contains two sentences for some reason, it discards the first part. Instead, it should just remove the special marker from the text and not skip anything. Testing locally to see if that fixes it.
I wrote this code at the time because I thought every "transcribed_msg" contains only one sentence, and I wanted to remove the markers at the beginning and end.
So can it happen that one message contains multiple markers?
Yes, of course, I understand why the code is like this. The issue seems to be with the token [_TT_388]. But we don't really care about any such markers, so I'm thinking of just removing everything that starts with [ and ends with ], possibly with a regex that we can extend over time (e.g. we can also add < > to that regex).
This code uses the [_TT_xxx] markers: https://github.com/ggerganov/whisper.cpp/blob/6ebba525f1cc9393752906023a3385a2cc8062ed/whisper.cpp#L1261
I doubt it; we don't need these tags for anything. But anyway, I am only removing them in GDScript, where we process the text, not in the server.
Also, another issue I found (maybe this is the one you are hitting, though multiple things happened for me) is:
E.g. you have 4 transcribed messages:
[{ "is_partial": true, "text": " As a-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of." }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death at age 18" }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of marriage like the majority of" }]
[{ "is_partial": false, "text": " As a 5-year-old girl,ozu was deeply affected by her mother\'s death." }]
In the first one, it transcribed a large chunk, but it doesn't know it's 2 sentences. Then it doesn't recognise the second sentence (still partial though). Then it recognises it again, then it ends the first sentence, but it doesn't continue the second one at all.
I have changed the method, but there are still issues with missing text. The problem you have discovered may only be a partial cause of this issue.
Attached is my modified code:
var regex_a = RegEx.new()
# Strip any [...] markers (e.g. [_TT_388]) from the transcribed text.
regex_a.compile("\\[[^\\[\\]]*\\]")
cur_text = regex_a.sub(cur_text, "", true)
var regex_b = RegEx.new()
# Strip any <...> markers as well.
regex_b.compile("\\<[^\\<\\>]*\\>")
cur_text = regex_b.sub(cur_text, "", true)
I doubt it; we don't need these tags for anything. But anyway, I am only removing them in GDScript, where we process the text, not in the server.
Also, another issue I found (maybe this is the one that happens, though multiple things happened for me) is:
E.g. you have 4 transcribed messages:
[{ "is_partial": true, "text": " As a-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of." }] [{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death at age 18" }] [{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of marriage like the majority of" }] [{ "is_partial": false, "text": " As a 5-year-old girl,ozu was deeply affected by her mother\'s death." }]
In the first one, it transcribed a large sentence, but it doesn't know it's 2 sentences. Then it doesn't recognise the second sentence (still partial though). Then it recognises it again, then it ends the first sentence, but it doesn't continue the second one at all.
You have explained it very clearly, and this should be the cause of the problem. We need to find a way to solve it.
Yup. I am still investigating it. I am guessing it's probably related to whisper_params.single_segment or probably something that makes it process just one sentence.
I think it's related to the duration parameter after all. If I set it to something higher, like 20 s, it just works. So I think the duration parameter shouldn't be hardcoded; instead, it should be calculated from the number of frames we have and run on that.
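Roughly what I mean (a sketch only, assuming pcmf32 is the buffer of captured 16 kHz samples):
// Derive duration_ms from the buffered samples instead of hardcoding 5000 ms.
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.duration_ms = int32_t(1000 * pcmf32.size() / WHISPER_SAMPLE_RATE);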
Ok, I think I have a fix, testing it and will put it on my branch and merge it if all is good.
Put it all on this branch: https://github.com/V-Sekai/godot-whisper/pull/39. I must also say it works, but sometimes it still might cut some things, though less than before. But now there is also an interface to set parameters from the UI, so that's nice. I would say that if you make a bigger pause when a sentence ends, it will have a higher success rate.
After it builds on the main branch you can try again, or you can try building locally. Merged the change.
I found that it may be related to pcmf32.size() > n_samples_iter_threshold. I will try to fix it.
Are you sure it's related? What do you think the connection to the issue is? Should we make it configurable through node properties and test it?
Also, I was thinking some or all properties should instead be project-level settings and not node-level settings, as it doesn't make sense to expose all settings on one node, and there is just one singleton anyway.
@Ughuuu #40
Indeed, it is very likely that the problem is caused by the automatic truncation triggered after pcmf32.size() > n_samples_iter_threshold.
I have made some modifications to this logic.
All text can now be recognized, and there will be no missing text.
I enabled timestamps for this logic. I don’t know if it will have a big impact on performance.
The modified logic is: when it is found that pcmf32.size() > n_samples_iter_threshold, pcmf32 is not cleared outright; instead, only the data corresponding to the first segment in pcmf32 is deleted, and the other data is retained for the next inference. After all, for whisper.cpp the inference time for audio under 30 seconds is the same, so we do not need to delete all the data.
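A minimal sketch of that idea (assuming pcmf32 holds the 16 kHz float samples and ctx is the whisper context; the actual code in the PR may differ):
if (pcmf32.size() > n_samples_iter_threshold && whisper_full_n_segments(ctx) > 0) {
    // End time of the first segment, in centiseconds (10 ms units).
    const int64_t t1_cs = whisper_full_get_segment_t1(ctx, 0);
    // Convert to a sample count at 16 kHz and drop only that prefix.
    size_t n_erase = size_t(t1_cs * WHISPER_SAMPLE_RATE / 100);
    if (n_erase > pcmf32.size()) {
        n_erase = pcmf32.size();
    }
    pcmf32.erase(pcmf32.begin(), pcmf32.begin() + n_erase);
}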
In most cases this logic is not a problem, but in some cases the timestamp returned by whisper.cpp seems to be problematic.
I used token.t0 and token.t1 to obtain the timestamps; perhaps there are other ways?
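For reference, this is roughly how token.t0/t1 are obtained through the whisper.cpp API (a sketch; t0/t1 are in 10 ms units and are only meaningful when token_timestamps is enabled):
const int n_segments = whisper_full_n_segments(ctx);
for (int i = 0; i < n_segments; ++i) {
    for (int j = 0; j < whisper_full_n_tokens(ctx, i); ++j) {
        const whisper_token_data token = whisper_full_get_token_data(ctx, i, j);
        // token.t0 / token.t1: start and end of the token in centiseconds.
        printf("%s [%lld - %lld]\n",
               whisper_full_get_token_text(ctx, i, j),
               (long long) token.t0, (long long) token.t1);
    }
}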
The timestamp issue may be related to the initial silence in the audio, or to the incomplete implementation of the current timestamp logic.
This is the current test video
https://github.com/V-Sekai/godot-whisper/assets/153103332/a76396b6-c791-4335-9755-26d5bdc217be
Note: I used whisper.cpp 1.5.4. It seems the version upgrade greatly alleviates the timestamp errors. Perhaps we can consider upgrading the version.
Attention: in current usage there is a small probability that processing may take up to 30 seconds. The reason is currently unclear. (The "stuck" state is just a lack of inference results; the application itself does not freeze.)
Let's upgrade whisper.cpp then.
Let's upgrade whisper.cpp then.
I think it would be more reasonable for you to create the PR for upgrading the whisper.cpp version. I am not sure if you have made any custom modifications to whisper.cpp before.
But in my testing, simply replacing all the files with version 1.5.4 and compiling works normally on Windows.
By the way, I posted an issue about timestamps on whisper.cpp (https://github.com/ggerganov/whisper.cpp/issues/1776).
If there is any useful response, I will try to continue fixing this issue, because there is still a chance that too much of the voice buffer will be discarded due to timestamp errors.
@fire can you upgrade whisper.cpp?
I made a PR that did git subrepo pull --branch=master --force thirdparty/whisper.cpp
@fire @Ughuuu I have dug a little deeper into the whisper.cpp code and found that although the documentation mentions that single_segment=true is more suitable for streaming usage, setting it to false can trigger timestamp updates: https://github.com/V-Sekai/godot-whisper/blob/dc741ff637130dffd038dc72b383c9aba3af8d0b/thirdparty/whisper.cpp/whisper.cpp#L5753C78-L5753C92
In my actual testing, 9 out of 10 runs achieved near-perfect results, with 1 run causing lag for unknown reasons. But there were no issues with incorrect timestamps.
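In terms of whisper_full_params, the change boils down to something like this (a sketch, not the exact diff in the repo):
whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
wparams.single_segment   = false; // allow multiple segments so the timestamps get updated
wparams.token_timestamps = true;  // fill in token.t0 / token.t1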
Awesome. Can't wait to test it out.
with 1 test causing lag for unknown reasons
I think I have found the cause of this problem.
If you click start_button to turn on the recording switch, then even if there is no voice content, an array of all zeros will be added every second through "_speech_to_text_singleton.add_audio_buffer(buffer)": buffer: [(0,0), (0,0), (0,0), (0,0)...]
Then whisper hallucinates when it receives this blank speech, and the output will be similar to: msg.text: [BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] y]
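One way to see the problem (purely a hypothetical sketch, not what is implemented): the all-zero buffer has essentially no energy, so a simple RMS check would already flag it as silence before it reaches whisper.
#include <cmath>
#include <vector>

// Hypothetical helper: true when a captured buffer is (near-)silent,
// such as the all-zero buffers produced while nobody is speaking.
static bool is_silent(const std::vector<float> & pcmf32, float rms_threshold = 1e-4f) {
    if (pcmf32.empty()) {
        return true;
    }
    double energy = 0.0;
    for (const float s : pcmf32) {
        energy += double(s) * double(s);
    }
    const double rms = std::sqrt(energy / double(pcmf32.size()));
    return rms < rms_threshold;
}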
The current truncation logic actually has an implicit premise:
When pcmf32.size() > n_samples_iter_threshold, deleting the first segment's worth of samples makes pcmf32.size() smaller than n_samples_iter_threshold again for the next iteration.
This ensures that the duration of pcmf32 is always less than 30 seconds and also ensures multiple iterations over the sentence, improving its accuracy.
But due to hallucinations, the blank speech is transcribed as [BEG] you[_TT_103]. At this point, deleting only the length of the first segment does not significantly reduce the length of pcmf32, which may stay above 30 seconds. Additionally, each iteration only deletes the portion of the pcmf32 array corresponding to one [BEG] you[_TT_103] segment, and the remaining blank speech can only be deleted next time. This is why iterations are sometimes very slow in our current test cases.
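To put rough numbers on it (assuming a 16 kHz sample rate and a threshold of about 15 seconds):
15 s * 16000 samples/s = 240,000 samples in the buffer
 1 s * 16000 samples/s =  16,000 samples removed per iteration
So if a hallucinated segment covers only about 1 second, one iteration erases only around 16,000 samples, the buffer stays far above the threshold, and it can take many slow iterations of mostly blank audio before the size drops back under the limit.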
@fire @Ughuuu This should be the last problem that needs to be solved for this PR, and I will come up with a solution. This is actually closely related to the hallucination caused by silent audio.
Nice, kudos for identifying the problem so quickly.
I have basically solved the problem. But there is also an implicit issue: the timestamp returned by whisper.cpp is not very accurate in some cases, which directly leads to cutting too much or too little audio, so there may be duplicate or missing text in the returned text.
The probability of this problem occurring is not very high, and it should not be a problem in daily use. If you want to fix this problem thoroughly, the possible methods are:
Is the timestamp from the Godot Engine better? What is DTW?
Open a new task/issue. I am now also testing it locally to see what the issue is. The timestamp he is referring to is actually the token timestamp (e.g. at what time interval a token appears).
In this example: I go home.
Assume the voice is saying those words; whisper.cpp then generates per-token data, Token[0] through Token[4], each with its own timestamps.
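For illustration only (these values are made up, not from an actual run), with token_timestamps enabled the data could look something like this, with t0/t1 in 10 ms units:
Token[0]: "[BEG]"   t0 =   0, t1 =   0   (special marker)
Token[1]: " I"      t0 =   0, t1 =  20   (0.00 s - 0.20 s)
Token[2]: " go"     t0 =  20, t1 =  45   (0.20 s - 0.45 s)
Token[3]: " home"   t0 =  45, t1 =  90   (0.45 s - 0.90 s)
Token[4]: "."       t0 =  90, t1 = 100   (0.90 s - 1.00 s)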
The issue he is referring to is that sometimes the token timestamp is not correct. But that is a problem in whisper.cpp, not something we can fix.
Yes, this is a specific example: https://github.com/ggerganov/whisper.cpp/issues/1776
@fire Because real-time inference is required, I need to ensure that the content in pcmf32 always stays under 15 seconds, so that an inference result can be obtained within 1-3 seconds. So it is crucial to delete some audio content to keep it within 15 seconds.
Currently, audio samples are deleted by computing the cut position as "audio timestamp of a segment" * "sampling rate". Then, due to issues with the timestamps returned for the segments, too many or too few audio samples get deleted, and the final result is text duplication or text loss in the inference output.
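Concretely (a sketch; the real code may differ), the cut position comes from the segment timestamp, which is in 10 ms units, times the 16 kHz sample rate, so every 10 ms of timestamp error moves the cut by 160 samples:
// End of the first segment in centiseconds -> number of samples to drop at 16 kHz.
const int64_t t1_cs   = whisper_full_get_segment_t1(ctx, 0);
const size_t  cut_pos = size_t(t1_cs * WHISPER_SAMPLE_RATE / 100);
// If t1_cs is reported too large, audio of the next sentence is deleted too (missing text);
// if it is too small, already-transcribed audio stays in the buffer and gets transcribed again (duplicate text).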
Applying single_segment=false and whisper.cpp 1.5.4 partially resolved the issue of incorrect timestamps, but did not fully resolve it.
If we continue with the idea of deleting audio samples based on timestamps, then we need to wait for this PR:
https://github.com/ggerganov/whisper.cpp/pull/1485
If there is a better way, please change it.
Do you want me to try merging in that PR?
No, according to the author's own description, the PR has not been completed yet.
@fire @Ughuuu The current example is a very short repetitive sentence, so it is not possible to test for text loss issues.
When I tried to use this plugin in daily life, I found that text loss occurred when the input was long and varied passages of text,
like this test wav: Test example 1:
Test example 2:
test_wav.zip
Based on the above examples, it can be seen that there is a problem of missing text, which may have been caused by my PR.
I will see how to solve this problem. If you have any good suggestions, please let me know.