aiaimimi0920 commented 10 months ago

@fire @Ughuuu The current example is a very short repetitive sentence, so it is not possible to test for text loss issues.

When I tried to use this plugin in daily life, I found that text loss problems occurred when I entered long and different sections of text.

like this test wav: Test example 1:

model: small.en.bin
lanuage: English
use_gpu:true

Source text:

As a five-year-old girl ,Lin Qiaozhi was deeply affected by her mother's death.
At age 18, instead of following the traditional path of marriage like the majority of girls, 
she chose to study medicine. 
"Why should we go through so much?  Finally a good husband should be their final goal!" 
her brother complained, thinking of the high tuition fees.
she responded, "I'd rather stay single to study all my life!"

result text：

Test example 2:

model: small.bin
lanuage: Chinese
use_gpu:true

Source text:

骑士的战马的头部经过长戟手正前方，长戟手果断出手，长戟伸向骑士的身侧。
当长戟碰到骑士手臂的时候，一股巨大的力量瞬间通过长期同时作用于它和长戟手。
长棘手的双手如同铁钳一般牢牢攥紧木柄，双脚就像树根一样死死扎在地上。
上一秒，骑士以为自己将要得救。
下一秒，他就被长戟手从马上拖了下来。
骑士感觉他像是在飞，短暂的飞翔之后，他重重摔在松软的草地上。

result text：

test_wav.zip

Based on the above example, it can be seen that there is a problem of missing text, which may be caused by my PR?

I will see how to solve this problem. If you have any good suggestions, please let me know

Ughuuu commented 10 months ago

Interesting. I am trying locally too to see if I can find anything relevant. In my short tests it worked ok.

aiaimimi0920 commented 10 months ago

You can try "english_test.wav" in the zip file I attached, where "english_test.txt" is the original text

Ughuuu commented 10 months ago

I'm thinking first I'll expose all params to scripting, and then I'll see what generates the issue. Could the the duration_ms that is set by default to 5000 ms(5 seconds)

Ughuuu commented 10 months ago

Or maybe problem is it doesnt detect end of sentence correctly.

This part could also drop text, and maybe we should make it configurable:

if (token.p > 0.6 && token.plog < -0.5) {
                        continue;
                    }

Ughuuu commented 10 months ago

The problem occurs in gdscript after all, here:

var cur_text = transcribed_msg["text"]
var token_index = cur_text.rfind("]")
if token_index!=-1:
    cur_text = cur_text.substr(token_index+1)

With this example:

[{ "is_partial": false, "text": " As a 5-year-old girl, Linozu was deeply affected by her mother\'s death.[_TT_388] At age 18, instead of following the traditional path of marriage like the majority of girls," }]

What happens here, it looks for a ] and it finds it, but since this sentence contains two sentences for some reason, it will ignore the first part. Instead, it should just try to remove from the text the special character, and not skip anything. Testing locally to see if it fixes it.

aiaimimi0920 commented 10 months ago

The purpose of writing this code at that time was because I thought every "transferred_msg" contains only one sentence, and I would like to remove the markers at the beginning and end.

So will there be a phenomenon where a sentence contains multiple identifiers?

Ughuuu commented 10 months ago

Yes, of course, I understand why the code is like this. The phenomenon seems to be with the character [_TT_388]. But nevertheless, we don't really care about any such characters, so I'm thinking to just remove all things that start with [ and end with ], possibly with a regex instead that we update by time(eg. we can add to that regex also < >)

aiaimimi0920 commented 10 months ago

this code use the [[_TT_xxx]: https://github.com/ggerganov/whisper.cpp/blob/6ebba525f1cc9393752906023a3385a2cc8062ed/whisper.cpp#L1261

Could not find the introduction document, but it appears to be marking the beginning of a sentence?
Are there any of these tags that we may need to use?

Ughuuu commented 10 months ago

I doubt it, we don't need these tags for anything. But anyway, am only removing them in gdscript, where we process the text, not in the server.

Also, another issue I found(maybe this is the one that happens, though I had multiple things happen) is:

eg you have 4 transcribed messages:

[{ "is_partial": true, "text": " As a-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of." }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death at age 18" }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of marriage like the majority of" }]
[{ "is_partial": false, "text": " As a 5-year-old girl,ozu was deeply affected by her mother\'s death." }]

If the first one, it transcribed a large sentence, but it doesn't know it's 2 sentences. Then, it doesn't recognise the second sentence.(still partial though) Then, it recognises it again, then it ends the first sentence, but it doesn't continue the second one at all.

aiaimimi0920 commented 10 months ago

I have changed the method, but there are still issues with missing text,

The problem you have discovered may only be a partial cause of this problem.

Attached is my modified code:

        var regex_a = RegEx.new()
        regex_a.compile("\\[[^\\[\\]]*\\]")
        cur_text = regex_a.sub(cur_text,"",true)

        var regex_b = RegEx.new()
        regex_b.compile("\\<[^\\<\\>]*\\>")
        cur_text = regex_b.sub(cur_text,"",true)

aiaimimi0920 commented 10 months ago

我对此表示怀疑，我们不需要这些标签做任何事情。但无论如何，我只在 gdscript 中删除它们，我们在那里处理文本，而不是在服务器中。

另外，我发现的另一个问题（也许这就是发生的问题，尽管我发生了很多事情）是：

例如，您有 4 条转录的消息：
[{ "is_partial": true, "text": " As a-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of." }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death at age 18" }]
[{ "is_partial": true, "text": " As a 5-year-old girl,ju was deeply affected by her mother\'s death. At age 18, he started offering the traditional of marriage like the majority of" }]
[{ "is_partial": false, "text": " As a 5-year-old girl,ozu was deeply affected by her mother\'s death." }]
如果是第一个，它转录了一个大句子，但它不知道它是 2 个句子。然后，它无法识别第二句话。（虽然仍然是部分的）然后，它再次识别它，然后它结束了第一句话，但它根本没有继续第二句话。

You have explained it very clearly, and this should be the cause of the problem. We need to find a way to solve it

Ughuuu commented 10 months ago

Yup. I am still investigating it. I am guessing it's probably related to whisper_params.single_segment or probably something that makes it process just one sentence.

Ughuuu commented 10 months ago

I think it's related to the duration parameter after all. If I set it to something higher, like 20s, it just works. So I think the duration parameter shouldn't be hardcoded, but instead it should be calculated by the amount of frames it has and run on that.

Ughuuu commented 10 months ago

Ok, I think I have a fix, testing it and will put it on my branch and merge it if all is good.

Ughuuu commented 10 months ago

Put all on this branch. https://github.com/V-Sekai/godot-whisper/pull/39 Must also say, it works, but sometimes it still might cut some things, but less than before maybe. But now also there is interface to set parameters from UI, so that's nice. I would say, if you try to make a bigger break when sentence ends, it will have a higher success rate.

Ughuuu commented 10 months ago

After it builds on main branch u can try again, or u can try building locally. Merged the change.

aiaimimi0920 commented 10 months ago

I found that it may be related to pcmf32. size()>n_samples_iter_threshold, I will try to fix it

Ughuuu commented 10 months ago

Are you sure it's related? What do you think the issue is related to it? Should we make it configurable on node properties and test them?

Also, I was thinking some or all properties should be instead project level settings and not node level settings, as it doesn't make sense to expose all settings to 1 node, and there is just 1 singleton anyway.

aiaimimi0920 commented 10 months ago

@Ughuuu #40 Indeed, it is very likely that the problem is caused by the automatic truncation triggered after pcmf32.size()>n_samples_iter_threshold.

I have made some modifications to this logic.

All text can now be recognized, and there will be no missing text.

I enabled timestamps for this logic. I don’t know if it will have a big impact on performance.

aiaimimi0920 commented 10 months ago

The modification logic is when it is found that pcmf32. size()>n_samples_iter_threshold , PCMF32 is not directly cleared, but all data corresponding to the first segment in PCMF32 is deleted.

and other data is retained for the next inference. After all, the inference time of audio within 30 seconds is the same for Whisper.cpp, so we do not need to delete all data

In most cases, this logic is not a problem, but in some cases, the timestamp returned by Whisper seems to be problematic?

I used token.t0 and token.t1 to obtain timestamps, perhaps there are other ways?

The timestamp issue may be related to the initial silent voice, or it may be related to incomplete implementation of the current timestamp

aiaimimi0920 commented 10 months ago

This is the current test video

https://github.com/V-Sekai/godot-whisper/assets/153103332/a76396b6-c791-4335-9755-26d5bdc217be

Note: I used 1.5.4 wheeler.cpp. It seems that the version upgrade has greatly alleviated the issue of timestamp errors. Perhaps we can consider upgrading the version

Attention: In the current usage process, there is a small probability that the processing time may reach up to 30 seconds. The reason is currently unclear. (Stuck is just a lack of inference results, the application will not be frozen)

Ughuuu commented 10 months ago

Lets upgrade whisper.cpp then.

aiaimimi0920 commented 10 months ago

Lets upgrade whisper.cpp then.

I think it would be more reasonable for you to create the PR for upgrading Whisper.cpp version. I am not sure if you have made any custom modifications to Whistler.CPP before.

But in my testing, simply replacing all files with version 1.5.4 and compiling them can be used normally on Windows

aiaimimi0920 commented 10 months ago

By the way, I posted a issue about timestamp on wheeler.cpp(https://github.com/ggerganov/whisper.cpp/issues/1776) .

If there is any valid response, I will try to continue fixing this issue. Because there is still a probability that too many voice buffers will be discarded due to timestamp errors

Ughuuu commented 10 months ago

@fire can you upgrade whisper cpp

fire commented 10 months ago

I made a pr that did git subrepo pull --branch=master --force thirdparty/whisper.cpp

aiaimimi0920 commented 10 months ago

@fire @Ughuuu I have delved a little deeper into the code of wheeler.cpp and found that although the document mentions single_segment=true is more suitable for stream usage, but if set to false,

it can trigger timestamp updates: https://github.com/V-Sekai/godot-whisper/blob/dc741ff637130dffd038dc72b383c9aba3af8d0b/thirdparty/whisper.cpp/whisper.cpp#L5753C78 -L5753C92

In my actual testing, 9 out of 10 tests were able to achieve near perfect results, with 1 test causing lag for unknown reasons. But there were no issues with incorrect timestamps

42 we need merge it

Ughuuu commented 10 months ago

Awesome. Cant wait to test it out.

aiaimimi0920 commented 10 months ago

with 1 test causing lag for unknown reasons

I think I have found the cause of this problem.

If you click on start_button: Turn on the recording switch. At this time, even if there is no voice content, an array of all zeros will be added every second through "_speech_to_text_singleton.add.audio_buffer (buffer)": buffer: [(0,0), (0,0), (0,0), (0,0)...]

Then the Whisper will generate an illusion when it receives blank speech, and the output will be similar to: msg.text: [BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] you[_TT_103][BEG] y]

The current truncation logic actually has an implicit premise: When pcmf32. size()>n_samples_iter_threshold, delete the size of the first segment, so you can make pcmf32.size() smaller than n_samples_iter_threshold is then used for the next iteration.

This ensures that the duration of pcmf32 is always less than 30 seconds and also ensures multiple iterations of the sentence, improving its accuracy.

But due to hallucinations, the blank speech is generated as [2BEG_] you [TT-103]: At this point, deleting only the length of the first segment of the speech does not significantly reduce the length of pcmf32, which may be greater than 30 seconds. Additionally, each iteration only deletes one segment of the [1BEG_] you [TT_103] corresponding to the pcmf32 array, and the remaining blank speech can only be deleted next time. This is why sometimes iterations are very slow

In our current test cases,

success case：

Click on start_ Button,
Click on AudioStreamPlayer2.play now
You will obtain a relatively satisfactory inference result

fail case

Click on start_ Button,
Wait for a while, such as 5 seconds
Click on AudioStreamPlayer2.play
You will receive an excessively long delay and there is a probability that you will not be able to obtain the result

@fire @Ughuuu This should be the last problem that needs to be solved for this PR, and I will come up with a solution. This is actually closely related to the illusion caused by silent audio

Ughuuu commented 10 months ago

Nice, kudos for identifying the problem so quick

aiaimimi0920 commented 10 months ago

43 @Ughuuu @fire

Basically solved the problem,

But there is also an implicit issue because the timestamp returned by wheeler.cpp is not very accurate in some cases, which can lead to the direct consequence of cutting too much or too little audio when doing audio cutting. There may be issues with duplicate or missing text in the returned text.

But the probability of this problem occurring is not very high, and it should not be a problem in daily use. If you want to deeply fix this problem. So the possible methods are:

Modify the logic of audio truncation
Waiting for Whisper.cpp to provide DTW

fire commented 10 months ago

Is the time stamp of Godot Engine better? What is DTW?

Ughuuu commented 10 months ago

Open a new task/issue. I am now also testing it locally to see what the issue is. The timestamp he is refering to is actually the token timestamp(eg. at what time interval a token appears).

In this example: I go home.

Assume that the voice is saying those words, and then whisper.cpp generates this data:

Token[0]:

text: [BEG] -- this is usually begin token
timestamp_start: 0ms
timestamp_end: 0ms

Token[1]:

text: I
timestamp_start: 0ms
timestamp_end: 100ms

Token[2]:

text: go
timestamp_start:120ms
timestamp_end: 200ms

Token[3]:

text: home
timestamp_start:230ms
timestamp_end: 400ms

Token[4]:

text: [END] -- this is usually end token, can also be [_TTL_400] or smth like that
timestamp_start:400ms
timestamp_end: 400ms

Ughuuu commented 10 months ago

The issue he is refering to is that sometimes the token_timestamp is not correct. But that is a problem in whisper.cpp, nothing we can fix.

aiaimimi0920 commented 10 months ago

yes, This is a specific example: https://github.com/ggerganov/whisper.cpp/issues/1776

aiaimimi0920 commented 10 months ago

@fire Because real-time inference is required, I need to ensure that the content in PCMF32 is always less than 15 seconds, so that you can obtain an inference result within 1-3 seconds.

So it is crucial to delete some audio content through certain methods to keep it within 15 seconds.

Currently, audio arrays are deleted by obtaining the location to be deleted based on the "audio timestamp of a certain segment" * "sampling rate". Then, due to issues with the timestamp returned in the segment, there were too many or too few audio arrays deleted Finally, there will be inference results of text duplication or text loss apply single_section=false and Whisper1.5.4 partially resolved the issue of incorrect timestamps, but did not fully resolve it. If we continue the idea of deleting audio arrays through timestamps, then we need to wait for this PR. https://github.com/ggerganov/whisper.cpp/pull/1485

If there is a better way, please change it

fire commented 10 months ago

Do you want me to try merging in that pr?

aiaimimi0920 commented 10 months ago

no, according to the author's self description, the PR has not been completed yet

V-Sekai / godot-whisper

Missing some text #38

42 we need merge it

success case：

fail case

43 @Ughuuu @fire