Closed: emcodem closed this issue 1 year ago.
A few tests with the workaround mentioned are done now. It looks like we still get repeated text output, but it is a lot better than before: e.g. the issue no longer persists for a full hour but only for about a minute, then it catches up with the content. I feel we would need to re-initialize or flush with zero data instead of just seeking. Or, of course, we find out what is actually causing the bug, but I have no idea yet how to do that.
I'm gonna leave this here because it's relevant but I don't know how this fix would translate into this Windows port, as I don't see any way to edit the parameters.
For my use case (transcribing from a microphone with the C# NuGet package), setting the max-context parameter to 0 worked for me to stop text from repeating. In the example code for C#, max-context is set to -1 by default.
Relevant Posts that mention this solution: https://github.com/ggerganov/whisper.cpp/issues/508 https://github.com/ggerganov/whisper.cpp/discussions/408#discussioncomment-4780401
Thanks @VRCWizard, that opened a lot of insights for me. I believe setting the max-context (-mc) option to 0 is the equivalent of the condition_on_previous_text option of the Python implementation, which they propose as a solution (the issue @GrahamboJangles mentioned above). First testing shows good results, and I believe I can see that when -mc is 0, instead of getting a loop, I get the warning message "failed to generate timestamp token - skipping one second" at the spot where it normally would start repeating the output.
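For anyone who wants to try that, this is roughly how it looks on the command line (the file paths are placeholders; the -f/-l/-m flags are the same ones as in my example command further down in this issue): main.exe -f C:\temp\test.wav -l de -m C:\temp\whisper\ggml-large.bin -mc 0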
Also, I played with adopting this into my workaround: I leave -mc at the default in order to keep the improved detection quality, but when I detect repeated output, I clear the history instead of seeking:
// compare the freshly decoded text against the previous segment's text
if (0 == strncmp(lasttext->c_str(), text.c_str(), text.length())) {
    logDebug(u8"last text repeated, clearing token/prompt past");
    // drop the accumulated prompt history so the decoder starts fresh
    prompt_past.clear();
}
This also seems to work and it feels much better than seeking.
EDIT: the following sentences are superseded by new info, which I posted later. I am far from understanding everything in depth, but I have a feeling that the program should take more care of this prompt_past. E.g. maybe the past should also be cleared if nothing was decoded for a certain time period (sketched below). Is it really helpful to feed the past of 30 seconds ago? Also, what about music and silence: is it really helpful to keep a past when there was a clear break in conversation/speech?
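To illustrate the silence idea, here is a minimal sketch; this is not project code, the names last_decode_ms, silence_gap_ms and onSegmentDecoded are made up, and I use plain int for token ids:

#include <cstdint>
#include <string>
#include <vector>

// hypothetical: clear the prompt history when nothing was decoded for a while
static int64_t last_decode_ms = 0;           // end time (ms) of the last non-empty segment
static const int64_t silence_gap_ms = 30000; // assumed threshold: 30 seconds without speech

void onSegmentDecoded(int64_t t_ms, const std::string& text, std::vector<int>& prompt_past) {
    if (text.empty())
        return; // silence/music: nothing decoded, keep waiting
    if (t_ms - last_decode_ms > silence_gap_ms)
        prompt_past.clear(); // the old context is probably unrelated after a long break
    last_decode_ms = t_ms;
}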
Also, if I see correctly, the default setting for this is really high:
ContextImpl.misc.cpp:
rdi->n_max_text_ctx = 16384;
I don't know what exactly that is, tokens or characters or something?
Anyway, this explains why the issue does not happen for me when I start feeding from 1 minute before the spot of interest, but it does happen when feeding from 2 minutes before. As far as I understand, this "context" or "past" helps the model get better results, but is it really helpful or foreseen that the past is that long?
Just for reference, here is the original description of the python implementation on the subject:
condition_on_previous_text: bool
if True, the previous output of the model is provided as a prompt for the next window;
disabling may make the text inconsistent across windows, but the model becomes less prone to
getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
This sounds like the stuff is only helpful within one sentence, not within the past 10 minutes...
It seems like there's an attempt to fix this repetition issue in the official OpenAI-whisper: https://github.com/openai/whisper/pull/1052
Just adding some relevant information. I just found that OpenAI says "The model will only consider the final 224 tokens of the prompt and ignore anything earlier." https://platform.openai.com/docs/guides/speech-to-text/longer-inputs That supports my assumption that the default prompt size of 16384 that I mention above might be a bit too high. Also, OpenAI provides the option to submit a prompt in text form, which I find very interesting. I ask myself whether they keep the same prompt over the whole session, or whether they have the same strategy as the Const-me version and just replace the prompt with the most recent tokens from the currently running transcription.
EDIT: I just saw that we basically never use the 16384 default prompt size, because the decision in run_full_impl is "take the lower number, either the -mc user input or the model's n_text_ctx/2", which is 224. If I am correct, it would be worth mentioning in the command-line parameters that the default AND maximum value of -mc is 224.
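A minimal sketch of that clamping decision as I understand it (the function and variable names here are mine, not from the code; Whisper's n_text_ctx is 448, so half of it is 224):

#include <algorithm>
#include <vector>

// sketch: how many past tokens actually end up in the prompt
int promptTokensToTake(int n_max_text_ctx,   // the -mc user input, default 16384
                       int n_text_ctx,       // the model's text context, 448
                       const std::vector<int>& prompt_past) {
    // whichever is smaller wins, so any -mc above n_text_ctx / 2 (= 224) has no effect
    const int n_take = std::min(n_max_text_ctx, n_text_ctx / 2);
    return std::min(n_take, (int)prompt_past.size());
}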
On another topic: I cannot yet 100% confirm that clearing the prompt past always helps when the output starts repeating; I hope I'll be able to confirm that soon.
As most readers here might not be able to add the changes I mentioned above to the code for testing, I thought it might be a good idea to share what I am working with. The result is not perfect, e.g. I still get repeated text, but at least it does not repeat forever. Also, it can only work around "repeated" text, not other prompt-related issues like the model continuing to output a count 11, 12, 13... forever because the human voice counted from 1 to 10. So the base issue stays, of course; it is just a pill for the pain, it doesn't heal the wound.
Anyone who feels adventurous, please try this main.exe; I use it in production. It helps additionally when you set -mc 223 as I mention above, which seems to be what OpenAI does too. Also, as mentioned above, feeding 48 kHz wav files instead of 16 kHz ones (only the original whisper.cpp is limited to 16 kHz) seems to help too.
Whisper_last_text_repeated_workaround.zip
Source code changes: 0001-test-for-detect-repeated-text-and-reset-prompt-past.patch
-mc 223 or 224?
Tried it out, and it does catch repeats and refreshes; definitely a good fix. I was able to do a 3+ hour video without it getting permanently stuck! It also helps with repetition in live transcribing.
Yeah, I also tested it, and it caught a bunch of repeats and restarted without issues.
I'm not sure how different it is from just running -mc 0, as it's really hard to compare with the really long examples I used to test with. Certainly, though, it spits out a message about repeated lines and does some cleanup.
Either way it's a huge improvement over the base version as it stands, so thank you for making it.
@eagleftw023
-mc 223 or 224?
Honestly, very good question. I wrote 223 on purpose because in my last test it made a difference compared to 224, so I thought we must start counting at 0, so 223 is actually 224 (pretty sure I'm wrong here). But I also wrote above that I believe to have found out that 224 is the maximum anyway, no matter if you put in a higher number. Also, there is the question, as albino1 says, whether -mc > 0 makes any difference at all, especially for content that cannot easily be put into context, like a movie (compared to news or a discussion; in movies anything can be spoken at any time, so what kind of context could help to get it right)...
Are you still getting runFullImpl: failed to generate timestamp token - skipping one second?
Yup, I didn't really look into that part yet; it's a minor issue for me.
How do you transcribe again the parts that were skipped when runFullImpl: failed to generate timestamp token - skipping one second occurs, or when it catches repeats?
I don't; I'm happy to get 99% output of a 3 hour clip :D In my case, a human corrects the output afterwards anyway; the timecodes and lots of words are garbage in any case.
Just adding another puzzle piece from the whisper.cpp project, but this only applies to the last 500 samples, so not very helpful I fear (except for streaming mode maybe?):
// if there is a very short audio segment left to process, we remove any past prompt since it tends
// to confuse the decoder and often make it repeat or hallucinate stuff
if (seek > seek_start && seek + 500 >= seek_end) {
    prompt_past.clear();
}
Do you update this in the source code? If yes, can you upload the updated version like the previous one you uploaded?
Update: the patch above works, but we can do much better. This updated version only deletes a single token from the prompt past instead of the whole prompt, which can potentially keep accuracy up. Also, I strongly believe we must check whether we are comparing "empty" text, because the check fired far too often. I am not sure whether it is foreseen that the text is empty, but that is what I saw in production pretty often. Further thoughts: maybe we should also check whether the lengths of the texts match, because currently the check can also hit if only a substring matches (see the sketch after the code below).
Let's just hope we don't get into a repeating "empty" text output loop, because that one would not be fixed anymore.
if (!prompt_past.empty()) {
    // only compare non-empty text; empty segments made the check fire far too often
    if (lasttext->length() > 0 && 0 == strncmp(lasttext->c_str(), text.c_str(), text.length())) {
        logDebug(u8"Last text repeated ('%s'), clearing token/prompt past. Prompt size before: %d", lasttext->c_str(), prompt_past.size());
        // drop a single token instead of the whole history, to keep some context
        prompt_past.erase(prompt_past.begin());
        logDebug(u8"Size after: %d", prompt_past.size());
    }
}
delete lasttext;
lasttext = new std::string(text);
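Regarding the substring concern: strncmp with text.length() also fires when the new text is merely a prefix of the last text. A hedged sketch of an exact comparison instead (assuming lasttext were kept as a plain std::string; isExactRepeat is a made-up helper name):

#include <string>

// hypothetical helper: true only when the new segment exactly repeats the last one
static bool isExactRepeat(const std::string& last, const std::string& current) {
    // std::string operator== compares length first, then content,
    // so a mere prefix/substring match no longer counts as a repeat
    return !current.empty() && last == current;
}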
And @eagleftw023, sorry for not replying to your request, but I guess it wouldn't really make sense to publish a custom build with the lines I posted above, because it only helps in the last x milliseconds. (Hmm, maybe it helps in live mode, but I'm not sure about that.)
Thanks a lot, can you share the new patch with an exe?
Nah, sorry, I just realized this code was totally garbage; it needs to remove much more from the history. I am currently testing a more sophisticated method and will post here when it works.
From my current understanding, the whole problem comes from the whisper model giving back the same text twice for some reason. This is not a problem as such, but our program collects all the output and feeds it as the prompt for the next text generation. The prompt, from the model's perspective, can be read as "these words and this lowercase/punctuation style are important for you".
Basically it is no problem if the last text is used as "prompt" for the model, but in case the same words/sentences appear repeatedly in the prompt, we kind of force the model to output only this.
So as far as I understand, it is very similar to this: https://www.youtube.com/watch?v=wpSr-ZOLYGA
Anyway, I'll update here as soon as I have a working update.
I gave up on Const-me and base whisper.cpp when I found this whisper fork which seems to work with minimal repetition issues:
https://github.com/Purfview/whisper-standalone-win/
Maybe it's not as accurate, I don't know, but it's CUDA accelerated, so it's still very fast, and it works without having to spend ages cleaning it up afterward because of hallucinations.
Can we directly run the exe file to achieve the purpose of modifying this parameter?
@zlbbme Sorry, not sure if I understand the question. The .zip file download I provided contains my modification; there is no parameter to turn this on or off...
OK, so here is my latest attempt.
Download Whisper Desktop and main.exe for users: WhisperCM_prevent-and-remove-duplicated-tokens-from-prompt.zip
Instead of just detecting repeated text, I now always prevent repeated content from being put into the prompt, and IF something repeats AND it is already contained in the prompt, I remove the sentence from the prompt. From my current experiments, it should be working. The benefit over the strategy above is that we delete from the prompt history much less frequently and only when really needed.
Developer note: I also prevent pushing any non-text tokens into the prompt; I am not sure whether that leads to the model not outputting any timecode tokens at all anymore. Also, to logDebug (STDERR), I print the prompt for the next iteration in the form token_text(token_id)token_text(token_id). If you see a space anywhere, it is actually part of the token. This has proven very helpful for debugging problems and finding out whether something is a prompt-related problem or not.
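To make the push-side idea concrete, here is a minimal sketch; this is not the actual patch, and the Token alias, isTextToken and pushToPrompt are assumptions made for illustration:

#include <algorithm>
#include <vector>

using Token = int; // token id; the real code uses whisper's own token type

// hypothetical: append freshly decoded tokens to the prompt history, skipping
// non-text tokens and anything the prompt already contains
void pushToPrompt(std::vector<Token>& prompt_past, const std::vector<Token>& decoded,
                  bool (*isTextToken)(Token)) {
    for (Token t : decoded) {
        if (!isTextToken(t))
            continue; // drop timestamp/special tokens, they don't belong in the prompt
        if (std::find(prompt_past.begin(), prompt_past.end(), t) != prompt_past.end())
            continue; // already in the prompt, don't reinforce it a second time
        prompt_past.push_back(t);
    }
}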
The reason why I post this kind of unfinished work is that it looks like I will not work much with the Const-me version for a longer time and will instead experiment with faster-whisper (which is more for developers; if you want a GUI for all whisper versions, you can e.g. use subtitleedit).
However, it currently looks like I will need to apply the exact same mitigation concepts to any other version of whisper, because it still looks like all versions except mine above suffer from the repeat-forever issue. The reason might be that all projects are mostly waiting for the MAIN whisper project to come up with solutions.
I am sorry for not taking care of the Desktop version users so far; I do not use desktop programs for transcribing, but the changes I made are valid for the CLI and for the Desktop version, so it is no additional effort for me to provide the desktop version as well. Actually, they share the same whisper.dll anyway...
The patch for developers: 0001-prevent-and-remove-duplicated-tokens-from-prompt.patch. Unfortunately, the patch was made starting from another custom version, so there is no way to apply it to the existing source code. But it shows the ideas and how I did it, mainly for documentation purposes.
Last but not least, I am sure this kind of trouble will settle over time. More and more puzzle pieces are coming in month by month, in the form of code commits in the original whisper project, but lots of research work is still to be done. I am sure that over time they will find adequate solutions to prevent forever-repeated output in a much better way than I do currently.
If there is no more activity in this thread, I will close it in a few weeks because it is not relevant for me anymore. There are enough other issues open about the same topic, and I don't think it is really helpful to keep this issue with my beginner's experiences open forever.
https://github.com/Const-me/Whisper/issues/26#issuecomment-1664575228
@Const-me Is this safe to use, emcodem's duplicated-tokens-from-prompt.zip? Just being cautious because I'm a noob.
@softlypink If you are asking for security/virus related reasons: there is no way for anyone to tell whether any compiled exe file is safe to use. Antivirus software attempts to do this, but from my perspective with only very limited success. What A/V software can do for you is detect "known" viruses/behaviour, but in this case you face an exe that maybe 10-20 people are using, so even your antivirus might have a hard time.
If you want to be sure about security, there is no way around reading (and understanding) the whole source code of the original program, then applying the changes I posted above and compiling your own version.
It is only for you to decide which sources of compiled software are OK to use and which are not (some people use Linux for that exact reason).
Closed because this is solved for me in the code I posted above. Final note: all "forever" related issues (might be the same sentence, lowercase, counting up forever) can only be solved by taking care of what is being thrown into the prompt past. This is why -mc 0 always solves the forever-repeat issue, without any doubt. These issues are not "model related", because it is a decision of the inference alone whether and what it uses for prompting the next decoding. There might be some people out there who understand the "importance" of each prompted token the model uses, and it could be solved by playing with these importance values. However, I don't dive that deep into the decoding process, as the problem is easy enough to solve by just detecting repeated output and deleting the prompt in that case. The same concept is used e.g. in faster-whisper; look at the "compression threshold". The problem with that, however, is that it only clears the prompt when the currently decoded output is too repetitive (instead of the current plus the last decoded output).
Usually when we face the forever-repeat issue, the output is e.g. "This This This".
In this case, we also have a prompt past like "This This This", so the "model" thinks that "This" is very important for us, because we have it a hundred times in the prompt past. Obviously the confidence of the prompted tokens is much higher than the confidence of any "really detected" tokens, so we get into a loop. But again, the inference decided to throw "This" 100 times into the prompt past; the model is not guilty of that.
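For reference, a sketch of such a compression check; openai-whisper and faster-whisper compute this with zlib and treat a ratio above 2.4 as "too repetitive", but the function names here are mine:

#include <string>
#include <vector>
#include <zlib.h>

// repetitive text compresses extremely well, so a high ratio signals a loop
static double compressionRatio(const std::string& text) {
    uLongf destLen = compressBound(text.size());
    std::vector<Bytef> dest(destLen);
    if (Z_OK != compress(dest.data(), &destLen, (const Bytef*)text.data(), text.size()))
        return 0.0;
    return (double)text.size() / (double)destLen;
}

// if the decoded window is too repetitive, drop the prompt for the next window
static bool tooRepetitive(const std::string& text) {
    return compressionRatio(text) > 2.4;
}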
May I ask about the function of main.exe? If I want to solve the duplication problem, do I just need to download it and overwrite the original file, without running main.exe?
Also mentioned here: https://github.com/Const-me/Whisper/issues/23
I ran a few tests on longer video clips (e.g. 2 hours), and mostly it tends to repeat the same sentence from a certain point until the end. E.g. after one hour, you see repeated output forever. The timestamps seem to indicate that new text was detected, but the text content is the same as before.
https://1drv.ms/u/s!AkS-A9Jqq09FgzEX78lvh7SiMAYu?e=C6f8GW
In this example, after about 2 minutes, the sentence repeats: "Jagt mich mal mit frahmen nudelholz". The exact command that I use: C:\dev\whisper\Whisper\x64\Release\main.exe -f C:\temp\test.wav -l de -m C:\temp\whisper\ggml-large.bin
I did different tests and cut portions of an affected file, and it turns out the issue is not caused by the audio content itself: the affected area will translate just fine if I e.g. cut away everything up to 1 minute before it, but if I leave 2 minutes before it, the issue happens.
In ContextImpl.cpp, I tried to catch "repeated text" by copying the latest "text" to the heap as soon as it is complete (around line 740) and, before that, comparing whether the text is the same as last time. If yes, seek a little: lasttext = new std::string(text);
This seems to work around the issue (still needs lots of testing), but I am really not sure if this is the correct way to do it. Also, a question: is it correct to do such workarounds there (there is another workaround a few lines above), or should the cause be searched for and fixed somewhere else (and where)? A sketch of the idea is below.
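For illustration, a minimal sketch of that compare-and-seek workaround; this is not the actual ContextImpl.cpp code, onSegmentComplete and seek_delta are made-up names, and I assume seek counts 10 ms units as in whisper.cpp:

#include <string>

// hypothetical sketch: remember the last completed segment text and
// seek forward a little when the new segment repeats it verbatim
static std::string* lasttext = nullptr;

void onSegmentComplete(const std::string& text, int& seek) {
    const int seek_delta = 100; // assumed: skip ahead about 1 second
    if (lasttext != nullptr && *lasttext == text)
        seek += seek_delta; // jump past the spot that triggers the repetition
    delete lasttext;
    lasttext = new std::string(text); // copy the latest text to the heap, as described above
}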