BingLingGroup / autosub

Command-line utility to transcribe/translate from video/audio/subtitles to subtitles
GNU General Public License v2.0

Google empty result response bug #89

Closed njukiller closed 4 years ago

njukiller commented 4 years ago

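I tried several different video files and they all behave the same way. However, if I use one of the individual FLAC audio files produced by the splitting step, subtitles can be generated.

The example below exits at 2%: Speech-to-Text: 2% |### | ETA: 0:10:19['{"result":[]}', '']

Several other files stop at: Speech-to-Text: 0% |# | ETA: 1:11:02['{"result":[]}', '']

I am using version 0.5.5 on Windows.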

autosub -i M:\movies\1.mp4 -S en-us -y -sc 4
Destination language not provided. Only performing speech recognition.
Speech language is the same as the Destination language. Only performing speech recognition.

Convert source file to "C:\Users\xxxx\AppData\Local\Temp\tmp2d5iiagm.wav" to detect audio regions.
M:\work\autosub\autosub\ffmpeg.exe -hide_banner -y -i "M:\movies\1.mp4" -vn -ac 1 -ar 48000 "C:\Users\xxxx\AppData\Local\Temp\tmp2d5iiagm.wav"
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'M:\movies\1.mp4':
Metadata:
major_brand : isom
minor_version : 1
compatible_brands: isom
creation_time : 2015-08-11T05:50:50.000000Z
encoder : Youzimu Auto Encoding Tool
copyright : © 2015 Youzimu Fansub
Duration: 00:59:04.00, start: 0.000000, bitrate: 2631 kb/s
Stream #0:0(eng): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], 2498 kb/s, 25 fps, 25 tbr, 25 tbn, 50 tbc (default)
Metadata:
creation_time : 2015-08-12T05:50:48.000000Z
encoder : AVC Coding
Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
Metadata:
creation_time : 2015-08-11T05:50:59.000000Z
Stream mapping:
Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'C:\Users\xxxx\AppData\Local\Temp\tmp2d5iiagm.wav':
Metadata:
major_brand : isom
minor_version : 1
compatible_brands: isom
ICOP : © 2015 Youzimu Fansub
ISFT : Lavf58.29.100
Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
Metadata:
creation_time : 2015-08-11T05:50:59.000000Z
encoder : Lavc58.54.100 pcm_s16le
size= 332250kB time=00:59:04.00 bitrate= 768.0kbits/s speed=1.01e+03x
video:0kB audio:332250kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000032%

Use ffprobe to check conversion result.
M:\work\autosub\autosub\ffprobe.exe C:\Users\xxxx\AppData\Local\Temp\tmp2d5iiagm.wav -show_format -pretty -loglevel quiet
[FORMAT]
filename=C:\Users\xxxx\AppData\Local\Temp\tmp2d5iiagm.wav
nb_streams=1
nb_programs=0
format_name=wav
format_long_name=WAV / WAVE (Waveform Audio)
start_time=N/A
duration=0:59:04.000000
size=324.462996 Mibyte
bit_rate=768 Kbit/s
probe_score=99
TAG:copyright=© 2015 Youzimu Fansub
TAG:encoder=Lavf58.29.100
[/FORMAT]

Conversion complete. Use Auditok to detect speech regions.

"C:\Users\xxxx\AppData\Local\Temp\tmp2d5iiagm.wav" has been deleted.

Converting speech regions to short-term fragments. Converting: 100% |####################################################################################################################################################################| Time: 0:00:54

Sending short-term fragments to Google Speech V2 API and getting result. Speech-to-Text: 2% |### | ETA: 0:10:19['{"result":[]}', '']

BingLingGroup commented 4 years ago

Are you in an environment that can reach Google's servers? If not, have you set a proxy for your terminal? If you haven't, you can try setting one with -hsp.

njukiller commented 4 years ago

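I am outside the firewall, and connecting to Google is not a problem. I also tested the same file with gcsv1, and it fails with this error: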

Receive something unexpected: {} Error: Speech-to-text failed.

autosub -i .\1.mp4 -S en-US -sapi gcsv1 -skey
Destination language not provided. Only performing speech recognition.
Speech language is the same as the Destination language. Only performing speech recognition.

Convert source file to "C:\Users\xxxx\AppData\Local\Temp\tmph4x527m3.wav" to detect audio regions.
M:\work\autosub\autosub\ffmpeg.exe -hide_banner -y -i ".\1.mp4" -vn -ac 1 -ar 48000 "C:\Users\xxxx\AppData\Local\Temp\tmph4x527m3.wav"
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '.\1.mp4':
Metadata:
major_brand : isom
minor_version : 1
compatible_brands: isom
creation_time : 2015-08-11T05:50:50.000000Z
encoder : Youzimu Auto Encoding Tool
copyright : © 2015 Youzimu Fansub
Duration: 00:59:04.00, start: 0.000000, bitrate: 2631 kb/s
Stream #0:0(eng): Video: h264 (Main) (avc1 / 0x31637661), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], 2498 kb/s, 25 fps, 25 tbr, 25 tbn, 50 tbc (default)
Metadata:
creation_time : 2015-08-12T05:50:48.000000Z
encoder : AVC Coding
Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s (default)
Metadata:
creation_time : 2015-08-11T05:50:59.000000Z
Stream mapping:
Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'C:\Users\xxxx\AppData\Local\Temp\tmph4x527m3.wav':
Metadata:
major_brand : isom
minor_version : 1
compatible_brands: isom
ICOP : © 2015 Youzimu Fansub
ISFT : Lavf58.29.100
Stream #0:0(eng): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
Metadata:
creation_time : 2015-08-11T05:50:59.000000Z
encoder : Lavc58.54.100 pcm_s16le
size= 332250kB time=00:59:04.00 bitrate= 768.0kbits/s speed=1.07e+03x
video:0kB audio:332250kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000032%

Use ffprobe to check conversion result.
M:\work\autosub\autosub\ffprobe.exe C:\Users\xxxx\AppData\Local\Temp\tmph4x527m3.wav -show_format -pretty -loglevel quiet
[FORMAT]
filename=C:\Users\xxxx\AppData\Local\Temp\tmph4x527m3.wav
nb_streams=1
nb_programs=0
format_name=wav
format_long_name=WAV / WAVE (Waveform Audio)
start_time=N/A
duration=0:59:04.000000
size=324.462996 Mibyte
bit_rate=768 Kbit/s
probe_score=99
TAG:copyright=© 2015 Youzimu Fansub
TAG:encoder=Lavf58.29.100
[/FORMAT]

Conversion complete. Use Auditok to detect speech regions.

"C:\Users\xxxx\AppData\Local\Temp\tmph4x527m3.wav" has been deleted.

Converting speech regions to short-term fragments. Converting: 100% |#######################################################################################################################################################################################| Time: 0:00:50 Use the API key given in the option "-skey"/"--speech-key".

Sending short-term fragments to Google Cloud Speech V1P1Beta1 API and getting result. Speech-to-Text: 100% |###################################################################################################################################################################################| Time: 0:00:35 Receive something unexpected: {} Error: Speech-to-text failed. All works done.

njukiller commented 4 years ago

https://drive.google.com/open?id=1SrAThjZ_nCIlvB_KqL1ExsDLfiRldjMP

This is one of the small test files.

njukiller commented 4 years ago

autosub -i .\tmp__ew8zco.flac -S en-US -bm all -F txt
Destination language not provided. Only performing speech recognition.
Speech language is the same as the Destination language. Only performing speech recognition.

Convert source file to "C:\Users\xxxx\AppData\Local\Temp\tmpz4s94ezd.wav" to detect audio regions.
M:\work\autosub\autosub\ffmpeg.exe -hide_banner -y -i ".\tmpew8zco.flac" -vn -ac 1 -ar 48000 "C:\Users\xxxx\AppData\Local\Temp\tmpz4s94ezd.wav"
Input #0, flac, from '.\tmpew8zco.flac':
Metadata:
major_brand : isom
minor_version : 1
compatible_brands: isom
copyright : © 2015 Youzimu Fansub
encoder : Lavf58.29.100
Duration: 00:00:06.49, start: 0.000000, bitrate: 665 kb/s
Stream #0:0: Audio: flac, 44100 Hz, mono, s32 (24 bit)
Stream mapping:
Stream #0:0 -> #0:0 (flac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'C:\Users\xxxx\AppData\Local\Temp\tmpz4s94ezd.wav':
Metadata:
major_brand : isom
minor_version : 1
compatible_brands: isom
ICOP : © 2015 Youzimu Fansub
ISFT : Lavf58.29.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s
Metadata:
encoder : Lavc58.54.100 pcm_s16le
size= 609kB time=00:00:06.49 bitrate= 768.1kbits/s speed= 649x
video:0kB audio:608kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.017655%

Use ffprobe to check conversion result.
M:\work\autosub\autosub\ffprobe.exe C:\Users\xxxx\AppData\Local\Temp\tmpz4s94ezd.wav -show_format -pretty -loglevel quiet
[FORMAT]
filename=C:\Users\xxxx\AppData\Local\Temp\tmpz4s94ezd.wav
nb_streams=1
nb_programs=0
format_name=wav
format_long_name=WAV / WAVE (Waveform Audio)
start_time=N/A
duration=0:00:06.490000
size=608.544922 Kibyte
bit_rate=768.135000 Kbit/s
probe_score=99
TAG:copyright=© 2015 Youzimu Fansub
TAG:encoder=Lavf58.29.100
[/FORMAT]

Conversion complete. Use Auditok to detect speech regions.

"C:\Users\xxxx\AppData\Local\Temp\tmpz4s94ezd.wav" has been deleted.

Converting speech regions to short-term fragments. Converting: 100% |#######################################################################################################################################################################################| Time: 0:00:00

Sending short-term fragments to Google Speech V2 API and getting result. Speech-to-Text: N/A% | | ETA: --:--:--['{"result":[]}', '']

njukiller commented 4 years ago

After more testing, I found that everything works with version 0.5.4: both subtitle generation and translation complete normally.

BingLingGroup commented 4 years ago

https://drive.google.com/open?id=1SrAThjZ_nCIlvB_KqL1ExsDLfiRldjMP

This is one of the small test files.

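I did some debugging and I think I understand it now. Testing with the file you provided, I found that the audio fragments cut by Auditok were too short: the output timeline shows that the second fragment is under 0.5 s. As a result, Google Speech-to-Text API V2 and Google Cloud Speech-to-Text API returned empty content (the Google Cloud Speech-to-Text API documentation does not say how short the audio has to be to produce empty content) instead of JSON content carrying a result. A normal response is JSON whose key-value pairs hold the recognition result, which is where the program reads it from; you can see this with -of full-src. "Empty content" means there is nothing inside at all, not even key-value pairs. The new version handles empty content by raising an exception and exiting. It should not exit here; it should simply continue recognizing.

The exit was designed in because, once the previous version added Google Cloud Speech-to-Text API support, part of the returned error information had to be shown to the user for reference. If the program neither prints anything nor stops when an error occurs, users have no cue to change their configuration. For example, if the API has not been enabled, Google Cloud Speech-to-Text returns a message (in JSON) telling the user to enable it.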
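To make the difference between an empty response and a normal one concrete, here is a minimal Python sketch that scans response lines such as `['{"result":[]}', '']` for a transcript. The function name `extract_transcript` is hypothetical, and the shape of the non-empty sample is an assumption about the V2 response format; this is not autosub's actual parsing code.

```python
import json


def extract_transcript(raw_lines):
    """Return the first transcript found in a list of response lines, else None.

    Hypothetical helper for illustration only.
    """
    for line in raw_lines:
        if not line.strip():
            continue  # skip blank lines such as the trailing ''
        try:
            payload = json.loads(line)
        except ValueError:
            continue  # the line is not JSON at all
        if not isinstance(payload, dict):
            continue
        # An empty response looks like {"result": []}: valid JSON, no content.
        for result in payload.get("result") or []:
            for alternative in result.get("alternative", []):
                transcript = alternative.get("transcript")
                if transcript:
                    return transcript
    return None


# The exact lines shown in the progress output of this issue.
empty_response = ['{"result":[]}', '']
print(extract_transcript(empty_response))
```

Returning `None` instead of raising lets the caller treat a too-short fragment as "no text" and move on to the next fragment.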

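My fix strategy is as follows. For Google Speech-to-Text API V2, no returned error interrupts the run; after all, a free API has no configuration that needs adjusting. If you need to see the full response, use -of full-src. For Google Cloud Speech-to-Text API, the behavior adapts to the response: if it is empty content, recognition continues; if it is an error message, the program exits.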
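The strategy above can be sketched as a small decision function. This is only an illustration under assumptions: `Action` and `decide` are hypothetical names, `"gsv2"` is an assumed label for the free V2 API (`gcsv1` is the value passed to -sapi earlier in this thread), and the payload shapes (`result`/`alternative` for V2, `results`/`alternatives` and a top-level `error` key for the Cloud API) are assumptions about the responses. It is not the code from the actual fix.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"    # usable transcript: write it to the subtitle
    SKIP = "skip"    # empty content: leave the fragment blank and continue
    ABORT = "abort"  # error message: stop so the user can fix the configuration


def decide(api, payload):
    """Decide what to do with one decoded JSON response (hypothetical helper)."""
    if api == "gsv2":
        # Free API: never interrupt the run, whatever came back.
        has_text = any(
            alt.get("transcript")
            for res in payload.get("result", [])
            for alt in res.get("alternative", [])
        )
        return Action.KEEP if has_text else Action.SKIP

    # Cloud API: an error object means something the user must act on.
    if "error" in payload:
        return Action.ABORT
    has_text = any(
        alt.get("transcript")
        for res in payload.get("results", [])
        for alt in res.get("alternatives", [])
    )
    return Action.KEEP if has_text else Action.SKIP


# The "{}" case from the log above no longer stops the run.
print(decide("gcsv1", {}))
```

Keeping the decision separate from the request loop means the loop only has to react to three outcomes, regardless of which API produced the response.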

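Strictly speaking, this bug is entirely Google's fault (blame successfully shifted): none of this is in the documentation, so it could only be found by debugging. Of course, how Auditok decides where to cut matters a lot too... I am considering raising the default minimum audio fragment length, for example to above 0.8-1 s.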
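As a rough illustration of how a very short fragment can appear and one way a minimum length could be respected, the sketch below compares cutting a long region at a fixed maximum length with cutting it into equal pieces. Both functions and the 10 s maximum are hypothetical; this is neither Auditok's nor autosub's actual splitting logic.

```python
import math


def split_fixed(start, end, max_len):
    """Cut at every max_len seconds; the last piece can be arbitrarily short."""
    pieces = []
    cursor = start
    while end - cursor > max_len:
        pieces.append((cursor, cursor + max_len))
        cursor += max_len
    pieces.append((cursor, end))
    return pieces


def split_even(start, end, max_len):
    """Cut into the fewest equal pieces that each fit within max_len."""
    duration = end - start
    count = max(1, math.ceil(duration / max_len))
    step = duration / count
    # Use the exact end value for the last boundary to avoid float drift.
    bounds = [start + i * step for i in range(count)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


# A 10.3 s region with a 10 s maximum (both values are made up).
for name, pieces in (
    ("fixed", split_fixed(0.0, 10.3, 10.0)),
    ("even", split_even(0.0, 10.3, 10.0)),
):
    shortest = min(e - s for s, e in pieces)
    print(name, len(pieces), "pieces, shortest:", round(shortest, 2), "s")
```

With even splitting, any region longer than the maximum yields pieces longer than half the maximum, so a minimum such as 0.8-1 s holds as long as it is at most half of the maximum length.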

njukiller commented 4 years ago

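A very detailed analysis. I suspected the same thing, because later, on version 0.5.5, I added -of full-src and it did output the JSON; it contained recognition results, but the second segment was an empty field.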

BingLingGroup commented 4 years ago

Commit https://github.com/BingLingGroup/autosub/commit/1fe1f3d5a530585fd5aa2b01685d0d1b28cb2f53 should fix this issue. Thanks for the feedback.

BingLingGroup commented 4 years ago

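To solve the empty-result problem for good, the new version changes -sml to -nsml: from 0.5.6a on, a minimum segment length is enforced by default, which avoids the empty API results caused by segments that are too short (in practice, overly short segments usually come from splitting a region that exceeded the maximum length limit). As for responses that come back normally but with an empty result, you can use -der to remove them; -der and -mnc are generally used together. The downside, of course, is that this may delete empty cues that could be meaningful, so it is a personal trade-off.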
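For readers weighing that trade-off, here is a minimal sketch of what dropping empty cues after recognition might look like. The `Cue` type, `filter_cues`, and the confidence threshold are all hypothetical: the thread does not spell out what -der or -mnc do internally, so the confidence check is only an assumption about a companion filter, not a description of autosub's options.

```python
from typing import List, NamedTuple, Optional


class Cue(NamedTuple):
    start: float                        # seconds
    end: float                          # seconds
    text: str                           # recognized text, possibly empty
    confidence: Optional[float] = None  # None when the API gave no score


def filter_cues(cues: List[Cue], drop_empty: bool = True,
                min_confidence: float = 0.0) -> List[Cue]:
    """Return the cues worth keeping (hypothetical post-processing step)."""
    kept = []
    for cue in cues:
        if drop_empty and not cue.text.strip():
            continue  # empty result: drop the cue and its timing
        if cue.confidence is not None and cue.confidence < min_confidence:
            continue  # low-confidence text: assumed companion filter
        kept.append(cue)
    return kept


# Made-up cues: one normal, one empty, one low-confidence.
cues = [
    Cue(0.0, 2.1, "hello there", 0.93),
    Cue(2.1, 2.5, "", None),
    Cue(2.5, 4.0, "uh", 0.31),
]
for cue in filter_cues(cues, drop_empty=True, min_confidence=0.5):
    print(cue.start, cue.end, cue.text)
```

Calling `filter_cues(cues, drop_empty=False)` keeps the empty cue, which is the choice to make when blank timings are still useful, for example as placeholders for manual transcription.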