devmaxxing / videocr-PaddleOCR

Extract hardcoded subtitles from videos using machine learning
MIT License

For the same video, using Traditional Chinese produces repeated dialogue instead of one single continuous dialogue #13

Closed oscardoudou closed 1 year ago

oscardoudou commented 1 year ago
0
00:00:00,542 --> 00:00:02,252
植物人?

1
00:00:03,795 --> 00:00:05,088
不會吧

2
00:00:05,547 --> 00:00:07,841
在智比赛打傷了封手。

3
00:00:08,091 --> 00:00:08,633
上面没這磨窝

vs

0
00:00:00,542 --> 00:00:01,668
植物人?

1
00:00:01,710 --> 00:00:01,960
植物人3

2
00:00:02,001 --> 00:00:02,252
植物人?

3
00:00:03,795 --> 00:00:05,088
不會吧

4
00:00:05,547 --> 00:00:07,257
在練習比賽打傷了對手

5
00:00:07,298 --> 00:00:07,298
在練習比產傷

6
00:00:07,382 --> 00:00:07,382
在練習比賽打傷

7
00:00:07,424 --> 00:00:07,424
在練習比產不傷

8
00:00:07,507 --> 00:00:07,841
在練習比賽打傷了對手

9
00:00:08,091 --> 00:00:08,633
上面波這 磨寫

For the former output I used the language code ch, but some characters were detected incorrectly, so I figured I should switch to the accurate subtitle language code. The latter is the same video with the correct language code, chinese_cht, but the timeline is messed up: I get repeated dialogue lines that are supposed to be one single continuous line. Some characters are now detected correctly, though; e.g., 在智比赛 is now correctly detected as 練習比賽.

Any idea which parameter I should tweak, or does the model for Traditional Chinese have an issue? Thanks! I appreciate your work.

devmaxxing commented 1 year ago

@oscardoudou The repeated dialogue is caused by the Traditional Chinese model detecting different characters in subsequent frames, so lines that are sufficiently different get output separately.

Other than improving the accuracy of the Traditional Chinese model, which I imagine would be pretty challenging and time-consuming, you could try lowering the sim_threshold parameter to relax how similar lines need to be in order to be merged.
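To illustrate why a lower threshold helps, here is a minimal sketch of similarity-based merging. It is not the library's actual implementation: the helper names `similarity` and `merge_similar` are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in for whatever string-similarity measure the library applies. The sample lines are entries 4 through 8 from the chinese_cht output above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stand-in metric, an assumption)."""
    return SequenceMatcher(None, a, b).ratio() * 100


def merge_similar(lines, sim_threshold):
    """Merge consecutive (start, end, text) entries.

    An entry is folded into the previous one when its text scores at
    least sim_threshold against the previous entry's text. The longer
    text is kept as the representative of the merged entry.
    """
    merged = []
    for start, end, text in lines:
        if merged and similarity(merged[-1][2], text) >= sim_threshold:
            prev_start, _, prev_text = merged[-1]
            best = text if len(text) > len(prev_text) else prev_text
            # Extend the previous entry so it also covers this frame range.
            merged[-1] = (prev_start, end, best)
        else:
            merged.append((start, end, text))
    return merged


# Entries 4-8 from the chinese_cht output in this issue.
lines = [
    ("00:00:05,547", "00:00:07,257", "在練習比賽打傷了對手"),
    ("00:00:07,298", "00:00:07,298", "在練習比產傷"),
    ("00:00:07,382", "00:00:07,382", "在練習比賽打傷"),
    ("00:00:07,424", "00:00:07,424", "在練習比產不傷"),
    ("00:00:07,507", "00:00:07,841", "在練習比賽打傷了對手"),
]

if __name__ == "__main__":
    for threshold in (80, 55):
        result = merge_similar(lines, threshold)
        print(threshold, len(result), result)
```

With this stand-in metric, a threshold of 80 leaves all five entries separate, while 55 collapses them into a single entry spanning 00:00:05,547 to 00:00:07,841. The trade-off is that a threshold set too low can also merge consecutive lines that are genuinely different, so it is worth lowering the value gradually.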