dsc-uob / subtitle

Subtitle decoder package, enables you to handle translation data smoothly.
https://dsc-uob.github.io/subtitle/
MIT License

A problem in the regex of VTT format subtitles. #4

Closed: tejesh-kaliki closed this issue 1 month ago

tejesh-kaliki commented 1 year ago

I tried reading VTT subtitles with this package, but no subtitles were returned.

While searching for the cause, I noticed that the following regex is used by default for the VTT format: r'(\d+)?\n(?:(\d{1,}):)?(?:(\d{1,2}):)?(\d{1,2})[.,]+(\d+)\s*-->\s*(?:(\d{1,2}):)?(?:(\d{1,2}):)?(\d{1,2}).(\d+)(?:.*(?:\r?(?!\r?).*)*)\n(.*(?:\r?\n(?!\r?\n).*)*)' With this regex, no subtitles were recognized.

Changing it to the following made it work correctly: r'(\d+)?\n(?:(\d{1,}):)?(?:(\d{1,2}):)?(\d{1,2})[.,]+(\d+)\s*-->\s*(?:(\d{1,2}):)?(?:(\d{1,2}):)?(\d{1,2}).(\d+)(?:.*(?:\r?(?!\r?)*)*)\n(.*(?:\r?\n(?!\r?\n).*)*)' (The only change is removing the . in (?:.*(?:\r?(?!\r?).*)*), which becomes (?:.*(?:\r?(?!\r?)*)*).)

Is that what caused the issue, or is something else going on? If it is the cause, I hope you can correct it.
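
One possible reading of why the one-character change helps, sketched in TypeScript on the assumption that Dart's RegExp follows JavaScript's regular-expression semantics: in the original pattern the negative lookahead (?!\r?) can never succeed, so the group after the end timestamp reduces to .*, and . does not match a carriage return. On input with CRLF line endings, the \n that follows then has nothing to match. With the . removed, the inner group can consume the \r. The sample cue, the CRLF assumption, and the helper name firstCueText are illustrative only; none of them comes from the package or from the reporter's file.

```typescript
// Parts shared by both patterns quoted in this issue (copied verbatim).
const HEAD = String.raw`(\d+)?\n(?:(\d{1,}):)?(?:(\d{1,2}):)?(\d{1,2})[.,]+(\d+)\s*-->\s*(?:(\d{1,2}):)?(?:(\d{1,2}):)?(\d{1,2}).(\d+)`;
const TAIL = String.raw`\n(.*(?:\r?\n(?!\r?\n).*)*)`;

// The two patterns differ only in the group between HEAD and TAIL.
const ORIGINAL = new RegExp(HEAD + String.raw`(?:.*(?:\r?(?!\r?).*)*)` + TAIL);
const FIXED = new RegExp(HEAD + String.raw`(?:.*(?:\r?(?!\r?)*)*)` + TAIL);

// Hypothetical helper: text of the first cue found, or null if nothing matches.
function firstCueText(pattern: RegExp, input: string): string | null {
  const match = pattern.exec(input);
  return match ? match[10] : null;
}

// Assumed sample input: the same cue with LF and with CRLF line endings.
const lfCue = "WEBVTT\n\n1\n00:00:01.000 --> 00:00:04.000\nHello world\n\n";
const crlfCue = lfCue.replace(/\n/g, "\r\n");

console.log(firstCueText(ORIGINAL, lfCue));
console.log(firstCueText(ORIGINAL, crlfCue));
console.log(firstCueText(FIXED, lfCue));
console.log(firstCueText(FIXED, crlfCue));
```

Under these assumptions, the original pattern finds the cue in the LF input but returns null for the CRLF input, while the modified pattern finds "Hello world" in both. Whether CRLF line endings were the actual cause for the reporter's file is not confirmed anywhere in this thread.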

bharattkukreja commented 1 year ago

I was using the package, and it looks like SRT files are not being parsed properly either. The regex that @tejesh-kaliki posted works for SRT files as well. One further minor change is needed: fetch the text with matcher.group(10) instead of matcher.group(11), which otherwise goes out of bounds. @MuhmdHsn313, this is worth fixing because I was unable to use the library without the corrected regex. Let me know if you want me to create a PR.
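
The group index mentioned above can be checked by counting capture groups. A minimal sketch in TypeScript, assuming Dart's RegExp behaves like JavaScript's: the modified pattern has ten capturing groups, so the cue text sits in group 10 and there is no group 11. The helper countGroups and the sample SRT cue are hypothetical. Note that JavaScript returns undefined for a missing group index, whereas the comment above reports an out-of-bounds error in Dart.

```typescript
// Modified pattern from this issue (copied verbatim).
const SOURCE = String.raw`(\d+)?\n(?:(\d{1,}):)?(?:(\d{1,2}):)?(\d{1,2})[.,]+(\d+)\s*-->\s*(?:(\d{1,2}):)?(?:(\d{1,2}):)?(\d{1,2}).(\d+)(?:.*(?:\r?(?!\r?)*)*)\n(.*(?:\r?\n(?!\r?\n).*)*)`;

// Hypothetical helper: appending an empty alternative makes the pattern
// match the empty string, and the result array then has one slot for the
// whole match plus one slot per capturing group.
function countGroups(source: string): number {
  return new RegExp(source + "|").exec("")!.length - 1;
}

// Assumed sample: a single SRT-style cue (comma before the milliseconds).
const srtCue = "1\n00:00:01,000 --> 00:00:04,000\nHello world\n\n";
const cueMatch = new RegExp(SOURCE).exec(srtCue);

console.log(countGroups(SOURCE)); // number of capturing groups
console.log(cueMatch ? cueMatch[10] : null); // cue text (last group)
console.log(cueMatch ? cueMatch[11] : null); // no such group in this pattern
```

For this sample, groups 2 to 5 hold the start time fields and groups 6 to 9 the end time fields, which leaves group 10 as the last one and the only candidate for the text.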

MuhmdHsn313 commented 1 year ago

Dear @bharattkukreja, yes please, create a pull request for it. Thanks.

MuhmdHsn313 commented 1 year ago

Any updates, @bharattkukreja @tejesh-kaliki?

thinhnd-nal commented 1 year ago

@MuhmdHsn313 The bug still seems to be unresolved in version 0.1.0. I had to roll back to version 0.1.0-beta.3 to avoid the error. You can check the example file here: https://thinhnd-nal.github.io/sample_files/activity_lifecycle.srt

MuhmdHsn313 commented 1 month ago

Solved with #12