KBlixt / subcleaner

removes ads from subtitle files cleanly.

Need to be a bit more strict #27

Closed: winbatch closed this issue 1 year ago

winbatch commented 1 year ago

The first block is valid (it's from the show Devs). I am running with --sensitive and it was still removed.

INFO: now cleaning subtitle: /mnt/tv/Devs/Devs.S01E03.1080p.AMZN.WEBRip.DDP5.1.x264-TEPES[rarbg]/Devs.S01E03.1080p.AMZN.WEB-DL.DDP5.1.H.264-TEPES.en.srt
INFO: Done. Cleaning report: 2 deleted blocks and 7 warnings remaining.
[---------Removed Blocks----------]
12
00:03:56,470 --> 00:03:59,573
It's the celebrity sex tape
to end all celebrity sex tapes.
513
00:42:54,072 --> 00:42:56,007
Captioned by
Media Access Group at WGBH
[---------------------------------]

KBlixt commented 1 year ago

--sensitive is a debugging option that doesn't change the behavior other than increasing logging.

This seems to be an unfortunate subtitle. Some blocks are bound to be false positives; I'll do what I can, but some of them are unfortunately unavoidable.

This is why I'm currently developing a review process where you get to review every block that gets deleted but could plausibly be a valid subtitle.

At this point the script is really very good at finding ads without removing valid subtitles. While I can put even more time into improving the included regex, it will never be perfect, and at some point there needs to be a small manual process that takes care of the edge cases.

For now you'll have to manually restore these files. I'll take a look at whether the included regex needs to be adjusted here. But if it's between this one false positive and 10 ad blocks not getting removed, I'll probably lean towards living with the false positive.

winbatch commented 1 year ago

OK. Maybe we're on to something, though: I misinterpreted --sensitive to mean something else. Maybe you could add a flag that defines how 'aggressive' the cleaning should be, not unlike gzip's compression levels or the number of -v's for verbosity. The least aggressive level would remove only blocks that match known exact strings. One step up would do a bit more, such as removing a block if its text contains http or the word 'subtitle'. The next step would go further still, and so on.
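To make the idea concrete, here is a rough sketch of what tiered aggressiveness could look like. Everything in it is hypothetical: the `should_remove` function, the level numbers, and the pattern lists are illustrations only, not subcleaner's actual code or options.

```python
import re

# Hypothetical tiers for illustration; none of these names or patterns
# are taken from subcleaner itself.
EXACT_STRINGS = {"captioned by media access group at wgbh"}      # level 1
LEVEL2_PATTERNS = [re.compile(r"http"), re.compile(r"\bsubtitle")]  # level 2
LEVEL3_PATTERNS = [re.compile(r"celebrity sex")]                  # level 3


def normalize(text: str) -> str:
    """Lowercase the block text and collapse line breaks and extra spaces."""
    return " ".join(text.lower().split())


def should_remove(block_text: str, aggressiveness: int) -> bool:
    """Return True if the block would be removed at the given level (1-3)."""
    text = normalize(block_text)
    # Level 1: only known, exact ad strings.
    if aggressiveness >= 1 and text in EXACT_STRINGS:
        return True
    # Level 2: also blocks containing "http" or the word "subtitle".
    if aggressiveness >= 2 and any(p.search(text) for p in LEVEL2_PATTERNS):
        return True
    # Level 3: also broader, riskier patterns.
    if aggressiveness >= 3 and any(p.search(text) for p in LEVEL3_PATTERNS):
        return True
    return False
```

With tiers like these, the Devs line above would survive at levels 1 and 2 and only be removed at level 3, while the WGBH caption credit would be removed at every level.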

ChristianMalazarte commented 1 year ago

I have no idea how this was removed. There is no regex that matches this.

| It's the celebrity sex tape
| to end all celebrity sex tapes.

KBlixt commented 1 year ago

"Celebrity sex" is a warning regex. I've seen this before: it's an Office episode where they discuss this at the end of the show, right?

I thought I fixed it then but I'll try again. Could you send me the original subtitle file?
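For anyone wondering how the block gets flagged, a minimal sketch, assuming the warning regex is roughly the case-insensitive phrase "celebrity sex" (the exact pattern in the regex files isn't quoted in this thread, so `WARNING_REGEX` here is a stand-in):

```python
import re

# Stand-in pattern; the real warning regex may be written differently.
WARNING_REGEX = re.compile(r"celebrity sex", re.IGNORECASE)

block = "It's the celebrity sex tape\nto end all celebrity sex tapes."

# Collect every occurrence of the phrase in the block text.
matches = WARNING_REGEX.findall(block)
print(len(matches))
```

Under that assumption the phrase matches once on each of the block's two lines.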

ChristianMalazarte commented 1 year ago

Oh, I see, it's probably in the other regex files. I only ever touch the english.conf.

No worries, not for me at least. I don't mind it. The script works pretty damn well already. Thanks for creating it :)