KBlixt / subcleaner

removes ads from subtitle files cleanly.

False positives #2

Closed by KnifeFed 2 years ago

KnifeFed commented 2 years ago

I just ran this excellent script on all my media and here are the false positives I encountered:

[INFO]: Removed 1 subtitle blocks:
        [---------Removed Blocks----------]
        1150
        01:03:41,519 --> 01:03:43,650
        ...sexually explosive.
        [---------------------------------]

[INFO]: Removed 1 subtitle blocks:
        [---------Removed Blocks----------]
        43
        00:02:49,711 --> 00:02:52,921
        It's a great set.
        Full HD. 1080p,
        240 hertz, TrueMotion.
        [---------------------------------]

[INFO]: Removed 2 subtitle blocks:
        [---------Removed Blocks----------]
        333
        00:17:56,568 --> 00:17:58,768
        You want me to text my professor?

        334
        00:17:58,771 --> 00:18:00,771
        Yeah. Text... text him, text him.
        [---------------------------------]

[INFO]: Removed 6 subtitle blocks:
        [---------Removed Blocks----------]
        674
        00:21:17,817 --> 00:21:20,737
        Over and over,
        rickandmortyadventures.com.

        675
        00:21:20,820 --> 00:21:23,740
        www.rickandmorty.com.

        676
        00:21:23,824 --> 00:21:25,991
        www. rickandmortyadventures.

        677
        00:21:26,076 --> 00:21:27,661  
        All 100 years.                <--- Why did this block match at all? Is it just because it's between other matched blocks?

        678
        00:21:27,744 --> 00:21:30,329
        Every minute, rickandmorty.com.

        679
        00:21:30,413 --> 00:21:34,125
        www.100timesrickandmorty.com.
        [---------------------------------]

        ^This is from Rick and Morty - S01E01. Here it seems that everything resembling a URL gets removed, but the following was just a warning:

[WARNING]: Potential ads in 1 subtitle blocks, please verify:
            [---------Warning Blocks----------]
            550
            00:56:03,304 --> 00:56:09,175
            Craving big poker? Feast your eyes on Venom.
            $5 million GTD. AmericasCardroom.com

Here's a curious case:

[INFO]: Removed 1 subtitle blocks:
        [---------Removed Blocks----------]
        698
        00:51:27,500 --> 00:51:35,500
        Ripped By mstoll
        [---------------------------------]
[WARNING]: Potential ads in 1 subtitle blocks, please verify:
            [---------Warning Blocks----------]
            7
            00:00:46,000 --> 00:00:54,000
            Ripped By mstoll
            [---------------------------------]

How come the exact same pattern is only a warning the second time it's encountered?

All in all, not bad considering how many files it ran on 👍

KBlixt commented 2 years ago

Thank you so much for this! I'm constantly looking for false positives in order to improve the regex as well as the script algorithm. I'll be looking into this tomorrow.

Ultimately, though, this script isn't meant to catch 100% of cases, but I'm sure trying :D

But yes, as you've probably guessed, I'm not only looking at exact regex matches but also at whether a block is close to other likely ad blocks. As a result, identical blocks can be handled differently depending on where they are in the file. It also means blocks can get different warning levels when you run the script back to back. I'm constantly tuning it to improve it! :)
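To illustrate the idea, here is a minimal sketch of proximity-based scoring. It is not the script's actual code: the `AD_PATTERN` regex, the `REMOVE_AT` threshold, the `window` size, and the `Block` and `classify` names are all assumptions made up for this example, as is the `www.example.com` line.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern and threshold, chosen only for this illustration.
AD_PATTERN = re.compile(r"www\.|\.com\b|ripped by", re.IGNORECASE)
REMOVE_AT = 2  # combined score at which a block is removed


@dataclass
class Block:
    index: int
    text: str


def classify(blocks, window=2):
    """Map each block index to "remove", "warn" or "keep"."""
    base = [1 if AD_PATTERN.search(b.text) else 0 for b in blocks]
    result = {}
    for i, block in enumerate(blocks):
        lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
        neighbours = sum(base[lo:hi]) - base[i]  # regex matches near this block
        if base[i] + neighbours >= REMOVE_AT:
            result[block.index] = "remove"
        elif base[i]:
            result[block.index] = "warn"  # matched, but isolated
        else:
            result[block.index] = "keep"
    return result


isolated = [Block(7, "Ripped By mstoll"), Block(8, "Hello there.")]
clustered = [Block(697, "Subs from www.example.com"), Block(698, "Ripped By mstoll")]
print(classify(isolated))   # {7: 'warn', 8: 'keep'}
print(classify(clustered))  # {697: 'remove', 698: 'remove'}
```

Under these assumptions the same "Ripped By mstoll" text is only a warning when it stands alone and is removed when another match sits nearby. A block with no match of its own, such as "All 100 years.", would also be removed if enough of its neighbours match.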

Finding false positives is relatively easy since you can just go through the log. I'm also trying to reduce missed ads, but that's a lot more work since you'd have to go through the files manually.
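Going through the log can be partly automated. The sketch below assumes the log layout seen in the excerpts in this thread (a `Removed Blocks` or `Warning Blocks` header line, blocks separated by blank lines, and a closing dashed line). The `parse_log` function and the regex names are hypothetical helpers, not part of the script.

```python
import re

SECTION_START = re.compile(r"\[-+(Removed|Warning) Blocks-+\]")
SECTION_END = re.compile(r"\[-+\]")
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def parse_log(log_text):
    """Return (kind, index, timestamp, text) tuples for each logged block."""
    entries, current, kind = [], [], None

    def flush():
        # A complete block is: index line, timestamp line, then text lines.
        if len(current) >= 2 and current[0].isdigit() and TIMESTAMP.fullmatch(current[1]):
            entries.append((kind, int(current[0]), current[1], "\n".join(current[2:])))
        current.clear()

    for raw in log_text.splitlines():
        line = raw.strip()
        start = SECTION_START.fullmatch(line)
        if start:
            kind = start.group(1).lower()
            current.clear()
        elif kind is None:
            continue  # outside any block section
        elif SECTION_END.fullmatch(line):
            flush()
            kind = None
        elif not line:
            flush()  # a blank line separates blocks
        else:
            current.append(line)
    flush()  # handle a section that was never closed
    return entries


SAMPLE = """[INFO]: Removed 1 subtitle blocks:
        [---------Removed Blocks----------]
        698
        00:51:27,500 --> 00:51:35,500
        Ripped By mstoll
        [---------------------------------]
[WARNING]: Potential ads in 1 subtitle blocks, please verify:
            [---------Warning Blocks----------]
            7
            00:00:46,000 --> 00:00:54,000
            Ripped By mstoll
            [---------------------------------]
"""

for entry in parse_log(SAMPLE):
    print(entry)
```

Collecting the entries this way makes it possible to sort or deduplicate removed texts across many files before reviewing them by eye.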

KnifeFed commented 2 years ago

Thanks!

KBlixt commented 2 years ago

Nice ;)