Inconsistent removal of ads with same text, even within same file

Came across an ad that is making it through somehow. It does get removed sometimes, but very inconsistently.

2022-12-05 07:43:43: SUBTITLE: "/movies/Amityville Horror, The (2005)/The Amityville Horror (2005) [Bluray-720p].en.srt"
2022-12-05 07:43:43:     [INFO]: Didn't run language detection.
2022-12-05 07:43:43:     [INFO]: Removed 2 subtitle blocks:
2022-12-05 07:43:43:             [---------Removed Blocks----------]
2022-12-05 07:43:43:             733
2022-12-05 07:43:43:             01:22:43,006 --> 01:28:32,805
2022-12-05 07:43:43:             Subtitles by ARAVIND B
2022-12-05 07:43:43:             [by_agentsmith@yahoo.com]
2022-12-05 07:43:43: 
2022-12-05 07:43:43:             734
2022-12-05 07:43:43:             01:28:33,305 --> 01:28:39,479
2022-12-05 07:43:43:             Shop this show's fashion on LookLive.com
2022-12-05 07:43:43:             [---------------------------------]
2022-12-05 07:43:43:     [WARNING]: Potential ads in 1 subtitle blocks, please verify:
2022-12-05 07:43:43:                [---------Warning Blocks----------]
2022-12-05 07:43:43:                1
2022-12-05 07:43:43:                00:00:06,000 --> 00:00:12,073
2022-12-05 07:43:43:                Shop this show's fashion on LookLive.com
2022-12-05 07:43:43:                [---------------------------------]
2022-12-05 07:43:43:     [INFO] To remove all these blocks use: 
2022-12-05 07:43:43: subcleaner '/movies/Amityville Horror, The (2005)/The Amityville Horror (2005) [Bluray-720p].en.srt' -d 1

One gets removed, the other doesn't even though it appears to be exactly the same.

This is expected behavior. The script not only uses regex to estimate which blocks are ads, it also looks at proximity to other suspicious subtitle blocks.

This means that the exact same block could get a warning flag at the start of the movie and be flagged as an ad at the end. The reason could be that there are surrounding blocks that increase the suspiciousness around that part of the subtitle. In your log, block 734 sits right after the credits block 733, while block 1 stands on its own.
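To illustrate how proximity can change the outcome for identical text, here is a minimal sketch. It is not subcleaner's actual code: `AD_PATTERN`, the score constants, and `classify` are hypothetical names and values chosen only to reproduce the situation in the log above.

```python
import re

# Hypothetical pattern and scoring constants, for illustration only.
AD_PATTERN = re.compile(r"looklive|subtitles by", re.IGNORECASE)
REGEX_SCORE = 1      # score for a regex hit in the block itself
NEIGHBOUR_SCORE = 1  # extra score when an adjacent block also has a hit
AD_THRESHOLD = 2     # blocks at or above this score are treated as ads


def classify(blocks):
    """Return 'ad', 'warning' or 'ok' for each subtitle text, in order."""
    hits = [bool(AD_PATTERN.search(text)) for text in blocks]
    labels = []
    for i, hit in enumerate(hits):
        score = REGEX_SCORE if hit else 0
        # Look at the block directly before and directly after this one.
        neighbours = hits[max(i - 1, 0):i] + hits[i + 1:i + 2]
        if hit and any(neighbours):
            score += NEIGHBOUR_SCORE
        if score >= AD_THRESHOLD:
            labels.append("ad")
        elif score > 0:
            labels.append("warning")
        else:
            labels.append("ok")
    return labels


blocks = [
    "Shop this show's fashion on LookLive.com",  # like block 1, isolated
    "What was that noise?",                      # ordinary dialogue
    "Subtitles by ARAVIND B",                    # like block 733
    "Shop this show's fashion on LookLive.com",  # like block 734
]
print(classify(blocks))  # ['warning', 'ok', 'ad', 'ad']
```

The first and last entries have the same text, but only the last one has a suspicious neighbour, so only that one crosses the removal threshold.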

But you've made me think of an improvement for the cleaner. I could check whether a warning block is similar enough to an ad block and use that as a trigger to remove these blocks as well.
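One possible way to sketch that idea, assuming text similarity is measured with Python's standard `difflib.SequenceMatcher`. The function names and the 0.9 threshold are hypothetical and not part of subcleaner.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())


def promote_similar_warnings(ad_texts, warning_texts, threshold=0.9):
    """Return the warning texts that closely match any already-removed ad."""
    promoted = []
    for warning in warning_texts:
        for ad in ad_texts:
            ratio = SequenceMatcher(None, normalize(warning), normalize(ad)).ratio()
            if ratio >= threshold:
                promoted.append(warning)
                break  # one close match is enough
    return promoted


ads = [
    "Subtitles by ARAVIND B\n[by_agentsmith@yahoo.com]",
    "Shop this show's fashion on LookLive.com",
]
warnings = [
    "Shop this show's fashion on LookLive.com",
    "Timing by the government",
]
print(promote_similar_warnings(ads, warnings))
```

With the blocks from the log, the LookLive warning would be promoted because an identical block was already removed, while an unrelated warning would stay a warning.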

I've added looklive to the default regex. But if you see a common ad making it through, I'd recommend adding it to the regex as a more reliable way to remove it. Or let me know in the discussion section and I'll add it to the default regex.

I think the default regex should only contain aggressive ads, such as donations, betting, racism, war, and so on. All other regex should live in a separate folder, like a community regex, controlled by a setting (a boolean in the config that defaults to inactive). Or even simpler: the community regex file is only a template that users can start from, with no need to integrate it into the app, like a docker-compose example. That way you, as the repo owner, don't need to spend much time checking for false positives and can focus more on the algorithm, and users have the option to use the more advanced regex, or regex in their own language, if they like. Sorry for my bad English, it's not my native language.

The current default regex should only target ads as it is. I'd like to know which regex patterns you feel are not specific enough. My goal is to reduce false positives wherever possible, so if you've found any, let me know and I'll improve the regex by making it more specific. Because, as you say, being able to trust the script to only remove ads is critical. Users shouldn't be expected to look through all the removed blocks to confirm that there weren't any false positives.

Another goal is to make the script easy to install and use while still letting users be confident in the results. Looking at regex and customizing it is for advanced users, and users should only really do major customizations if they really know what they are doing.

I am working on the algorithm as well, but the algorithm and the regex go hand in hand and work together to identify different suspicious patterns. Some patterns are more obvious than others, which is why I combine both parts into a score and then filter based on that score. Most of the score will come from the regex, but some of it will come directly from the algorithm.

The included regex is what I recommend and there isn't anything stopping the community from creating and maintaining a more aggressive regex config. Usually though if the community improves the regex there is no downside to using it since most people agree with me that 1 false positive is worse than 10 false negatives. If that were to change we could have both configs in the tool and then the user could enable the more aggressive default config in the master config file, as you say with a boolean.

We could even develop an entirely different algorithm to use with the community regex if they have something they'd like to try.

No problem with the English, I hope I understood your post correctly. English isn't my native language either.

My thought:

For the algorithm part, instead of just the purge and warning system (3 warnings = delete), we can try weights. For a confident ad keyword (purge regex), set the weight to something like 60. For each match in the warning regex, set a weight of 10-30. An adjacent block gets a penalty of 20-30. An option to decrease the weight is essential, so some edge cases can be avoided. Silly example: "Timing by [somebody]" is considered an ad, but some movie has the line "timing by the government". With the current regex you must use a regex lookaround to avoid that, but it can only be used once. With the weight system, you can assign "timing by the government" a weight of like -20 and so on. Now just calculate the total weight of the block. In the app, the user can set the weight thresholds (like a confidence level) for remove and warning, for example 60 for remove and 40 for warning. Sometimes a block in a subtitle file contains multiple lines, with an ad in one sentence and a normal line in another, so an option to remove only the line that contains the ad would be helpful (so weight per sentence, not per block; the sentence separator can be ". " or "\n", ...). The script could also have an option for trace or debug logging that shows the total weight per sentence/block, so users can easily debug false positives.
For the regex part, the file would then just be lines, each a tuple of (keyword, weight), where the keyword can also be keyword1|keyword2|... to avoid duplicate matches.
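A rough sketch of what this proposal could look like. The rules, weights, and function names below are made up for illustration, and treating the adjacency "penalty" as an added weight is my interpretation of the proposal. None of this exists in subcleaner.

```python
import re

# Hypothetical rule list of (pattern, weight); a negative weight lowers the score.
RULES = [
    (re.compile(r"looklive\.com", re.IGNORECASE), 60),
    (re.compile(r"\btiming by\b", re.IGNORECASE), 30),
    (re.compile(r"\bsubtitles by\b", re.IGNORECASE), 30),
    (re.compile(r"\btiming by the government\b", re.IGNORECASE), -20),
]
ADJACENCY_WEIGHT = 20  # added when a neighbouring block is suspicious
REMOVE_AT = 60         # user-configurable thresholds in the proposal
WARN_AT = 40


def block_weight(text, neighbour_is_suspicious=False):
    """Sum the weights of all matching rules, plus an adjacency bonus."""
    weight = sum(w for pattern, w in RULES if pattern.search(text))
    if neighbour_is_suspicious and weight > 0:
        weight += ADJACENCY_WEIGHT
    return weight


def verdict(weight):
    """Map a total weight to remove / warning / keep."""
    if weight >= REMOVE_AT:
        return "remove"
    if weight >= WARN_AT:
        return "warning"
    return "keep"


for text, near in [
    ("Shop this show's fashion on LookLive.com", False),
    ("Timing by the government", False),
    ("Timing by somebody", True),
]:
    w = block_weight(text, near)
    print(f"{w:>3} {verdict(w):<8} {text}")
```

Printing the weight next to the verdict is also roughly what the proposed debug logging would show per block.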

What I meant in the previous comment is that I have my own regex in my language, but if I create a PR to the default regex, it may cause more harm because it is just my preference, based on the limited subtitle files I've encountered. But if you have a folder like community regex or user-shared regex, I'm happier to create a PR for it, so others can have a look and modify it to their needs.

For the algorithm part, instead of just the purge and warning system (3 warnings = delete), we can try weights. For a confident ad keyword (purge regex), set the weight to something like 60. For each match in the warning regex, set a weight of 10-30. An adjacent block gets a penalty of 20-30.

Well, the warning system is a rough weighting system. Each match in the warning regex is given a weight of 1, and at 3 or more a block gets marked as an ad. I've kept the weighting system simple in order to keep it predictable.
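As a minimal sketch of that counting rule: the patterns below are invented for the example and are not the real default warning regex. Counting one point per matching pattern and labelling a single hit as a warning are also assumptions.

```python
import re

# Hypothetical warning patterns; the real default config differs.
WARNING_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bsubtitles?\b", r"\bsync(ed)?\b", r"\bby\b", r"www\.")
]
AD_AT = 3  # "at 3 or more" a block is marked as an ad


def warning_count(text):
    """Each matching warning pattern contributes a weight of 1."""
    return sum(1 for pattern in WARNING_PATTERNS if pattern.search(text))


def label(text):
    count = warning_count(text)
    if count >= AD_AT:
        return "ad"
    if count >= 1:
        return "warning"
    return "ok"


print(label("Subtitles synced by www.example.org"))  # ad
print(label("Stand by me"))                          # warning
print(label("Run!"))                                 # ok
```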

Option to decrease the weight is essential

I think very few users would actually customize the weighting system. It could nevertheless be a tool to improve results, but it would break existing configs and I'd prefer not to do that. It would require a major overhaul of the entire script, and I don't think this feature is important enough to be prioritized over other improvements.

With the weight system, you can assign "timing by the government" a weight of like -20 and so on

I've considered implementing a "white regex" option that removes warnings. I've decided against it for now, but I'm open to changing my mind. My argument against a white regex is that if a regex is causing false positives, then that regex should be changed. Say a regex falsely flags "timing by the government" and I add that phrase to the white regex. If the same bad regex later falsely flags "timing by the UN", I would have to add that to the white regex as well. It would be better and more reliable in the long run to alter the regex that causes those false positives. And if this alteration causes some false negatives, then I could either choose to live with that or try to implement a different method to catch those ads.
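To make the trade-off concrete, here is a small sketch comparing a broad pattern with a narrower one. Both patterns are hypothetical and are not taken from the default regex. The narrow one only accepts a credit line with a single name token, which is one of many possible ways to tighten it.

```python
import re

# Hypothetical broad pattern: flags any mention of "timing by".
BROAD = re.compile(r"timing by", re.IGNORECASE)

# Hypothetical narrower pattern: the whole line must be a credit
# followed by exactly one handle-like token.
NARROW = re.compile(r"^\W*timing\s+by\s*:?\s*[\w.\-]+\s*$", re.IGNORECASE)

lines = [
    "Timing by: agentsmith",     # real credit
    "Timing by the government",  # dialogue, false positive for BROAD
    "Timing by the UN",          # dialogue, would need a second whitelist entry
    "Timing by John Smith",      # real credit, becomes a false negative
]

for line in lines:
    broad_hit = bool(BROAD.search(line))
    narrow_hit = bool(NARROW.search(line))
    print(f"broad={broad_hit!s:<5} narrow={narrow_hit!s:<5} {line}")
```

The narrower pattern handles both dialogue lines without any whitelist, at the cost of missing the two-word name, which is the kind of false negative that would then need to be accepted or caught by another method.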

Sometimes a block in a subtitle file contains multiple lines, with an ad in one sentence and a normal line in another, so an option to remove only the line that contains the ad would be helpful (so weight per sentence, not per block; the sentence separator can be ". " or "\n", ...).

I've tried to find ads per line or per sentence in order to catch blocks that contain both ads and content. It didn't work very well. I could give it another shot in the future, but it's much harder than per block. And from my experience very few subtitles contain these hybrid blocks.

Also the script has option to trace logging or debug logging with each total weight per sentence/block so user can easy debug the false positive.

Good idea. It would take quite a bit of effort to implement this feature. I have ideas for making the tool's decisions easier to review, and this could potentially be part of that.

For the regex part, the file would then just be lines, each a tuple of (keyword, weight), where the keyword can also be keyword1|keyword2|... to avoid duplicate matches.

Most of the purge regex as it stands is just keyword1|keyword2|specific ad phrase1|specific ad phrase2. There are some exceptions, but these are mostly combined keywords and phrases. The warning regex is a bit less specific, but since a block requires multiple matches there, it's very uncommon for these words to appear in the same sentence unless they are part of an ad.

...if I create a PR to the default regex, it may cause more harm because it is just my preference, based on the limited subtitle files I've encountered.

I review everything going into the default regex. If you have any custom regex that I don't allow in the default regex file, then you could just write an additional regex config to include it.

But if you have a folder like community regex or user share regex, i'm more happy to create a PR for it, so others can have a look and modify to their need.

If I just accept everybody's regex without looking at it, then any change to the community regex could break the regex for someone else. So a community regex would also require a maintainer in order to prevent that. But it can't be me, since I would simply allow or disallow the same patterns in the community regex as in the default regex, because I'm trying to achieve the best possible result with both of them.

I've added looklive to the default regex. But if you see a common ad making it through, I'd recommend adding it to the regex as a more reliable way to remove it. Or let me know in the discussion section and I'll add it to the default regex.

Will do. That's the only common ad I've found so far, it's done a pretty good job. Thanks for the excellent work.

KBlixt / subcleaner

Inconsistent removal of ads with same text, even within same file #20