KBlixt / subcleaner

removes ads from subtitle files cleanly.

False Positives #29

Closed frasderp closed 1 year ago

frasderp commented 1 year ago

Just wanted to share a few false positives that have come up, and how they might be captured in the regex (using just the global and English regex configs).

This came from Nightcrawler. I believe they are radio callsigns.

      |     [---------Removed Blocks----------]
      |     812
      |     01:04:28,679 --> 01:04:30,530
      |     3X21, go ahead
      |     
      |     813
      |     01:04:31,280 --> 01:04:34,610
      |     3X21, confirm address on the 211 at Bonhill Road.
      |     [---------------------------------]
      | 
      |                                         [---------Warning Blocks----------]
      |                                         130
      |                                         00:12:53,500 --> 00:12:58,502
      |                                         7-X-76 Roger this is David 1099965
      |                                         [---------------------------------]

and this from Wreck-It Ralph

      |     [---------Removed Blocks----------]
      |     1566
      |     01:39:59,576 --> 01:40:00,827
      |     You fixed it!
      |     
      |     1580
      |     01:40:56,175 --> 01:40:57,382
      |     You fixed it!
      |     [---------------------------------]

KBlixt commented 1 year ago

I'll take a look at these. The removals seem a bit more aggressive than I'd like. It would be great if you could send me the subtitles.

frasderp commented 1 year ago

Hey @KBlixt happy to. Where can I send the files?

KBlixt commented 1 year ago

Just put them in here. 👍

KBlixt commented 1 year ago

But you don't need to, I know why they are deleted.

I'll see what I can do about these, but they seem to be just unfortunate subtitles that get caught. Some false positives are inevitable. I'm working on an easy-to-use review process for these edge cases that will let you restore false positives and delete false negatives.

frasderp commented 1 year ago

@KBlixt ok thanks. Would you mind sharing which part of the regex is catching them? I'd like to review the conf also!

KBlixt commented 1 year ago

If you use the --explain option you'll get a list of reasons why a block got deleted. But in these cases I believe they are:

First subtitle: Both of them get 2 warnings from regex: global warnings 2 and 3, specifically this part "\b\d+\Wx\W\d+\b " for the 3x21 at the start.

Then since they are both close to a block with two warnings they both get an additional warning.

Second subtitle: They both get one warning for having identical content to another block, one warning for having "fixing" in them, and a final warning for being close to another block with 2 warnings (within 15 blocks, so barely in range).
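
The warning count described above can be sketched roughly as follows. This is not subcleaner's actual code: the `Block` class, the function name, and the constants are hypothetical, the regex warnings are passed in as plain counts instead of being computed from the real config, and the removal threshold of three warnings is an assumption inferred from the two examples in this thread.

```python
from collections import Counter
from dataclasses import dataclass

PROXIMITY_RANGE = 15      # "within 15 blocks", as stated in the thread
NEIGHBOUR_THRESHOLD = 2   # a nearby block with two warnings adds one more
REMOVE_THRESHOLD = 3      # assumption: three warnings get a block removed


@dataclass
class Block:
    index: int            # subtitle block number
    text: str             # subtitle content
    regex_warnings: int   # warnings from regex rules, supplied by hand here


def count_warnings(blocks):
    """Return {block index: total warnings} for a list of blocks."""
    text_counts = Counter(block.text for block in blocks)

    # Pass 1: regex warnings plus one warning for duplicated content.
    base = {}
    for block in blocks:
        duplicate = 1 if text_counts[block.text] > 1 else 0
        base[block.index] = block.regex_warnings + duplicate

    # Pass 2: one extra warning when another block in range already has
    # at least NEIGHBOUR_THRESHOLD warnings from pass 1.
    total = {}
    for block in blocks:
        near_suspicious = any(
            other.index != block.index
            and abs(other.index - block.index) <= PROXIMITY_RANGE
            and base[other.index] >= NEIGHBOUR_THRESHOLD
            for other in blocks
        )
        total[block.index] = base[block.index] + (1 if near_suspicious else 0)
    return total


nightcrawler = [
    Block(812, "3X21, go ahead", 2),
    Block(813, "3X21, confirm address on the 211 at Bonhill Road.", 2),
]
wreck_it_ralph = [
    Block(1566, "You fixed it!", 1),
    Block(1580, "You fixed it!", 1),
]

for name, blocks in [("Nightcrawler", nightcrawler), ("Wreck-It Ralph", wreck_it_ralph)]:
    for index, warnings in count_warnings(blocks).items():
        verdict = "removed" if warnings >= REMOVE_THRESHOLD else "kept"
        print(f"{name} block {index}: {warnings} warnings, {verdict}")
```

Under these assumptions all four blocks end up with three warnings, which matches the removals reported above. The two Wreck-It Ralph blocks are 14 blocks apart, so moving them slightly further apart would drop the proximity warning.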

JackBailey commented 1 year ago

A few from my use:

          |                                         [---------Warning Blocks----------]
          |                                         90
          |                                         00:05:14,633 --> 00:05:16,347
          |                                         <i>Created
          |                                         by doing some tillage</i>
          |                                         reasons: (en_warn1, en_warn2)
          |                                         [---------------------------------]
          |                                         [---------Warning Blocks----------]
          |                                         124
          |                                         00:08:04,613 --> 00:08:06,824
          |                                         were virtually created by the ABA.
          |                                         reasons: (en_warn1, en_warn2)
          |                                         [---------------------------------]
          |                                         [---------Warning Blocks----------]
          |                                         453
          |                                         00:39:02,177 --> 00:39:04,470
          |                                         - Let's get this fixed right now.
          |                                         - It's fixed.
          |                                         reasons: (en_warn1, en_warn1)
          |                                         [---------------------------------]
          |                                         [---------Warning Blocks----------]
          |                                         342
          |                                         00:24:08,282 --> 00:24:10,708
          |                                         Had a case before
          |                                         O'Dwyer... uh, copyright.
          |                                         reasons: (en_warn7, global_warn4)
          |                                         [---------------------------------]
          |                                         [---------Warning Blocks----------]
          |                                         575
          |                                         00:39:53,374 --> 00:39:54,801
          |                                         That's copyright infringement.
          |                                         reasons: (en_warn7, global_warn4)
          |
          |                                         596
          |                                         00:41:11,850 --> 00:41:14,295
          |                                         Meanwhile, we have to counter
          |                                         the copyright injunction.
          |                                         reasons: (en_warn7, global_warn4)
          |
          |                                         603
          |                                         00:41:34,825 --> 00:41:37,453
          |                                         Okay, now, with the
          |                                         copyright infringement, I think we...
          |                                         reasons: (en_warn7, global_warn4)
          |                                         [---------------------------------]

Is there any way to remedy this? I'd like to use this tool, but removing things like this would make it unfeasible for me. The copyright mentions are due to it being a "law show".

KBlixt commented 1 year ago

@JackBailey

I'm sorry, but from what I understand these subtitle blocks aren't removed. They are simply warnings, meaning that they will not be removed.

Warnings are a way to bring attention to blocks that just barely weren't removed, to make it easier to see stuff that is close to being removed.

Or am I misunderstanding something here?

I have however removed the copyright regex in the English profile so that the word copyright is allowed to be present twice in a single block. That was left behind in the English profile by mistake.

But none of the blocks you provided would have been removed previously, and they will not appear even as warnings going forward.
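
To illustrate why a duplicated rule matters, here is a minimal sketch of one word triggering two warnings. The rule names are taken from the `reasons:` lines above, but the patterns themselves are hypothetical stand-ins; the real regexes live in subcleaner's config files and are not shown in this thread.

```python
import re

# Hypothetical patterns: the thread only shows that both rules fired on
# blocks mentioning "copyright", not what the rules actually look like.
RULES = {
    "global_warn4": re.compile(r"\bcopyright\b", re.IGNORECASE),
    "en_warn7": re.compile(r"\bcopyright\b", re.IGNORECASE),
}


def matching_rules(text, rules):
    """Return the names of all rules whose pattern matches the text."""
    return [name for name, pattern in rules.items() if pattern.search(text)]


text = "That's copyright infringement."

# With the duplicate in place, a single word produces two warnings.
before = matching_rules(text, RULES)

# With the duplicate rule dropped from the English profile, only one remains.
deduplicated = {name: p for name, p in RULES.items() if name != "en_warn7"}
after = matching_rules(text, deduplicated)

print(before)
print(after)
```

With both rules active the block collects two reasons from one word; after the duplicate is dropped it collects one.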

JackBailey commented 1 year ago

Oh right, okay, thank you. It was my misunderstanding.

KBlixt commented 1 year ago

No prob 🙂 And thanks for helping me find the duplicate regex 👍