KBlixt / subcleaner

removes ads from subtitle files cleanly.

Avoid detecting "lyrics" #38

Closed by frasderp 1 year ago

frasderp commented 1 year ago

I have noticed that lyrics in a show or movie trigger the similar content warning, among others, and that this has also resulted in incorrect removals. These blocks typically contain a music symbol (♪). Could that symbol be used to skip those blocks?

      |                                         [---------Warning Blocks----------]
      |                                         6
      |                                         00:00:22,435 --> 00:00:30,515
      |                                         ♪ It's Adventure Time ♪
      |                                         Ripped By mstoll
      |                                         reasons: (en_warn1, en_warn2)
      |                                         [---------------------------------]
KBlixt commented 1 year ago

I'm sorry, but that block should get removed in my opinion since it contains a ripper credit.

I wouldn't want to ignore this specific block, but if you can provide a false positive, meaning lyrics that accidentally receive a warning or even get removed, I promise I'll take a look at it.

frasderp commented 1 year ago

Sorry @KBlixt, that was a different example I had been meaning to share with you, so that you could add the ripper group to your regex!

I would suggest that when ♪ and similar_content occur together, the block be treated as lyrics; that may help avoid flagging music blocks.

Here is one that is being picked up due to similar content. It is also matched by my Spanish regex (which I shared in the discussion section) on the word mejores, which is Spanish for improves; I have since removed that term from the regex.

      |     [---------Removed Blocks----------]
      |     472
      |     00:21:14,013 --> 00:21:16,633
      |     ♪ ¿Las mamás son las mejores ♪
      |     reasons: (es_warn1, adjacent_ad, similar_content)
      |     
      |     473
      |     00:21:16,733 --> 00:21:20,603
      |     ♪ ¿Las mamás son las mejores,
      |     las mamás son las mejores ♪
      |     reasons: (es_warn1, es_warn1, similar_content)
      |     
      |     479
      |     00:21:30,873 --> 00:21:33,450
      |     ♪ ¿Las mamás son las mejores ♪
      |     reasons: (es_warn1, adjacent_ad, similar_content)
KBlixt commented 1 year ago

Ok, yes, I see what you mean. Seems like a fair exclusion from the similar content pattern. I'll take a look at it tomorrow 👍
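
For illustration, the exclusion discussed above could look roughly like the following. This is only a sketch and not subcleaner's actual code: the `Block` class, `is_lyrics`, `flag_similar_content`, and the set of music symbols are all hypothetical names, and "similar" is simplified here to "identical after lowercasing and collapsing whitespace". Only the similar_content check is skipped for lyric blocks, so other rules (such as a ripper-credit regex) would still apply to them.

```python
from dataclasses import dataclass, field

# Assumption: these symbols mark lyric lines; the thread only mentions "♪".
MUSIC_SYMBOLS = ("♪", "♫", "♬", "♩")


@dataclass
class Block:
    """Hypothetical stand-in for one subtitle block."""
    index: int
    content: str
    reasons: list = field(default_factory=list)


def is_lyrics(content: str) -> bool:
    """Return True if the block text contains any music symbol."""
    return any(symbol in content for symbol in MUSIC_SYMBOLS)


def normalize(content: str) -> str:
    """Lowercase and collapse whitespace so near-identical text compares equal."""
    return " ".join(content.lower().split())


def flag_similar_content(blocks: list) -> None:
    """Add 'similar_content' to repeated blocks, ignoring lyric blocks."""
    groups = {}
    for block in blocks:
        if is_lyrics(block.content):
            # Repeated choruses are expected, so lyrics never count as similar.
            continue
        groups.setdefault(normalize(block.content), []).append(block)

    for group in groups.values():
        if len(group) > 1:
            for block in group:
                block.reasons.append("similar_content")


if __name__ == "__main__":
    demo = [
        Block(472, "♪ ¿Las mamás son las mejores ♪"),
        Block(479, "♪ ¿Las mamás son las mejores ♪"),
        Block(500, "Visit example.com"),
        Block(501, "visit   example.com"),
    ]
    flag_similar_content(demo)
    for block in demo:
        print(block.index, block.reasons)
```

With this approach the repeated chorus lines are left alone, while repeated non-lyric text is still flagged.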