DrKain / subclean

A cross-platform CLI tool and node module to remove advertising from subtitles. Supports Bazarr and bulk cleaning!
MIT License
55 stars 5 forks

[Feature Request] Filter for chained/animated nodes #4

Closed DrKain closed 2 years ago

DrKain commented 3 years ago

This one is a bit trickier to handle and explain in text. Some subtitle uploaders have decided to add incredibly intrusive animated credits. They follow a similar format:

310
01:23:53,995 --> 01:23:54,470
S

311
01:23:54,470 --> 01:23:54,945
Su

312
01:23:54,945 --> 01:23:55,420
Sub

313
01:23:55,420 --> 01:23:55,895
Subt

314
01:23:55,895 --> 01:23:56,370
Subti

315
01:23:56,370 --> 01:23:56,845
Subtit

316
01:23:56,845 --> 01:23:57,320
Subtitl

317
01:23:57,320 --> 01:23:57,795
Subtitle

318
01:23:57,795 --> 01:23:58,270
Subtitles
U

319
01:23:58,270 --> 01:23:58,745
Subtitles By
Us

320
01:23:58,745 --> 01:23:59,220
Subtitles By
Use

321
01:23:59,220 --> 01:23:59,695
Subtitles By
User

322
01:23:59,695 --> 01:24:00,170
Subtitles By
Usern

323
01:24:00,170 --> 01:24:00,645
Subtitles By
Userna

324
01:24:00,645 --> 01:24:01,120
Subtitles By
Username

Right now subclean can handle nodes 319 to 324, but the preceding nodes (310 to 318) remain. A special handler will need to be written that scans for these chained nodes. I'll probably end up doing it one of two ways.

Option A:

  1. Advertising detected at node 319
  2. Check node 318 for partial match
  3. Continue checking (and removing) previous nodes until one fails to match
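
As a rough illustration of Option A, here is a minimal sketch of the backward walk. The `SubNode` shape and the function names `isPartialOf` and `collectChainBefore` are hypothetical and are not taken from subclean's code. It assumes "partial match" means the earlier node's text is a strict prefix of the node after it.

```typescript
// Hypothetical node shape for illustration; subclean's internal types may differ.
interface SubNode {
  id: number;
  text: string;
}

// True when `earlier` looks like a partially typed version of `later`
// (assumption: "partial match" means strict prefix after trimming).
function isPartialOf(earlier: string, later: string): boolean {
  const a = earlier.trim();
  const b = later.trim();
  return a.length > 0 && a.length < b.length && b.startsWith(a);
}

// Starting from a node already flagged as advertising, walk backwards and
// collect the ids of preceding nodes while each is a prefix of its successor.
function collectChainBefore(nodes: SubNode[], adIndex: number): number[] {
  const start = nodes[adIndex];
  if (!start) return [];
  const removed: number[] = [];
  let current = start.text;
  for (let i = adIndex - 1; i >= 0; i--) {
    const node = nodes[i];
    if (!node || !isPartialOf(node.text, current)) break;
    removed.push(node.id);
    current = node.text;
  }
  return removed;
}
```

The walk stops at the first node that is not a prefix of the one after it, so ordinary dialogue before the chain is left alone.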

Option B:

  1. Scan the entire file for these chained nodes
  2. Remove the entire chain regardless of the content
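
Option B could be sketched roughly as below. This is an assumption-laden example rather than subclean's implementation: `SubNode` and `findChains` are hypothetical names, a "chain" is taken to mean a run of consecutive nodes where each text starts with the previous text, and the `minLength` threshold of 4 is an arbitrary guess meant to reduce false positives.

```typescript
// Hypothetical node shape for illustration; subclean's internal types may differ.
interface SubNode {
  id: number;
  text: string;
}

// Scan the whole file and return the ids of every run of consecutive nodes
// in which each node's text strictly extends the previous node's text.
function findChains(nodes: SubNode[], minLength = 4): number[][] {
  const chains: number[][] = [];
  let run: SubNode[] = [];

  // Keep the current run only if it is long enough to count as a chain.
  const flush = (): void => {
    if (run.length >= minLength) chains.push(run.map((n) => n.id));
    run = [];
  };

  for (const node of nodes) {
    const prev = run[run.length - 1];
    const grows =
      prev !== undefined &&
      prev.text.trim().length > 0 &&
      node.text.length > prev.text.length &&
      node.text.startsWith(prev.text);
    if (!grows) flush();
    run.push(node);
  }
  flush();
  return chains;
}
```

Because this removes chains regardless of content, the length threshold is the only guard against deleting real dialogue that happens to repeat and grow (for example "No." followed by "No. Wait."), which is the main risk of this option.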

DrKain commented 3 years ago

Here is an example of these chained nodes in a subtitle file.
This is a manually cleaned version of the file showing the ideal outcome.

Eytan414 commented 2 years ago

I decided to implement option A because B seemed a bit too brute-force and might cause undesired effects. I've made a PR, but there's a caveat: it doesn't work on the provided example subtitle file. I think that's due to node #318, which is probably a rare case or a typo.

I haven't really worked with subtitle files before, so you'll probably have a better assessment than I do. Here are the relevant details:

317
01:23:57,320 --> 01:23:57,795
Subtitle

318
01:23:57,795 --> 01:23:58,270
Subtitles
U

319
01:23:58,270 --> 01:23:58,745
Subtitles By
Us 

My code would work if node 318's top row had " By" after "Subtitles", or if its second row ("U") didn't exist. I believe the subtitle uploader made a mistake in their ad.
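
The behaviour described here is consistent with a whole-text prefix check, although the PR's actual logic is not shown in this thread. The sketch below is only an illustration: `isPrefix` and `isLinewisePrefix` are hypothetical helpers, and the line-wise variant is one possible way to tolerate a node like 318, not something the PR is known to do.

```typescript
// Whole-text check: the earlier node must be a strict prefix of the later one.
function isPrefix(earlier: string, later: string): boolean {
  return earlier.length < later.length && later.startsWith(earlier);
}

// Hypothetical line-wise variant: every line of the earlier node must be a
// prefix of the line at the same position in the later node.
function isLinewisePrefix(earlier: string, later: string): boolean {
  const a = earlier.split("\n");
  const b = later.split("\n");
  if (earlier === later || a.length > b.length) return false;
  return a.every((line, i) => (b[i] ?? "").startsWith(line));
}

const node318 = "Subtitles\nU";
const node319 = "Subtitles By\nUs";

// Node 318 as written fails the whole-text check against node 319.
const wholeText = isPrefix(node318, node319);
// The two variants mentioned above (" By" added, or the second row dropped) pass it.
const withBy = isPrefix("Subtitles By\nU", node319);
const withoutSecondRow = isPrefix("Subtitles", node319);
// The line-wise variant accepts node 318 unchanged.
const lineWise = isLinewisePrefix(node318, node319);
```

A line-wise check is more permissive, so it would also be more likely to match unrelated multi-line dialogue.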

DrKain commented 2 years ago

Node 318 was probably a mistake on my end. I had no examples on hand, so I wrote that one out myself.

Linking to #20