DrKain / subclean

A cross-platform CLI tool and node module to remove advertising from subtitles. Supports Bazarr and bulk cleaning!
MIT License

Chain removal with extra flexibility for character "gaps" and no filters #22

Open DanAykroyd256 opened 1 year ago

DanAykroyd256 commented 1 year ago

Hi @DrKain, thanks first of all for your great tool! I'm using it with Bazarr and it works great!

I recently found out about the --nochains option, and I had one subtitle with a chain to try it on. It removed part of the chain but stopped early, because the "animation" the ad tries to pull off changes two characters at once from one line to the next, so the tool didn't treat the whole thing as one chain. I think it's similar to the example cited in the original chain-removal request, where the OP mistyped the example and the removal didn't completely work.

This is the chain I have. Using the "mobile" text as an ad to remove, I got the result "[Match] Chain found at 9-24 (mobile - +919815899536)". As you can see, more than one character changes between lines 8 and 9, which is why the chain wasn't completely removed.

In any case, I could only have caught this manually, because there is no clear ad word to filter on. So, my questions are:

1. Would it be possible to remove ANY chain, as was also suggested in the original request, so that any crazy chain like this could be detected without any match from the filters?
2. Would it be possible to have a kind of "threshold", so the chain can continue even if there is a difference of more than one character between the lines?

Thanks again and have a great day!

Example Subtitles

```
1
00:00:02,340 --> 00:00:02,540
©

2
00:00:02,540 --> 00:00:02,740
©

3
00:00:02,740 --> 00:00:02,940
© P

4
00:00:02,940 --> 00:00:03,140
© P@

5
00:00:03,140 --> 00:00:03,340
© P@r

6
00:00:03,340 --> 00:00:03,540
© P@rM

7
00:00:03,540 --> 00:00:03,740
© P@rM!

8
00:00:03,740 --> 00:00:03,940
© P@rM!N

9
00:00:03,940 --> 00:00:04,140
© P@rM! Nd

10
00:00:04,140 --> 00:00:04,340
© P@rM! Nde

11
00:00:04,340 --> 00:00:04,540
© P@rM! NdeR

12
00:00:04,540 --> 00:00:04,740
© P@rM! NdeR

13
00:00:04,740 --> 00:00:04,940
© P@rM! NdeR M

14
00:00:04,940 --> 00:00:05,140
© P@rM! NdeR M@

15
00:00:05,140 --> 00:00:05,340
© P@rM! NdeR M@n

16
00:00:05,340 --> 00:00:05,540
© P@rM! NdeR M@nk

17
00:00:05,540 --> 00:00:05,740
© P@rM! NdeR M@nkÖ

18
00:00:05,740 --> 00:00:05,940
© P@rM! NdeR M@nkÖÖ

19
00:00:05,940 --> 00:00:06,140
© P@rM! NdeR M@nkÖÖ

20
00:00:06,140 --> 00:00:07,340
© P@rM! NdeR M@nkÖÖ ™

21
00:00:07,540 --> 00:00:08,340
© P@rM! NdeR M@nkÖÖ ™

22
00:00:08,540 --> 00:00:09,340
© P@rM! NdeR M@nkÖÖ ™

23
00:00:09,540 --> 00:00:10,340
© P@rM! NdeR M@nkÖÖ ™

24
00:00:10,340 --> 00:00:11,340
© P@rM! NdeR M@nkÖÖ ™ Mobile - +919815899536
```

DrKain commented 1 year ago

Hi, thanks for the feedback and detailed issue.

> Would it be possible to do a removal of ANY chain, as was also suggested in the original request, so any crazy chain like this could be detected without any match from the filters?

This is definitely possible and a good suggestion. I'm currently away from home and won't be able to add this anytime soon, but I'll leave this issue open in case someone wants to take a shot at it while I'm away. If not, I'll work on adding it when I'm next available.
It's worth noting that this would risk incorrect cleans when a line is repeated by one or more people during a scene.
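
To make that trade-off concrete, here is a minimal TypeScript sketch of filter-free chain detection. It is not subclean's implementation: `SubNode`, `toNodes`, and `findPrefixChains` are hypothetical names, and the rule used (each line must begin with the full text of the previous line, for at least three nodes in a row) is an assumption chosen for illustration.

```typescript
// Hypothetical node shape for illustration; subclean's real types may differ.
interface SubNode {
  id: number;
  text: string;
}

// Build numbered nodes from plain text lines (ids start at 1, like SRT indexes).
function toNodes(texts: string[]): SubNode[] {
  return texts.map((text, i) => ({ id: i + 1, text }));
}

// Find runs of consecutive nodes where every text begins with the previous text.
// Returns [firstId, lastId] for each run of at least `minLength` nodes.
function findPrefixChains(nodes: SubNode[], minLength = 3): Array<[number, number]> {
  const chains: Array<[number, number]> = [];
  let start = 0;
  for (let i = 1; i <= nodes.length; i++) {
    const prev = nodes[i - 1].text;
    // The run continues only if the next text starts with the previous one.
    const continues = i < nodes.length && nodes[i].text.slice(0, prev.length) === prev;
    if (!continues) {
      if (i - start >= minLength) {
        chains.push([nodes[start].id, nodes[i - 1].id]);
      }
      start = i;
    }
  }
  return chains;
}

const animatedAd = toNodes(["©", "© P", "© P@", "© P@r"]);
const repeatedDialogue = toNodes(["No.", "No.", "No."]);

console.log(findPrefixChains(animatedAd)); // [[1, 4]]
console.log(findPrefixChains(repeatedDialogue)); // [[1, 3]]
```

The second result is the risk described above: with no filter match required, three characters repeating "No." look exactly like a chain. Excluding identical lines would avoid that, but the example subtitles also repeat lines inside the real chain (nodes 1-2 and 20-23), so that rule alone would split the ad.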

> Would it be possible to have kind of a "threshold", to be able to continue the chain even if there is a difference of more than one character between the lines?

Also yes, but nodes 8-9 differ by 3 characters, and a fuzzy match would also risk breaking valid lines. I try to keep the rules as strict as possible to avoid removing valid subtitles, so I'll need to look into this more.
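
One way to picture such a threshold is sketched below in TypeScript. The measure is an assumption, not necessarily how subclean counts differences: `prefixGap` (a hypothetical helper) strips the shared leading characters from both lines and returns the length of the longer remainder. On the text of nodes 8 and 9 as posted in this thread, that measure happens to give 3.

```typescript
// Number of leading characters the two strings share.
function commonPrefixLength(a: string, b: string): number {
  const max = Math.min(a.length, b.length);
  let i = 0;
  while (i < max && a.charAt(i) === b.charAt(i)) {
    i++;
  }
  return i;
}

// Assumed "gap" measure: drop the shared prefix from both lines and
// return the length of the longer leftover.
function prefixGap(prev: string, next: string): number {
  const shared = commonPrefixLength(prev, next);
  return Math.max(prev.length - shared, next.length - shared);
}

// Two lines stay in the same chain when their gap is within the threshold.
function linksWithinThreshold(prev: string, next: string, threshold: number): boolean {
  return prefixGap(prev, next) <= threshold;
}

console.log(prefixGap("© P@rM!", "© P@rM!N")); // 1 (nodes 7 -> 8)
console.log(prefixGap("© P@rM!N", "© P@rM! Nd")); // 3 (nodes 8 -> 9)
console.log(linksWithinThreshold("© P@rM!N", "© P@rM! Nd", 1)); // false
console.log(linksWithinThreshold("© P@rM!N", "© P@rM! Nd", 3)); // true
console.log(linksWithinThreshold("Go.", "Good.", 3)); // true
```

The last line shows why a loose threshold is risky: a value large enough to bridge nodes 8-9 also links two short, unrelated lines of dialogue.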

For now, the nodes you have can be cleaned simply with: /^©/
This looks for lines starting with © and removes them. I've added it to the main filters, so you can run subclean --update to fetch them.
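
As a rough illustration of what that pattern matches, here is a small TypeScript sketch. `removeMatching` is a hypothetical helper rather than part of subclean, and how subclean applies its filter list internally is not shown in this thread.

```typescript
// The pattern from above: matches only when © is the first character of the line.
const startsWithCopyright = /^©/;

// Hypothetical helper: keep every line the pattern does not match.
function removeMatching(lines: string[], pattern: RegExp): string[] {
  return lines.filter((line) => !pattern.test(line));
}

const sampleLines = ["© P@rM!", "Hello there.", "Price: 5 ©"];
console.log(removeMatching(sampleLines, startsWithCopyright));
// ["Hello there.", "Price: 5 ©"]
```

Because of the `^` anchor, a © later in a line is left alone, so only lines that open with the symbol are removed.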

DrKain commented 1 year ago

Linking this to #20 and #4, as they are related.

DanAykroyd256 commented 1 year ago

Thanks for your quick reply @DrKain and your consideration for improving this! I agree that handling all these edge cases might get crazy :)

I’ll leave the .srt from my example here, in case you want to use it when you look into this in the future.

Antiviral (2012).en.srt.txt

Have a great week!