KBlixt / subcleaner

removes ads from subtitle files cleanly.
284 stars 12 forks

Create Dutch config #25

Closed 854562 closed 1 year ago

854562 commented 1 year ago

This is my attempt at a config file for Dutch subtitles. I took the Swedish config as a starting point, and added some common Dutch ads, release groups, and some names of both community and professional subtitlers.

I am by no means a regex expert (in fact this is my first real dive into the matter), but I have had it do a few dry runs and it seems to be working well.

One thing: I am not really sure what kind of lines "^(teams?|the)$" is supposed to target, but since it is present in both the English and Swedish config files I left it in there.
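For reference, a minimal sketch of what that pattern matches when taken in isolation. It assumes the regex is applied case-insensitively to one whole subtitle line at a time; that is an assumption for illustration, not something confirmed about how subcleaner evaluates it, and the sample lines are made up.

```python
import re

# Pattern quoted above. IGNORECASE is an assumption made for this sketch.
pattern = re.compile(r"^(teams?|the)$", re.IGNORECASE)

def is_whole_line_match(line: str) -> bool:
    """True only if the stripped line is exactly 'team', 'teams' or 'the'."""
    return pattern.search(line.strip()) is not None

# Made-up sample lines.
samples = ["Team", "teams", "The", "the team", "The End", "theatre"]
results = {s: is_whole_line_match(s) for s in samples}

for sample, matched in results.items():
    print(f"{sample!r}: {matched}")
```

Because of the `^` and `$` anchors, only lines consisting of nothing but one of those words match, which would fit an ad whose text is split over several short lines.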

Looking forward to your comments!

KBlixt commented 1 year ago

Good morning 854562,

Very nice! I'll take a look at this after work today! In fact I've been in contact with another Dutch-speaking user and have a candidate Dutch regex from him as well. I'll send it to you if you'd like, but your version looks more fleshed out and follows how I'd expect new language profiles to be created.

Yeah, the teams|the line is supposed to target some specific ads, but I believe I've moved that regex to the global profile in my latest uncommitted version.

KBlixt commented 1 year ago

So this is looking really good, I'm very impressed. I'll add them to the defaults whenever you want. Some stuff that I've changed in the Swedish one that you based this on:

"celebritysizes" is actually a legit site. I added it to the purge list since I just assumed it was spam, but as it turns out they mention the site in some episode of "The Office".

The "^(teams?|the)$" will be moved to the global warning in the next regex version. You can leave it in there and I'll modify it when I push next time.

Sartre is a name mentioned in some series, so I've moved that one to the global warning regex as well. Again, you can leave it be and I'll move it later.

nl_purge10: I've found this purge to be a bit too aggressive. Maybe that's not the case in Dutch, but some names, like companies or such, could contain these words, and sometimes they translate the name but sometimes they leave it as is. For the Swedish profile I put it in the warning section twice; these words should then receive 2 warnings if they ever appear and would be very close to being deleted.
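A rough sketch of the "list it twice" idea described above, assuming each warning regex that hits a block adds one warning. The key names and the deletion threshold below are hypothetical and are not taken from subcleaner; only the `rip(ped)?` pattern appears in this thread, and the word boundaries around it are added for the example.

```python
import re

# Hypothetical warning entries: the same pattern listed under two keys.
# The key names are illustrative only.
warning_regex = {
    "nl_warn_a": r"\brip(ped)?\b",
    "nl_warn_b": r"\brip(ped)?\b",
}

# Assumed value for illustration; not subcleaner's real threshold.
ASSUMED_DELETE_THRESHOLD = 3

def count_warnings(block_text: str) -> int:
    """Count how many warning entries match somewhere in the block."""
    return sum(
        1
        for pattern in warning_regex.values()
        if re.search(pattern, block_text, re.IGNORECASE)
    )

def verdict(block_text: str) -> str:
    """Classify a block as 'delete', 'warn' or 'keep' from its warning count."""
    warnings = count_warnings(block_text)
    if warnings >= ASSUMED_DELETE_THRESHOLD:
        return "delete"
    return "warn" if warnings > 0 else "keep"

print(verdict("Ripped by someone"))       # two warnings, below the assumed threshold
print(verdict("He read the script"))      # no warnings
```

Under these assumptions a duplicated entry leaves the block one further warning away from removal instead of purging it outright.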

I'm not sure if you've seen some of the debug options that the script has. Specifically --dry-run, --explain and --end-report are very useful for finding poor results and fixing them.

--dry-run is self-explanatory.

--explain will put a row under each removed/warned block and tell you which regex it matched or which other warnings it received from the script.

--end-report prints all unique blocks that got removed or warned. It orders the deleted blocks starting from the fewest blocks removed, since false positives are most likely going to have unique content, and it orders the warning blocks starting from the most blocks warned, since blocks with the same content in multiple different files are likely to be false negatives.

--sensitive logs all blocks adjacent to ads, plus the first/last block in the file, as warnings regardless of how many warnings they actually got. It won't affect the result.
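The ordering described for --end-report can be sketched roughly like this. It is an illustration of the idea only, not the script's actual code; the function name and the sample block texts are made up.

```python
from collections import Counter

def order_report(removed_blocks, warned_blocks):
    """Order unique removed blocks by ascending count and
    unique warned blocks by descending count."""
    removed_counts = Counter(removed_blocks)
    warned_counts = Counter(warned_blocks)
    # Rare removed content first: likely false positives.
    removed_sorted = sorted(removed_counts.items(), key=lambda item: item[1])
    # Frequently warned content first: likely false negatives.
    warned_sorted = sorted(
        warned_counts.items(), key=lambda item: item[1], reverse=True
    )
    return removed_sorted, warned_sorted

# Made-up block contents collected across several files.
removed = ["Subtitles by X"] * 3 + ["Rest in peace"]
warned = ["Vertaling: Y"] * 2 + ["Hello there"]

removed_sorted, warned_sorted = order_report(removed, warned)
print(removed_sorted)
print(warned_sorted)
```

Here the one-off removed block surfaces at the top of the removed list for review, while the warned text that repeats across files surfaces at the top of the warning list.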

I'm also going to add the translated "season xx episode xx" to my Swedish regex profile, that was a good idea!

Thank you so much, this is gonna be great!

854562 commented 1 year ago

Thanks for the detailed reply!

As I was working on a small update anyway, I took the liberty to incorporate your suggestions right away.

I agree that the risk of false positives is relatively high having those English words on the purge list. I've seen it target some English song lyrics, for example, which are often left untranslated in subtitles. I moved the entire line to warn and duplicated it (with some exceptions which I added to nl_warn4 instead, as they are more common in Dutch); I hope that's the way you had in mind. Small question though: rip(ped)? and (re-?)synch?(ed|ro(nized)?)? appear four times now; I am not sure if they were supposed to be duplicated in the first place, and if they were, what's the intention behind that?

As for the other person making a regex list, I am curious to see what he came up with, as there is bound to be stuff I missed or haven't thought of. If he's okay with it I am happy to see if I can merge it with my config.

Thank you for the debug suggestions! I had already discovered dry-run and explain, but sensitive has been a big help finding more translator names to add.

If the config passes all your requirements I think it's ready to be added to the defaults.

KBlixt commented 1 year ago

really excellent work!

Here is the config that I've seen from the other guy. I had a look, but yours didn't seem to miss anything major compared to his. Note I've renamed it to .txt since I can't upload .conf to GitHub for some reason.

dutch.txt

I'll take a look at those things you mentioned, sounds like something I missed.

854562 commented 1 year ago

Thanks! Not bad, definitely someone who's familiar with the Dutch subtitling scene. Looks like the vast majority of it would be caught by either my config or the global one, though. However, there are some ads and names in there that I don't have; I'll borrow those for when I update my config. Please thank them for their efforts!

Edit: I was wrong about one of the duplicates: (re-?)synch?(ed|ro(nized)?)? and (re-?)?synch?(ed|ro(nized)?) are very much different; didn't catch that.
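A quick way to see the difference between the two patterns, sketched in Python. Using `fullmatch` on single words is a choice made for this comparison; how subcleaner applies the patterns to subtitle text is not shown here, and the variable names are made up.

```python
import re

# The two patterns quoted above.
# First: the "re" prefix is required, the ending is optional.
prefix_required = re.compile(r"(re-?)synch?(ed|ro(nized)?)?")
# Second: the "re" prefix is optional, the ending is required.
suffix_required = re.compile(r"(re-?)?synch?(ed|ro(nized)?)")

def full(pattern, word):
    """True if the pattern matches the entire word."""
    return pattern.fullmatch(word) is not None

words = ["sync", "resync", "re-sync", "synced", "synchronized", "resynced"]
comparison = {
    w: (full(prefix_required, w), full(suffix_required, w)) for w in words
}

for word, (first, second) in comparison.items():
    print(f"{word:13} first={first!s:5} second={second}")
```

So the first pattern covers "resync" and "re-sync", the second covers "synced" and "synchronized", both cover "resynced", and neither matches a bare "sync".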

KBlixt commented 1 year ago

Sounds good 👍

If you ever make pull requests I'll accept them whenever I see them; you won't need to defend them if you don't want to.