DrKain / subclean

A cross-platform CLI tool and node module to remove advertising from subtitles. Supports Bazarr and bulk cleaning!
MIT License

[Bug] Issue with french characters: é, è, ç #33

Closed Gryzounours closed 7 months ago

Gryzounours commented 1 year ago

After running subclean on .srt files containing é, è, ç and many other accented characters, they are replaced by strange � characters. Is there a way to solve it?

Thanks a lot

DrKain commented 1 year ago

Thanks for reporting, do you have a sample of the subtitle file before being modified so I can look into this?
The version number would also help.

Gryzounours commented 1 year ago

Hey, here it is dh1998.zip

version 1.5.0

DrKain commented 1 year ago

Thanks for the sample. The issue appears to be caused by the file encoding (related: #8). The file you provided is ANSI, and the output, once cleaned, is UTF-8.

If you're using Bazarr you can enable an option in settings to automatically convert these files to UTF-8 without breaking characters:
Settings → Subtitles → Post-Processing → Encode Subtitles To UTF8

Or you can fix the current file in Notepad++ by simply clicking Encoding → "Convert to UTF-8".

There is an open issue for this #8 that will be resolved when I get the time. Sorry for the inconvenience.
I will leave this issue open until the linked issue is closed.

DrKain commented 1 year ago

Here's the cleaned file you provided with the correct format: dh1998-utf8.zip

Gryzounours commented 1 year ago

I don't use Bazarr, I use: https://github.com/Valyreon/Subloader

Great little tool.

Can't wait until you fix it ;)

Have a nice day

Arecsu commented 7 months ago

Whoops, I think this bug is sort of critical. I made the mistake of downloading the latest binary and confidently running the tool against my whole HTPC library. I don't have Bazarr; I used the tool directly.

Those Spanish subtitles that were not encoded as UTF-8 ended up full of weird characters and were overwritten in place. Now I'm trying to figure out which subtitles were corrupted in the process and search for them again :/

Great tool by the way, it does the job! This bug killed 30% of my library, though 🥲

DrKain commented 7 months ago

Sorry to hear that @Arecsu. I'll try to prioritize a fix when I can. I've been in and out of hospital for the last few months, so I haven't had a lot of time to work on this.

DrKain commented 7 months ago

If anyone has more problems with the latest version, please open a new issue.

DrKain commented 7 months ago

Final comment just for comparison so I don't forget it later on.

[screenshot comparison]

Arecsu commented 7 months ago

> Sorry to hear that @Arecsu. I'll try to prioritize a fix when I can. I've been in and out of hospital for the last few months, so I haven't had a lot of time to work on this.

Hey! I didn't know you were going through a difficult time, I hope you and everyone else is doing well now 🙏

I went to sleep, and just woke up to find out you've managed to fix the issue. Wooow. Highly appreciate it. Will test it later. Thank you so much!!

Arecsu commented 7 months ago

Hey! Re-opening this issue because there are still encoding issues.

Here is the source:

1
00:02:10,058 --> 00:02:11,530
- Howard.
- Buenos días.

2
00:02:11,600 --> 00:02:14,493
- Entrega de McGill.
- ¿Qué haces aquí?

3
00:02:14,563 --> 00:02:16,995
No te he visto.
Quise ver cómo estabas.

Here is the result:

1
00:02:10,058 --> 00:02:11,530
- Howard.
- Buenos d}as.

2
00:02:11,600 --> 00:02:14,493
- Entrega de McGill.
- }Qu} haces aqu}?

3
00:02:14,563 --> 00:02:16,995
No te he visto.
Quise ver c}mo estabas.

This is the log:

[Info] Encoding: cp1252, Language: spanish
[Info] Language is spanish, using ascii

The source encoding is indeed cp1252, but it then seems to use ASCII to process the subtitles. ASCII doesn't have the needed characters. Hmm, wouldn't it be better to process the files using UTF-8?
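That diagnosis matches how Node's `'ascii'` decoder behaves: it unsets the high bit of every byte, so anything outside 7-bit ASCII is silently mapped to a different character. A small sketch of the corruption (not subclean's actual code path):

```javascript
// Node's 'ascii' decoder unsets the high bit of each byte, so any
// byte >= 0x80 (all the accented cp1252 characters) is silently
// replaced by a different 7-bit character.
const src = Buffer.from([0x64, 0xED, 0x61, 0x73]); // "días" in cp1252

console.log(src.toString('ascii'));  // "dmas": 0xED ('í') loses its high bit
console.log(src.toString('latin1')); // "días": latin1 matches cp1252 here
```

latin1 works in this example because cp1252 only differs from latin1 in the 0x80–0x9F range, which plain accented letters don't use.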

Here is the file attached in TXT format: example.txt

Thank you so much!

DrKain commented 7 months ago

Thanks for reporting, I'll look more into this later on in the week.
The different encodings can be tricky. UTF-8 was the original target, but that was breaking some subtitles, as reported in this issue. Some encodings will break the parser, meaning the tool can't read each node and process the text, which is why I originally settled on UTF-8. I'll work something out eventually.
Thanks for the example file too.
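To make the "breaks the parser" point concrete, here's a small reproduction (not subclean's actual parser) of how reading a UTF-16LE file as UTF-8 stops a timestamp regex from matching:

```javascript
// A UTF-16LE subtitle read as UTF-8 interleaves NUL bytes between the
// ASCII characters, so a timestamp regex no longer matches and a
// cleaner would find zero cues in the file.
const line = '00:02:10,058 --> 00:02:11,530';
const utf16 = Buffer.from(line, 'utf16le');
const timestamp = /\d{2}:\d{2}:\d{2},\d{3} --> /;

console.log(timestamp.test(utf16.toString('utf16le'))); // true
console.log(timestamp.test(utf16.toString('utf8')));    // false: NULs split every digit
```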

DrKain commented 7 months ago

If you have the time, I've whipped up a test build if you want to give it a shot. The file is too large for GitHub chat, so I had to use Dropbox. This changes the default encoding to utf16le and lets you pass your own encoding using --encoding utf8 to see what I mean about the broken characters. I'll still look into this more when I get the time, but it has been a very busy week. Thanks for reporting the issue.

This is no longer the case: the --encoding and --encodefile parameters will be removed in the next update, as 1.7.0 added support for a bunch more formats, so they should not be required anymore.

Gryzounours commented 7 months ago

Let's imagine a test case with a lot of French and Spanish subtitles in a folder and various subfolders: some of them are encoded as UTF-8, others as ANSI. If we run subclean --sweep, what would happen? Would it convert the files to UTF-8 first and then run the cleaning algorithm?

DrKain commented 7 months ago

The --sweep rule will still obey all regular parameters, so the character encoding will be changed. Currently the best way I've found to convert files is with Notepad++: open the file, click "Encoding" at the top, then click "Convert to UTF-8".

Clearly there's something odd going on with how Node.js handles character encodings, so I'll need to look into this more when I get the time. Or, if another user wants to fork the repo and take a shot, they're more than welcome.

DrKain commented 7 months ago

Took 3-4 hours, but I think I've finally fixed the issue; I'm publishing a new version in a few minutes. If another encoding error pops up that is not supported, please open a new issue, as I'll need to add custom support for the unique cases.

Thank you all for providing subtitles to test with too, they helped a lot.