CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

[BUG] A mix of 8-bit/16-bit chars sent to iconv #1451

Open erankor opened 2 years ago

erankor commented 2 years ago

Necessary information

Video links

http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts

Additional information

Hi all,

I have a TS file with 708 subtitles in Japanese and Chinese that fails to decode properly. After some debugging, I found that it decodes properly if I patch the function write_utf16_char (https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113) to always output 2-byte chars (I changed the if to if (1)) and specify an encoding of UTF-16BE.

This code looks off to me: it produces a mix of 8-bit and 16-bit chars with no clear encoding (it's neither UTF-8 nor UTF-16). Maybe when iconv is used, the function should always output 2-byte chars? Alternatively, it could use 2 bytes for ALL chars whenever ANY char doesn't fit in 1 byte. That would also be OK, but it sounds more complex to implement.
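To make the concern concrete, here is a minimal Python sketch (not the project's C code) that models a writer behaving as described above: one byte for values that fit in 8 bits, two big-endian bytes otherwise. The names write_char, encode and the always_two_bytes flag are hypothetical and exist only for this illustration; the real write_utf16_char may differ in detail.

```python
def write_char(code_unit: int, always_two_bytes: bool) -> bytes:
    """Model of the writer described in this issue (an assumption, not the real C source)."""
    if always_two_bytes or (code_unit >> 8) != 0:
        # Two bytes, most significant first (UTF-16BE layout for BMP characters).
        return bytes([code_unit >> 8, code_unit & 0xFF])
    # Single byte for values below 0x100: this is what creates the mixed stream.
    return bytes([code_unit])


def encode(text: str, always_two_bytes: bool) -> bytes:
    # Handles only BMP characters, which is enough for this illustration.
    return b"".join(write_char(ord(ch), always_two_bytes) for ch in text)


sample = "A 知"
mixed = encode(sample, always_two_bytes=False)
fixed = encode(sample, always_two_bytes=True)

print(mixed.hex(" "))  # 41 20 77 e5
print(fixed.hex(" "))  # 00 41 00 20 77 e5

# A UTF-16BE decoder pairs bytes up, so the two 1-byte chars get fused into one unit.
print(mixed.decode("utf-16-be") == sample)  # False
print(fixed.decode("utf-16-be") == sample)  # True
```

In the mixed stream, "A" and the space are read together as the single code unit 0x4120, so the text is silently corrupted rather than rejected. Always emitting two bytes gives a stream that a UTF-16BE consumer such as iconv can interpret unambiguously.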

Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.

Thanks!

Eran

PunitLodha commented 2 years ago

Could you share the output of ccextractor --version?

erankor commented 2 years ago

./ccextractor --version
CCExtractor 0.89, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.89
        Git commit: b793f16343dc442bcb977387fcef08195e71dd7c
        Compilation date: 2022-08-23
        File SHA256: 259ccd18d508a3aed03149080853f98d1bce57672ce20c9b715953227621c9d9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

PunitLodha commented 2 years ago

You are using version 0.89. Could you try the latest version (0.94)?

erankor commented 2 years ago

I reverted my change and pulled the latest master. It now decodes something (which is better than the previous version, IIRC), but every space in the text still throws the decoding off, and I get some non-printable chars in the output.

Output without any code changes:

1
00:00:01,068 --> 00:00:03,770
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

Output after forcing write_utf16_char to always use 2 bytes:

1
00:00:01,068 --> 00:00:03,770
人々が私を知 ったとき、私は 時間管理につい て書いています

I don't speak Japanese myself :) but google translate can confirm the fixed version is better.
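The garbled output can be checked against the diagnosis with a short Python sketch. It takes the fixed caption text, writes each space as a single byte and everything else as two big-endian bytes (the behaviour suspected in this issue, modelled by the hypothetical helper mixed_width), and then decodes the result as UTF-16BE. This is an illustration under that assumption, not a trace of what CCExtractor or iconv actually does internally.

```python
def mixed_width(text: str) -> bytes:
    # Hypothetical model of the reported behaviour:
    # 1 byte for values below 0x100, otherwise 2 bytes big-endian.
    return b"".join(
        ord(ch).to_bytes(2 if ord(ch) > 0xFF else 1, "big") for ch in text
    )


caption = "人々が私を知 ったとき、私は 時間管理につい て書いています"

raw = mixed_width(caption)
# errors="replace" keeps going past the dangling final byte instead of raising.
garbled = raw.decode("utf-16-be", errors="replace")

print(len(raw) % 2)  # 1: an odd byte count cannot be valid UTF-16
print(garbled)

# Writing every character as 2 bytes round-trips cleanly.
print(caption.encode("utf-16-be").decode("utf-16-be") == caption)  # True
```

Under this model, the text stays intact up to the first space, goes out of alignment by one byte, realigns at the second space, and breaks again at the third. That appears to line up with the unpatched output reported above, including the U+F830 code point.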

Current version -

./ccextractor --version
CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.94
        Git commit: 4cb474c5a36b61bafec4a2379c4d0b240e44359b
        Compilation date: 2022-08-24
        CEA-708 decoder: C
        File SHA256: 8fd4f5625eb6aadb30532a2ff9f29adaec4b60a77916e3f001d5f4e59d4d08e9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi
PunitLodha commented 2 years ago

You could send a PR. If it doesn't cause any issues with the other tests, we can merge it.

ArchitBhonsle commented 1 year ago

Was this fixed? I could make a simple pull request with the specified changes.

cfsmp3 commented 1 year ago

> Was this fixed? I could make a simple pull request with the specified changes.

Probably not if it's still open :-) Feel free to give it a shot.

prateekmedia commented 1 year ago

Created a PR: #1571