Open erankor opened 2 years ago
Could you share the output of ccextractor --version?
./ccextractor --version
CCExtractor 0.89, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
Version: 0.89
Git commit: b793f16343dc442bcb977387fcef08195e71dd7c
Compilation date: 2022-08-23
File SHA256: 259ccd18d508a3aed03149080853f98d1bce57672ce20c9b715953227621c9d9
Libraries used by CCExtractor
Tesseract Version: 3.03
Leptonica Version: leptonica-1.70
libGPAC Version: 1.0.1
zlib: 1.2.11
utf8proc Version: 2.4.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.37
FreeType
libhash
nuklear
libzvbi
You are using version 0.89. Could you try the latest version (0.94)?
Reverted my change and pulled latest master, it is decoding stuff (which is better than previous version IIRC...), but still every space in the text messes it up, and I get some non-printable chars in the output.
Output without any code changes - 1 00:00:01,068 --> 00:00:03,770 人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰
Output after forcing write_utf16_char to always use 2 bytes per char - 1 00:00:01,068 --> 00:00:03,770 人々が私を知 ったとき、私は 時間管理につい て書いています
I don't speak Japanese myself :) but google translate can confirm the fixed version is better.
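The garbled characters line up with a byte-alignment slip, which a small sketch can show. It assumes the space after 知 was written as a single byte (0x20) while each Japanese character was written as two bytes, and that the consumer reads the stream as UTF-16BE. The function and array names are made up for illustration and are not part of CCExtractor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read big-endian 16-bit code units from a byte buffer, the way a
 * UTF-16BE consumer would. Returns the number of complete units stored
 * in `units`; a trailing odd byte is ignored. */
static size_t read_utf16be_units(const uint8_t *bytes, size_t len,
                                 uint16_t *units, size_t max_units)
{
    assert(bytes != NULL && units != NULL);
    size_t n = 0;
    for (size_t i = 0; i + 1 < len && n < max_units; i += 2)
        units[n++] = (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
    return n;
}

/* U+77E5, space, U+3063, U+305F with the space written as ONE byte. */
static const uint8_t mixed_width[] = {0x77, 0xE5, 0x20, 0x30, 0x63, 0x30, 0x5F};

/* The same text with the space written as TWO bytes (0x00 0x20). */
static const uint8_t fixed_width[] = {0x77, 0xE5, 0x00, 0x20,
                                      0x30, 0x63, 0x30, 0x5F};
```

Reading mixed_width yields the units 0x77E5, 0x2030, 0x6330, i.e. 知‰挰, which is how the broken output starts. Reading fixed_width yields 0x77E5, 0x0020, 0x3063, 0x305F, the intended text. The next single-byte space happens to shift the stream back into alignment, which would explain why only some segments are garbled.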
Current version -
./ccextractor --version
CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
Version: 0.94
Git commit: 4cb474c5a36b61bafec4a2379c4d0b240e44359b
Compilation date: 2022-08-24
CEA-708 decoder: C
File SHA256: 8fd4f5625eb6aadb30532a2ff9f29adaec4b60a77916e3f001d5f4e59d4d08e9
Libraries used by CCExtractor
Tesseract Version: 3.03
Leptonica Version: leptonica-1.70
libGPAC Version: 1.0.1
zlib: 1.2.11
utf8proc Version: 2.4.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.37
FreeType
libhash
nuklear
libzvbi
You could send a PR. If it doesn't cause any issues with the other tests, then we can merge it.
Was this fixed? I could make a simple pull request with the specified changes.
Probably not if it's still open :-) Feel free to give it a shot.
Created a PR: #1571
Necessary information
./ccextractor test.ts -svc all[UTF-16BE] -nofc -12
Video links
http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts
Additional information
Hi all,
I have a TS file with 708 subtitles in Japanese & Chinese that failed to decode properly. After some debugging, I found that if I patch the function write_utf16_char here - https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113 to always output 2-byte chars (I changed the if to if (1)), and I specify an encoding of UTF-16BE, it decodes properly.
This code looks off to me, as it creates a mix of 8-bit & 16-bit chars with no clear encoding (it's not UTF-8 and it's not UTF-16...). Maybe when iconv is used, the function should always output 2-byte chars? Or, alternatively, if it used 2 bytes for ALL chars whenever ANY char doesn't fit in 1 byte, that would also be ok (but this sounds more complex to do...).
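To make the "always output 2-byte chars" idea concrete, here is a minimal sketch. It is not the actual write_utf16_char code; put_utf16be_unit is a hypothetical helper, and it assumes the output is meant to be UTF-16BE with each character fitting in one 16-bit code unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the CCExtractor function: store one 16-bit
 * code unit in `out` as exactly two bytes, high byte first (UTF-16BE),
 * even when the high byte is zero. Returns the number of bytes written. */
static size_t put_utf16be_unit(uint16_t unit, uint8_t *out)
{
    assert(out != NULL);
    out[0] = (uint8_t)(unit >> 8);   /* high byte, 0x00 for ASCII */
    out[1] = (uint8_t)(unit & 0xFF); /* low byte */
    return 2;
}
```

With this shape an ASCII space becomes 0x00 0x20 rather than a lone 0x20, so every character occupies the same width and a converter told the input is UTF-16BE sees a consistent stream.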
Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.
Thanks!
Eran