CCExtractor / ccextractor

CCExtractor - Official version maintained by the core team
https://www.ccextractor.org
GNU General Public License v2.0

Ver 0.85 CEA-708: 16 bit charset (Korean) Not support #690

Closed gkehstn closed 2 years ago

gkehstn commented 7 years ago

The 0.78 (2015-12-12) changelog says: "CEA-708: 16 bit charset support (tested on Korean)". Testing with 0.84 gives normal results; with 0.85 it is not supported.

cfsmp3 commented 7 years ago

GSoC qualification: 2 points

Izaron commented 7 years ago

Well, I changed this part of the code because I got wrong output in many videos.

Link 1 (before my changes) - https://gist.github.com/Izaron/34136a8ec8216469c3c3828acdfbe53e
Link 2 (my change) - https://github.com/CCExtractor/ccextractor/pull/623/commits/d60baf18953f1501e2d450fd7e97406cd9624c58
Link 3 (after my changes - absolutely correct) - https://gist.github.com/Izaron/44c030eae8c6c1049ae6d3e6c6d0dd32

Can you please attach the video file that produces the wrong text? If this worked correctly in 0.84 and doesn't work in 0.85, I will try to fix this error.

HaneolLee commented 7 years ago
  1. When I run it in 0.84 version, Korean is good. link 1 : https://drive.google.com/open?id=0BxFzM3fSXVOiZEo2R1E4MEFFY1U

  2. When I run it in version 0.85, I do not see Korean. link 2 : https://drive.google.com/open?id=0BxFzM3fSXVOiSnVBZkc4RlBzVkE

All run with the same options.

  3. I uploaded the tested video file: https://drive.google.com/open?id=0BxFzM3fSXVOiV3hUTnVoVVRjeDg

Izaron commented 7 years ago

I wrote a patch. Remember that you should call it as "ccextractor -svc all[EUC-KR]" or so. Resulting file - https://paste.fedoraproject.org/paste/imMCT5qPdsAk8TlL8Qa35V5M1UNdIGYhyRLivL9gydE=/raw

See issue #286

Yes, that's bad... I can say I wait for new GSoC student to come and fix it 😄

unicode45 commented 6 years ago

Version 0.85 still can not extract proper Korean characters.

I've attached sample srt files (cea708.zip) generated from the samples below. https://drive.google.com/drive/folders/0B_61ywKPmI0TZU00VjRYWENfYjg

Files whose names start with Ver079 are correct. 0.85 produces broken characters for everything except ASCII characters.

ghost commented 6 years ago

Further regressions since 0.85: Using mbc.ts linked above, I get

00:00:01,234 --> 00:00:01,368
җס, Ѩ½ ֧ٮLߺյԄ.

instead of

00:00:01,601 --> 00:00:01,735
뇗랡, 냨쇽 뚧릮샌뻺듵도.

This is caused by using write_utf16_char instead of utf16_to_utf8 in https://github.com/CCExtractor/ccextractor/commit/29180a95b17996f64d2107d8adcb8d773d150921

Attempting fix now.

ghost commented 6 years ago

....mate, I don't know how Korean encoding works, but in the previous versions I'm not getting korean.

Here's a byte-by-byte analysis between .85 and .84 respectively:

EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 EB A6 AE EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E

B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4

0.84 literally does not produce valid unicode characters, so either it was actually a fix (doubt it, 0.85 produces completely illegible strings of random words) or some other type of encoding apart from unicode. Can someone confirm what exactly Korean 708 subs are in, EUC-KR or UTF16 or something else maybe?

unicode45 commented 6 years ago

Basically EUC-KR is common but both Unicode and EUC-KR can be used. You can find which encoding is used by checking Caption Service Descriptor in PMT. If language field contains 'kor' or 'KOR' and korean_code field is 0, it's unicode(while 1 is EUC-KR).
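
The rule described above could be sketched as a small decision function. This is only an illustration: `pick_korean_charset` is a hypothetical name, its two arguments are assumed to have already been parsed from the Caption Service Descriptor (the parsing itself is not shown), and the returned labels are placeholders, not identifiers from the ccextractor code base.

```python
from typing import Optional


def pick_korean_charset(language: bytes, korean_code: int) -> Optional[str]:
    """Apply the rule from this thread to already-parsed descriptor fields.

    Hypothetical helper: both arguments are assumed to come from a parsed
    Caption Service Descriptor entry.
    """
    # The thread mentions both 'kor' and 'KOR', so compare case-insensitively.
    if language.lower() != b"kor":
        return None  # not Korean: the rule does not apply, keep the default
    # korean_code 0 means Unicode, 1 means EUC-KR (as stated above).
    return "EUC-KR" if korean_code == 1 else "Unicode"


print(pick_korean_charset(b"KOR", 1))
print(pick_korean_charset(b"eng", 0))
```

Returning None for any other language leaves non-Korean streams on whatever path they already use.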

ghost commented 6 years ago

OK, I'm fairly sure we don't have EUC-KR support, and that definitely wasn't EUC-KR since it was on notepad of all things, so I'm just as stumped here. I'll work on EUC-KR support, I guess; there's a cool free lib for that. Other than that I'm actually stumped, since none of these are legible outputs and I have no idea what encoding @HaneolLee used to get that output on 0.84.

unicode45 commented 6 years ago

I think ccextractor requires iconv (libiconv) for it. I could convert it by adding "-svc all[EUC-KR]".

ghost commented 6 years ago

Confirming, on latest builds conversions for both samples linked by @unicode45 process successfully if I add -svc all[EUC-KR]

mystery solved
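
The two byte dumps compared earlier in this thread appear consistent with that. The sketch below (the helper name `misread_pairs_as_code_points` is made up for illustration) takes the second dump, treats every two-byte sequence whose first byte has the high bit set as a single 16-bit code point, and re-encodes the result as UTF-8. That reproduces the first dump apart from its trailing 2E. The same bytes also decode cleanly with Python's `euc_kr` codec. This is a check on the posted bytes only, not a description of what the ccextractor code does internally.

```python
# Byte dumps copied from the comparison earlier in this thread.
dump_a = bytes.fromhex(
    "EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 EB A6 AE "
    "EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E"
)
dump_b = bytes.fromhex(
    "B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4"
)


def misread_pairs_as_code_points(data: bytes) -> str:
    """Treat each high-bit byte pair as one 16-bit code point (illustrative)."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >= 0x80 and i + 1 < len(data):
            # Two-byte sequence: combine both bytes into one code point.
            chars.append(chr((lead << 8) | data[i + 1]))
            i += 2
        else:
            # Plain ASCII byte (or a dangling lead byte): keep it as-is.
            chars.append(chr(lead))
            i += 1
    return "".join(chars)


misread = misread_pairs_as_code_points(dump_b)
decoded = dump_b.decode("euc_kr")

# Compare against the first dump without its final byte (2E).
print(misread.encode("utf-8") == dump_a[:-1])
print(decoded)
```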

cfsmp3 commented 6 years ago

Can we make it work without the user passing EUC-KR? (i.e. detect the correct encoding ourselves)

ghost commented 6 years ago

Doesn't seem possible. Valid EUC-KR characters are also valid Unicode characters, and I reckon it would be very hard to tell automatically what the correct encoding is.

cfsmp3 commented 6 years ago

@gray-v did you read this?

Basically EUC-KR is common but both Unicode and EUC-KR can be used. You can find which encoding is used by checking Caption Service Descriptor in PMT. If language field contains 'kor' or 'KOR' and korean_code field is 0, it's unicode(while 1 is EUC-KR).

thetransformerr commented 6 years ago

Hi all, @unicode45, @cfsmp3

When I tested with -svc all it was working fine, but as per the suggestion above:

Basically EUC-KR is common but both Unicode and EUC-KR can be used. You can find which encoding is used by checking Caption Service Descriptor in PMT. If language field contains 'kor' or 'KOR' and korean_code field is 0, it's unicode(while 1 is EUC-KR).

I cannot find any entry for or reference to such a field in the PMT, either in the code or in the PMT standard (ISO 13818, table 2.24). I may have missed it; could anyone please point out where I can find references that would make the above changes possible? All I could determine is that PMTs are used to store the program information guide and that the table location can be defined for each service in the PAT, but ISO 13818 recommends it as 0x0002.

The following line from the code looks related, but I can't work out how to modify it:

https://github.com/CCExtractor/ccextractor/blob/25a8b53ff55f904f29e4810bdaedd4f154567677/src/lib_ccx/ts_tables.c#L94

Please point out what I am missing.

unicode45 commented 6 years ago

Hi, @thetransformerr

I've found some information, but I'm sorry, it's written in Korean (Google Translate will be helpful). http://www.nl.go.kr/app/nl/search/common/download.jsp?file_id=FILE-00008442489

Here's a summary related to the PMT.

I could not find any Korean subtitle written in Unicode in my experience so far. I hope it will be helpful.

thetransformerr commented 6 years ago

Hey @unicode45,

Thanks very much for your reply and help. Given that with svc we are able to extract Korean, wouldn't it be useful if we made svc 1, EUC-KR the default? In case of failure, the user can specify Unicode manually.

unicode45 commented 6 years ago

Hi, @thetransformerr

Wouldn't it be useful if we make svc 1 , EUC-KR as default ?

Yes, I think so, because all broadcasts were svc 1, EUC-KR in my several years of experience.

Thanks,

PunitLodha commented 2 years ago

Wouldn't it be useful if we make svc 1 , EUC-KR as default ?

We cannot default to EUC-KR on all videos, since they come in different languages, not just Korean. I think the best solution here is to just pass the EUC-KR parameter manually.

Same as #286