Closed gkehstn closed 2 years ago
GSoC qualification: 2 points
Well, I changed this part of the code because I got wrong output on many videos.
Link 1 (before my changes): https://gist.github.com/Izaron/34136a8ec8216469c3c3828acdfbe53e
Link 2 (my change): https://github.com/CCExtractor/ccextractor/pull/623/commits/d60baf18953f1501e2d450fd7e97406cd9624c58
Link 3 (after my changes, absolutely correct): https://gist.github.com/Izaron/44c030eae8c6c1049ae6d3e6c6d0dd32
Can you please attach the video file that produces the wrong text? If this worked correctly in 0.84 and doesn't work in 0.85, I will try to fix this error.
When I run it with version 0.84, the Korean text is fine. Link 1: https://drive.google.com/open?id=0BxFzM3fSXVOiZEo2R1E4MEFFY1U
When I run it with version 0.85, I do not see any Korean. Link 2: https://drive.google.com/open?id=0BxFzM3fSXVOiSnVBZkc4RlBzVkE
All runs used the same options.
I wrote a patch
Remember you should call it as "ccextractor
See issue # 286
Yes, that's bad... I can say I wait for new GSoC student to come and fix it 😄
Version 0.85 still cannot extract Korean characters properly.
I've attached sample SRT files generated from the samples below. https://drive.google.com/drive/folders/0B_61ywKPmI0TZU00VjRYWENfYjg
Files whose names start with Ver079 are correct. 0.85 produces broken characters for everything except ASCII characters. cea708.zip
Further regressions since 0.85: using mbc.ts linked above, I get
00:00:01,234 --> 00:00:01,368
җס, Ѩ½ ֧ٮLߺյԄ.
instead of
00:00:01,601 --> 00:00:01,735
뇗랡, 냨쇽 뚧릮샌뻺듵도.
This is caused by using write_utf16_char instead of utf16_to_utf8 in https://github.com/CCExtractor/ccextractor/commit/29180a95b17996f64d2107d8adcb8d773d150921
Attempting fix now.
...mate, I don't know how Korean encoding works, but I'm not getting Korean in the previous versions either.
Here's a byte-by-byte comparison of the output from 0.85 and 0.84, respectively:
EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 EB A6 AE EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E
B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4
0.84 literally does not produce valid Unicode, so either the 0.85 change was actually a fix (I doubt it, since 0.85 produces completely illegible strings of random words) or the 0.84 output is in some encoding other than Unicode. Can someone confirm which encoding Korean 708 subs actually use: EUC-KR, UTF-16, or something else?
Basically, EUC-KR is common, but both Unicode and EUC-KR can be used. You can find out which encoding is used by checking the Caption Service Descriptor in the PMT. If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (while 1 is EUC-KR).
OK, I'm fairly sure we don't have EUC-KR support, and that definitely wasn't EUC-KR since it was on Notepad of all things, so I'm just as stumped here. I'll work on EUC-KR support, I guess; there's a cool free lib for that. Other than that I'm actually stumped, since none of these are legible outputs and I have no idea what encoding @HaneolLee used to get that output on 0.84.
I think ccextractor requires iconv (libiconv) for it. I could convert it by adding "-svc all[EUC-KR]".
Confirming, on latest builds conversions for both samples linked by @unicode45 process successfully if I add -svc all[EUC-KR]
mystery solved
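The two byte dumps quoted earlier in the thread are consistent with this. The following is a small Python sketch, not ccextractor code, and the function name `as_utf16_then_utf8` is made up for illustration. It takes the 0.84 bytes, treats each two-byte EUC-KR pair as if it were a single 16-bit Unicode code point, and encodes the result as UTF-8. That reproduces the 0.85 bytes, apart from the trailing `2E` that is missing from the 0.84 dump:

```python
# Byte dumps copied from the 0.85 / 0.84 comparison in this thread.
V084 = bytes.fromhex(
    "B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4"
)
V085 = bytes.fromhex(
    "EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 "
    "EB 9A A7 EB A6 AE EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E"
)


def as_utf16_then_utf8(raw: bytes) -> bytes:
    """Misread each EUC-KR byte pair as one 16-bit code point and emit UTF-8."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] < 0x80:
            # ASCII bytes (comma, space) pass through unchanged.
            out.append(raw[i])
            i += 1
        else:
            # Lead byte and trail byte are glued into one 16-bit value.
            code_point = (raw[i] << 8) | raw[i + 1]
            out += chr(code_point).encode("utf-8")
            i += 2
    return bytes(out)


# Compare against the 0.85 dump without its final 2E byte.
print(as_utf16_then_utf8(V084) == V085[:-1])
# Decode the 0.84 bytes as what they actually are.
print(V084.decode("euc_kr"))
```

So the 0.84 output was raw EUC-KR all along, and 0.85 re-encoded those same 16-bit values as if they were Unicode.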
Can we make it work without the user passing EUC-KR? (i.e. detect the correct encoding ourselves)
Doesn't seem possible: valid EUC-KR characters are also valid Unicode characters, and I reckon it would be very hard to tell automatically what the correct encoding is.
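To illustrate why the bytes alone are ambiguous, here is a rough Python check (a sketch, not project code). It assumes the KS X 1001 Hangul rows occupy lead bytes 0xB0 to 0xC8 and trail bytes 0xA1 to 0xFE in EUC-KR. Every such pair, read as a big-endian 16-bit value, also lands inside the Unicode Hangul Syllables block (U+AC00 to U+D7A3), so a misread stream still looks like Korean text instead of failing outright:

```python
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # Unicode Hangul Syllables block


def ambiguous_pairs():
    """Yield EUC-KR Hangul byte pairs that are also valid Hangul code points."""
    for lead in range(0xB0, 0xC9):
        for trail in range(0xA1, 0xFF):
            value = (lead << 8) | trail
            if HANGUL_FIRST <= value <= HANGUL_LAST:
                yield bytes([lead, trail])


pairs = list(ambiguous_pairs())
total = (0xC9 - 0xB0) * (0xFF - 0xA1)
print(len(pairs), "of", total, "pairs are valid under both readings")

# Show one pair under both interpretations.
sample = pairs[0]
as_euc_kr = sample.decode("euc_kr")
as_code_point = chr(int.from_bytes(sample, "big"))
print(sample.hex(), "->", as_euc_kr, "vs", as_code_point)
```

This only shows that validity checks cannot separate the two encodings; it says nothing about signalling in the stream itself.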
@gray-v did you read this?
Basically, EUC-KR is common, but both Unicode and EUC-KR can be used. You can find out which encoding is used by checking the Caption Service Descriptor in the PMT. If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (while 1 is EUC-KR).
Hi all, @unicode45, @cfsmp3,
When I tested with -svc all it was working fine, but regarding the suggestion above:
Basically, EUC-KR is common, but both Unicode and EUC-KR can be used. You can find out which encoding is used by checking the Caption Service Descriptor in the PMT. If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (while 1 is EUC-KR).
I cannot find any entry for, or reference to, such a field in the PMT, either in the code or in the PMT standard (ISO 13818, table 2.24), though I may have missed it. Would anyone please point out where I can find references that would make the above change possible? All I could determine was that the PMT is used to store program information, and that its table location can be defined for each service in the PAT, but ISO 13818 recommends it as 0x0002.
The following lines from the code look like it, but I can't understand how to modify them.
Please point out what I am missing.
Hi, @thetransformerr
I've found some information, but I'm sorry, it's written in Korean (Google Translate will be helpful). http://www.nl.go.kr/app/nl/search/common/download.jsp?file_id=FILE-00008442489
Here's a summary of the parts related to the PMT.
Page No.25, Chapter B.1: "PMT is an optional value." (I think that's the reason you could not find it in the PMT.)
Page No.25, Chapter B.2 to Page No.28: describes the caption service descriptor.
Page No.28, Chapter B.3: "DTVCC Default Mode in Korea: If DTVCC subtitle data exists in the DTVCC transmission channel but the PMT and EIT do not have any caption service descriptor, it will be treated as Service 1 and EUC-KR." (So, if you cannot find any PMT information for it, please treat it as Service 1 and EUC-KR.)
In my experience so far, I have not come across any Korean subtitles written in Unicode. I hope this helps.
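Putting the two rules from this thread together, the selection logic could look roughly like the Python sketch below. It is only an illustration: `CaptionService` and `pick_encoding` are hypothetical names, not ccextractor code, and the fields are assumed to have already been parsed out of the caption service descriptor (the thread does not say where korean_code sits in the descriptor).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CaptionService:
    """Hypothetical, already-parsed entry of a caption service descriptor."""
    service_number: int
    language: str     # three-letter language field, e.g. "kor" or "KOR"
    korean_code: int  # 0 = Unicode, 1 = EUC-KR, per the rule quoted above


def pick_encoding(entry: Optional[CaptionService]) -> Tuple[int, Optional[str]]:
    """Return (service number, charset label); None means no Korean hint."""
    if entry is None:
        # No descriptor in PMT or EIT: Korean default mode, Service 1 and EUC-KR.
        return 1, "EUC-KR"
    if entry.language.lower() == "kor":
        charset = "Unicode" if entry.korean_code == 0 else "EUC-KR"
        return entry.service_number, charset
    # Other languages: this thread gives no rule, so leave the charset unset.
    return entry.service_number, None


print(pick_encoding(None))
print(pick_encoding(CaptionService(1, "KOR", 0)))
print(pick_encoding(CaptionService(2, "eng", 0)))
```

The `None` branch only makes sense when the stream is already known to be a Korean broadcast; for other streams a missing descriptor says nothing about the charset.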
Hey @unicode45,
Thanks very much for your reply and help. Given that we are able to extract Korean with -svc, wouldn't it be useful to make svc 1, EUC-KR the default? In case of failure, the user can specify Unicode manually.
Hi, @thetransformerr
Wouldn't it be useful if we make svc 1, EUC-KR as default?
Yes, I think so, because in my several years of experience all broadcasts were svc 1, EUC-KR.
Thanks,
Wouldn't it be useful if we make svc 1, EUC-KR as default?
We cannot default to EUC-KR for all videos, since they come in many different languages, not just Korean. I think the best solution here is to just pass the EUC-KR parameter manually.
Same as #286
0.78 (2015-12-12) - CEA-708: 16 bit charset support (tested on Korean).
0.84: test result normal.
0.85: not supported.