BYVoid / uchardet

An encoding detector library ported from Mozilla
605 stars · 106 forks

WINDOWS-1253 file detected as ISO-8859-7 #33

Closed · larvanitis closed this 8 years ago

larvanitis commented 8 years ago

These two are the most common Greek encodings and they are mostly identical. One major difference between them is the mapping of GREEK CAPITAL LETTER ALPHA WITH TONOS (Ά), which is very common in Greek texts/subtitles.

| code | ISO 8859-7 | windows-1253 |
|------|------------|--------------|
| 0xA1 | U+2018 LEFT SINGLE QUOTATION MARK | U+0385 GREEK DIALYTIKA TONOS |
| 0xA2 | U+2019 RIGHT SINGLE QUOTATION MARK | U+0386 GREEK CAPITAL LETTER ALPHA WITH TONOS |
| 0xA4 | _unassigned_ | U+00A4 CURRENCY SIGN |
| 0xA5 | _unassigned_ | U+00A5 YEN SIGN |
| 0xAE | _unassigned_ | U+00AE REGISTERED SIGN |
| 0xB5 | U+0385 GREEK DIALYTIKA TONOS | U+00B5 MICRO SIGN |
| 0xB6 | U+0386 GREEK CAPITAL LETTER ALPHA WITH TONOS | U+00B6 PILCROW SIGN |

Source: ISO 8859-7 vs. windows-1253

I don't know how the detection works, but more 0xA2 bytes than 0xB6 bytes would be a strong indication of WINDOWS-1253 (and vice versa).
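As a hypothetical illustration of that heuristic (this is not uchardet code; the function name and decision rule are made up for this sketch):

```python
def guess_greek_encoding(data: bytes) -> str:
    """Guess between the two common Greek encodings from raw bytes.

    WINDOWS-1253 puts 'Ά' (very common in Greek) at 0xA2, while
    ISO-8859-7 puts it at 0xB6, so a surplus of one byte over the
    other is a strong hint.
    """
    a2, b6 = data.count(0xA2), data.count(0xB6)
    if a2 > b6:
        return "WINDOWS-1253"
    if b6 > a2:
        return "ISO-8859-7"
    return "unknown"

sample = "Άνθρωποι και Άλογα".encode("windows-1253")
print(guess_greek_encoding(sample))  # WINDOWS-1253
```

A real detector would combine this with broader statistics, but on the sample above the single byte pair already settles the question.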

PS: I use uchardet through mpv for subtitle encoding detection, and I would say about 10-20% of subtitles have this problem.

Attached sample file

Jehan commented 8 years ago

Thanks for the bug report. Indeed, these kinds of slight differences between two encodings are the most difficult cases (for the characters of a given language, that is; it doesn't matter much how different the two encodings are for characters of other languages).

I don't know how the detection works, but more 0xA2 bytes than 0xB6 bytes would be a strong indication of WINDOWS-1253 (and vice versa).

uchardet is mostly statistical (not only — it is a mix of techniques — but statistics are the biggest part, in particular for single-byte encodings like these). The issue with the kind of example you give (I have encountered similar ones for other languages/encodings) is that the alternative reading is not strictly an error. Indeed, I would assume that Greek texts could have the right single quotation mark as well. No? Still, I can foresee some improvements. I'll look into it.
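A toy sketch of the statistical idea described here (the frequency weights are made up for illustration; this is not uchardet's actual model): decode the buffer under each candidate encoding and score how well the resulting characters match expected Greek letter frequencies.

```python
from collections import Counter

# Made-up weights for a few common Greek characters (illustrative only).
GREEK_FREQ = {"α": 0.11, "ο": 0.10, "ι": 0.09, "τ": 0.08, "ε": 0.08, "ά": 0.02}

def greek_score(data: bytes, encoding: str) -> float:
    """Score how 'Greek-like' the buffer looks when decoded as `encoding`."""
    try:
        text = data.decode(encoding).lower()
    except UnicodeDecodeError:
        return float("-inf")
    counts = Counter(text)
    total = len(text) or 1
    return sum(GREEK_FREQ.get(ch, 0.0) * n / total for ch, n in counts.items())

data = "Άνοιξε η πόρτα".encode("windows-1253")
best = max(["windows-1253", "iso8859_7"], key=lambda enc: greek_score(data, enc))
print(best)  # windows-1253
```

The wrong candidate decodes 0xA2 to a quotation mark instead of 'Ά', so it loses the small frequency weight that the correct decoding earns — which is exactly why a character common in the target language but placed differently in the two encodings is so decisive.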

What is this character used for exactly: Ά? In my logs from when I built the Greek language model, this character was not among the most used characters at all (less than 0.1%). Now, my data was Wikipedia articles, and you seem to say that this character is mostly used in subtitles. Is it not used in "usual" Greek texts other than subtitles? My main problem here is that I don't have access to thousands of Greek subtitles (whereas I have access to a huge quantity of Wikipedia articles), and if I were to use free-of-charge subtitles from the web, I am not even sure of the legality of training the engine on them (since simply downloading them usually already breaks copyright! Very few subtitles are legally available for download).

PS: I use uchardet through mpv for subtitle encoding detection, and I would say about 10-20% of subtitles have this problem.

I'm happy that you still get good detection 80% of the time, since the detection system mpv used to have would never have detected your encoding at all (neither encoding is supported by enca). :-)

Jehan commented 8 years ago

Note: in any case, I'll retrain the Greek engine tomorrow, forcing some importance onto this character while still using Wikipedia data. Hopefully that will be enough to improve the detection.

larvanitis commented 8 years ago

Thanks for the quick reply.

Indeed, I would assume that Greek texts could have the right single quotation mark as well.

Yes, indeed.

What is this character used for exactly: Ά?

The Greek alpha (Αα) is equivalent to the Latin Aa. Like all Greek vowels, it takes an accent (Άά) on the stressed syllable of any word with at least two syllables (e.g., if English had the same accenting rule, the leading 'a' in 'animal' would be accented).

Also, the capitalization rules are the same as in English (first word of a sentence, names, etc.).

And you seem to say that this character is mostly used in subtitles. Is it not used in "usual" Greek texts other than subtitles?

It is used everywhere, but subtitles tend to have a lot of short sentences and names, making the capitalization rules apply more frequently than in longer texts such as Wikipedia articles.

Out of curiosity 1: How do you train from Wikipedia? Do you get the UTF-8 content and convert it to the various encodings, which you then feed to the algorithm?

Out of curiosity 2: Do you know if mpv analyses the whole subtitle file contents, including the format data (timing, markup, etc.), or just the stripped text that is actually displayed?

Jehan commented 8 years ago

Also, the capitalization rules are the same as in English (first word of a sentence, names, etc.).

My algorithm lowercases everything anyway (so 'a' and 'A' are the same), which makes things simpler. uchardet has no grammatical logic embedded (e.g., no notion of what a "sentence" is); it is purely statistical. So far, this does not seem to affect the quality of the detection much (which is quite good, even though it could obviously be better).

By the way, I was wrong yesterday when I said that 'Ά' was rarely used: I simply forgot to search for its lowercase form as well. It is actually used nearly 2% of the time, which makes it the 16th or 17th most used character in Greek texts (depending on the data I used).

How do you train from Wikipedia? Do you get the UTF-8 content and convert it to the various encodings, which you then feed to the algorithm?

Exactly what you say. Obviously I cap the number of pages (otherwise it would just go on indefinitely).
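The conversion step could look roughly like this (an assumed sketch of the pipeline described above, not the actual uchardet training scripts):

```python
from collections import Counter

def byte_frequencies(utf8_text: str, encoding: str) -> Counter:
    """Re-encode UTF-8 training text into a legacy encoding and tally bytes."""
    # errors="replace" substitutes characters the legacy encoding lacks;
    # a real pipeline might instead skip pages that fail to convert.
    return Counter(utf8_text.encode(encoding, errors="replace"))

corpus = "Άλφα και Ωμέγα."  # stand-in for downloaded Wikipedia articles
stats = {enc: byte_frequencies(corpus, enc)
         for enc in ("windows-1253", "iso8859_7")}

# The same character lands on different bytes in each encoding:
print(stats["windows-1253"][0xA2], stats["iso8859_7"][0xB6])  # 1 1
```

One pass over the same UTF-8 corpus thus yields a separate byte-frequency model per target encoding, which is what makes training from a single Unicode source practical.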

Do you know if mpv analyses the whole subtitle file contents, including the format data (timing, markup, etc.), or just the stripped text that is actually displayed?

Not sure. Obviously, stripping the markup would lead to more accuracy, but I don't know if they bother (and uchardet stays efficient even with text mixed in with some English markup). Moreover, it may not be easy to strip the markup if you don't even know which encoding the file uses (though, on the other hand, I imagine most markup characters are ASCII, and I don't know of any common subtitle encoding that is not ASCII-compatible; still, that could lead to much more complicated parsing).

So yes, my guess is that they don't strip anything, but I have not actually checked! Feel free to check and tell me. :-)
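The ASCII point above can be checked concretely (illustrative sketch, assuming a typical SRT cue): in a cue encoded as WINDOWS-1253, all the timing and markup bytes fall in the ASCII range, so only the dialogue bytes carry encoding-specific information.

```python
cue = "1\n00:00:01,000 --> 00:00:03,000\n<i>Άντε γεια!</i>\n"
raw = cue.encode("windows-1253")

# Bytes below 0x80 decode identically in any ASCII-compatible encoding.
ascii_bytes = [b for b in raw if b < 0x80]
greek_bytes = [b for b in raw if b >= 0x80]

# All structural characters (counter, timestamps, '<i>' tags) are ASCII,
# while the Greek dialogue occupies the high bytes, including 0xA2 ('Ά').
print(0xA2 in greek_bytes)  # True
```

So even unstripped markup only dilutes the statistics with ASCII noise; it cannot masquerade as the high bytes that distinguish the two Greek encodings.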

Jehan commented 8 years ago

Hi @larvanitis,

I have pushed some changes. I still want to test this more deeply on various files in other languages, so it is possible the changes will change further or even be reverted.

Still, could you test the current version and tell me whether it improves detection of your various files and subtitles? Thanks!

larvanitis commented 8 years ago

My algorithm tends to lowercase everything anyway (so 'a' and 'A' are the same), which makes things simpler. uchardet does not have any grammatical logics embedded (like what is a "sentence"?). It is purely statistical.

I think that's the culprit in this case. From what you said in your post, I conclude that the process:

  1. takes UTF-8 'Ά'
  2. lowercases it to UTF-8 'ά'
  3. converts it, for training, to each encoding:
     - WINDOWS-1253 'ά' (0xDC instead of 0xA2)
     - ISO-8859-7 'ά' (0xDC instead of 0xB6)

Notice that the distinguishing signal carried by this character is lost during the lowercase conversion.
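The collapse described in the steps above is easy to reproduce with Python's codecs (an illustration of the concern, not a claim about uchardet's internals):

```python
upper = "Ά"            # U+0386 GREEK CAPITAL LETTER ALPHA WITH TONOS
lower = upper.lower()  # 'ά', U+03AC

# Upper case: the two encodings disagree — this is the distinguishing signal.
assert upper.encode("windows-1253") == b"\xa2"
assert upper.encode("iso8859_7") == b"\xb6"

# Lower case: both encodings agree on 0xDC, so the signal disappears.
assert lower.encode("windows-1253") == b"\xdc"
assert lower.encode("iso8859_7") == b"\xdc"
```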

I am not sure what a good solution would be, but this might affect other languages with similar differences among their encodings, especially between ISO and Microsoft code pages.

mpv... So yes, my guess is that they don't strip anything, but I have not actually checked! Feel free to check and tell me. :-)

I went ahead and asked at https://github.com/mpv-player/mpv/issues/3180.

I have pushed some changes. I still want to test this more deeply on various files in other languages, so it is possible the changes will change further or even be reverted.

Still, could you test the current version and tell me whether it improves detection of your various files and subtitles?

I'd be happy to. Where can I get your modified code, or (even better :) a binary? I have access to Linux and Windows and can compile on the former.

Jehan commented 8 years ago

I think that's the culprit in this case. From what you said in your post, I conclude that the process: [...]

No, that's not how it works. The lowercasing is not a problem here. You should reason not in terms of encodings but in terms of characters: the statistics are language-based, not encoding-based. In any case, there are no conversion errors here, and treating lower and upper case as different characters is not the solution.

I went ahead and asked at mpv-player/mpv#3180.

The answer is as I thought.

Where can I get your modified code, or (even better :) a binary?

No binary, but you can get the updated code here on GitHub:

```
git clone https://github.com/BYVoid/uchardet.git
```

Then build with CMake.

larvanitis commented 8 years ago

Do you mean I should build master? Its last commit is from March 27.

Jehan commented 8 years ago

Oops, sorry! I am slowly moving off GitHub and had pushed to another remote! I have now updated the GitHub remote with the latest commits as well.

larvanitis commented 8 years ago

I tested 5-6 files using the uchardet CLI and they are now detected correctly. I am closing the issue.

Thanks for your time and support!