BYVoid / uchardet

An encoding detector library ported from Mozilla
Other
609 stars 107 forks source link

Stop detection early on control characters #18

Closed lovasoa closed 8 years ago

lovasoa commented 8 years ago

See #17

lovasoa commented 8 years ago

Please review carefully, I didn't dig into the code very deeply, and may have broken things.

lovasoa commented 8 years ago

This PR seems to fix #16 as well.

Jehan commented 8 years ago

Not tested, but looks like it may be ok. Please create a new category though for invalid characters (maybe 251? Or keep 255 and shift other cats down). A control character is not an invalid character. They are indeed scarce in a text, but they are not forbidden.

I'll review after you change this. :-)

Jehan commented 8 years ago

Actually looks like you will have to check all other Single-Character probers to make sure the chosen category is not used any other way. Your change does not fix #16. You only rendered some control codepoints invalid which had the side effect of returning the right encoding.

Jehan commented 8 years ago

Also since you are creating a generic meaning shared among all Single-Byte probers, I suggest to create a constant in src/nsSBCharSetProber.h. A 3-letter constant like "ILG" (ILLEGAL) may do the trick.

lovasoa commented 8 years ago

I don't have too much time to work on this currently. :( But if you create a new category for forbidden characters independent from control characters, I would advise to write some code to handle the latter. I think handling them as invalid characters is a way more educated guess then considering them as frequent as punctuation characters (as it's the case currently).

Jehan commented 8 years ago

I think handling them as invalid characters is a way more educated guess.

Well I thought same as you at first, a few weeks ago when I began hacking uchardet. But then we can't assume we are representative to all use cases. For instance, I read in the research paper:

"We found that some web sites send 0x00 inside http stream".

I didn't know this because I don't represent all use cases by myself. It turns out that the NULL character (0x00) is a perfectly valid ASCII (hence most other charsets too) control character and its presence should not be considered invalid. Even worse, if we do, then the same text would actually turn out invalid for all charset detection and uchardet would return no response even though it would have been able to return the right answer if we didn't consider the character as invalid! I am not sure who uses the other ctrl characters, but I can only assume that they have been created for a purpose, and therefore they can be found somewhere out there. Therefore we just can't treat them as invalid. This would break uchardet.

Now though, I think we can create 2 categories of control characters:

I don't have too much time to work on this currently. :(

Ok well if you do find the time, you are welcome. I may work on this otherwise, but I can't tell you when either. Probably not immediately though.

Jehan commented 8 years ago

Ok. I fixed what needed be. In particular we don't shortcut detection on control characters. As previously explained, this is a conceptual error. But we will do it on actually invalid codepoints (like in your example in issue #17). Thanks for the patch.