Closed lovasoa closed 8 years ago
Please review carefully, I didn't dig into the code very deeply, and may have broken things.
This PR seems to fix #16 as well.
Not tested, but looks like it may be ok. Please create a new category though for invalid characters (maybe 251? Or keep 255 and shift other cats down). A control character is not an invalid character. They are indeed scarce in a text, but they are not forbidden.
I'll review after you change this. :-)
Actually looks like you will have to check all other Single-Character probers to make sure the chosen category is not used any other way. Your change does not fix #16. You only rendered some control codepoints invalid which had the side effect of returning the right encoding.
Also since you are creating a generic meaning shared among all Single-Byte probers, I suggest to create a constant in src/nsSBCharSetProber.h. A 3-letter constant like "ILG" (ILLEGAL) may do the trick.
I don't have too much time to work on this currently. :( But if you create a new category for forbidden characters independent from control characters, I would advise to write some code to handle the latter. I think handling them as invalid characters is a way more educated guess then considering them as frequent as punctuation characters (as it's the case currently).
I think handling them as invalid characters is a way more educated guess.
Well I thought same as you at first, a few weeks ago when I began hacking uchardet. But then we can't assume we are representative to all use cases. For instance, I read in the research paper:
"We found that some web sites send 0x00 inside http stream".
I didn't know this because I don't represent all use cases by myself. It turns out that the NULL character (0x00) is a perfectly valid ASCII (hence most other charsets too) control character and its presence should not be considered invalid. Even worse, if we do, then the same text would actually turn out invalid for all charset detection and uchardet would return no response even though it would have been able to return the right answer if we didn't consider the character as invalid! I am not sure who uses the other ctrl characters, but I can only assume that they have been created for a purpose, and therefore they can be found somewhere out there. Therefore we just can't treat them as invalid. This would break uchardet.
Now though, I think we can create 2 categories of control characters:
I don't have too much time to work on this currently. :(
Ok well if you do find the time, you are welcome. I may work on this otherwise, but I can't tell you when either. Probably not immediately though.
Ok. I fixed what needed be. In particular we don't shortcut detection on control characters. As previously explained, this is a conceptual error. But we will do it on actually invalid codepoints (like in your example in issue #17). Thanks for the patch.
See #17