Stop detection early on control characters

lovasoa commented 8 years ago

See #17

lovasoa commented 8 years ago

Please review carefully, I didn't dig into the code very deeply, and may have broken things.

lovasoa commented 8 years ago

This PR seems to fix #16 as well.

Jehan commented 8 years ago

Not tested, but looks like it may be ok. Please create a new category though for invalid characters (maybe 251? Or keep 255 and shift other cats down). A control character is not an invalid character. They are indeed scarce in a text, but they are not forbidden.

I'll review after you change this. :-)

Jehan commented 8 years ago

Actually looks like you will have to check all other Single-Character probers to make sure the chosen category is not used any other way. Your change does not fix #16. You only rendered some control codepoints invalid which had the side effect of returning the right encoding.

Jehan commented 8 years ago

Also since you are creating a generic meaning shared among all Single-Byte probers, I suggest to create a constant in src/nsSBCharSetProber.h. A 3-letter constant like "ILG" (ILLEGAL) may do the trick.

lovasoa commented 8 years ago

I don't have too much time to work on this currently. :( But if you create a new category for forbidden characters independent from control characters, I would advise to write some code to handle the latter. I think handling them as invalid characters is a way more educated guess then considering them as frequent as punctuation characters (as it's the case currently).

Jehan commented 8 years ago

I think handling them as invalid characters is a way more educated guess.

Well I thought same as you at first, a few weeks ago when I began hacking uchardet. But then we can't assume we are representative to all use cases. For instance, I read in the research paper:

"We found that some web sites send 0x00 inside http stream".

I didn't know this because I don't represent all use cases by myself. It turns out that the NULL character (0x00) is a perfectly valid ASCII (hence most other charsets too) control character and its presence should not be considered invalid. Even worse, if we do, then the same text would actually turn out invalid for all charset detection and uchardet would return no response even though it would have been able to return the right answer if we didn't consider the character as invalid! I am not sure who uses the other ctrl characters, but I can only assume that they have been created for a purpose, and therefore they can be found somewhere out there. Therefore we just can't treat them as invalid. This would break uchardet.

Now though, I think we can create 2 categories of control characters:

the neutral ones, which are likely all the ones between 0x00 and 0x1F because I believe their codepoints are ctrl characters in all single bytes encoding (TODO: check if this is really the case), which means we can't make any conclusion when we find them (hence they are "neutral", which is how they are already processed right now).
the negative ones: all other (rarely used) ctrl characters. Having such codepoints in a text is not wrong per-se, but could be considered a good indication this is likely not the right encoding. Thus we could associate it with a negative weight because it would be a very rare occurrence. This may improve general detection without breaking it in cases of rare but legitimate use.

I don't have too much time to work on this currently. :(

Ok well if you do find the time, you are welcome. I may work on this otherwise, but I can't tell you when either. Probably not immediately though.

Jehan commented 8 years ago

Ok. I fixed what needed be. In particular we don't shortcut detection on control characters. As previously explained, this is a conceptual error. But we will do it on actually invalid codepoints (like in your example in issue #17). Thanks for the patch.

BYVoid / uchardet

Stop detection early on control characters #18