Kroc / NoNonsenseForum

A free, open source, PHP-based simple discussion forum. It favours removing barriers to conversation rather than massaging egos. Download Here: https://github.com/Kroc/NoNonsenseForum/archive/master.zip
http://camendesign.com/nononsense_forum

Validate UTF-8 bytes #176

Open Kroc opened 11 years ago

Kroc commented 11 years ago

In addition to #175, we should validate all input strings to ensure that:

  1. They are actually UTF-8
  2. They contain no invalid byte sequences (overlong encodings, surrogates, truncated multi-byte sequences and the like)
  3. Perhaps removal of some invalid, undesired byte ranges (e.g. BDI mark)

Zegnat commented 11 years ago

See this Stack Overflow question on validating UTF-8.

Clear options:

  1. the regex by the W3C Internationalization department:

    $string =~
     m/\A(
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
     )*\z/x;
  2. PHP’s mb_check_encoding()
  3. Comparing the original input to an iconv() of the input: @iconv('UTF-8', 'UTF-8', $string) === $string (source)
  4. And pretty much the same as 3, but using mb_convert_encoding() rather than iconv(): $string === mb_convert_encoding(mb_convert_encoding($string, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32') (source)
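
For reference, option 1 might be ported to PHP roughly as below. This is only a sketch: the function name `is_wellformed_utf8()` is made up for illustration and is not part of NNF. The pattern is the W3C one unchanged, apart from a non-capturing group and a possessive quantifier, which I am assuming the targeted PCRE versions support. Note that it also rejects ASCII control characters other than tab, LF and CR, which is stricter than a pure well-formedness check.

```php
<?php
// Hypothetical helper: the name is_wellformed_utf8() is illustrative only.
function is_wellformed_utf8($string)
{
    // Byte-oriented match (no /u flag); /x allows whitespace and comments.
    // The possessive *+ avoids backtracking into input already matched.
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII (tab, LF, CR, printable)
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*+\z/x';

    // preg_match() returns 1 on match, 0 on no match, false on error.
    return preg_match($pattern, $string) === 1;
}
```

On very long inputs PCRE may hit one of its internal limits, in which case `preg_match()` returns false and this sketch treats the string as invalid. That is the safe direction to fail in, but it is worth knowing about.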

Zegnat commented 11 years ago

And then there is the Web Application Component Toolkit wiki on internationalization. More specifically:

  1. Checking UTF-8 for Well Formedness, which goes into some detail and again mentions iconv() as ‘perhaps the wisest’ solution.
  2. Common XML Problem Areas with UTF-8, listing the UTF-8 characters that are invalid in XML (important for NNF storage).
  3. Handling UTF-8 with PHP, giving a list of string functions that might result in corrupted UTF-8 strings. I have no idea how many of these bugs are still in the PHP versions targeted by NNF but it would be good if we can learn from this and make sure we don’t introduce invalid combinations ourselves.
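
To make point 2 concrete, here is a minimal sketch of stripping characters that XML 1.0 does not allow. It assumes the input has already been checked as valid UTF-8, and the function name `strip_invalid_xml_chars()` is hypothetical rather than anything in NNF. The character class is the XML 1.0 `Char` production.

```php
<?php
// Hypothetical helper: strip_invalid_xml_chars() is an illustrative name.
function strip_invalid_xml_chars($string)
{
    // Keep only code points allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // With /u, preg_replace() returns null if $string is not valid UTF-8.
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $string
    );
}
```

Because of the null return on malformed input, the caller would still need to validate first, or treat null as a rejection.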

Kroc commented 11 years ago

Thanks for the immensely helpful information.

iconv is not present in all configurations. We can use it if it's there, but we can't depend entirely on it.
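
Something along these lines, perhaps. This is a rough sketch rather than actual NNF code, and the name `nnf_is_utf8()` is invented for illustration. It tries iconv first, then mbstring, and as a last resort relies on PCRE's UTF-8 mode, which is a swapped-in technique not mentioned above. How strict each path is depends on the underlying library version, so the three are not guaranteed to agree on every edge case.

```php
<?php
// Hypothetical helper: the name nnf_is_utf8() is illustrative, not NNF code.
function nnf_is_utf8($string)
{
    // Use the iconv round trip when the extension is available.
    // The @ suppresses the notice iconv raises on illegal input.
    if (function_exists('iconv')) {
        return @iconv('UTF-8', 'UTF-8', $string) === $string;
    }

    // Otherwise fall back to mbstring, if loaded.
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($string, 'UTF-8');
    }

    // Last resort (assumption: PCRE was built with UTF-8 support):
    // an empty pattern with /u fails with false on malformed UTF-8.
    return preg_match('//u', $string) === 1;
}
```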

Zegnat commented 10 years ago

Adding this here since it looks like required reading, though I haven’t gotten around to it yet: UTR #36: Unicode Security Considerations.