Open Kroc opened 11 years ago
See this Stack Overflow question on validating UTF-8.
Clear options:
the regex by the W3C Internationalization department:
$string =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
mb_check_encoding()
iconv(), comparing the converted output with the input: @iconv('UTF-8', 'UTF-8', $string) === $string (source)
mb_convert_encoding() rather than iconv(): $string === mb_convert_encoding(mb_convert_encoding($string, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32') (source)

And then there is the Web Application Component Toolkit wiki on internationalization. More specifically:
iconv() again as ‘perhaps the wisest’ solution.

Thanks for the immensely helpful information.
iconv is not present in all configurations. We can use it if it's there, but we can't depend entirely on it.
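
One way to act on that constraint is to try each option from the list above in turn and fall back to the pure-regex check when neither extension is loaded. The following is only a rough sketch: the function names `is_valid_utf8()` and `utf8_valid_regex()` are made up for illustration, and the order of preference (iconv, then mbstring, then the regex) is an assumption rather than anything decided in this thread. The regex is the W3C pattern quoted earlier, ported to PCRE syntax.

```php
<?php
// Hypothetical helper: pure-PCRE check using the W3C pattern quoted in this
// thread. Note that it also rejects control characters other than tab, LF
// and CR, which the extension-based checks below would accept.
function utf8_valid_regex(string $string): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x';

    // preg_match() returns 1 on a match, 0 on no match and false on error
    // (for example when a PCRE limit is hit on a very long string); anything
    // other than 1 is treated as "not valid" here.
    return preg_match($pattern, $string) === 1;
}

// Hypothetical dispatcher: use whichever mechanism is available.
function is_valid_utf8(string $string): bool
{
    // iconv round trip: the conversion fails or alters the string when it
    // meets an illegal byte sequence (the @ suppresses the notice).
    if (function_exists('iconv')) {
        return @iconv('UTF-8', 'UTF-8', $string) === $string;
    }

    // mbstring, when that extension is loaded instead.
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($string, 'UTF-8');
    }

    // Neither extension is present: fall back to the regex.
    return utf8_valid_regex($string);
}
```

Calling `is_valid_utf8($input)` on each incoming string would then give a single yes/no answer regardless of which extensions the host has. The three branches are not guaranteed to agree on every edge case (control characters being the obvious one), so which behaviour is wanted still needs deciding.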
Adding this here; it seems to contain material that needs to be read, but I haven’t gotten around to it yet: UTR #36: Unicode Security Considerations.
In addition to #175, we should validate all input strings to ensure that: