Closed bartlomiejb closed 6 hours ago
Interesting observation.
I see that the PHP docs now contain a warning on the filter page, and the same, as a note, on the runtime config page: https://www.php.net/manual/en/filter.configuration.php. Those ini settings can not be set at runtime, so useless to fix this issue in the code.
I assume you've seen comment https://www.php.net/manual/en/filter.filters.sanitize.php#129098 too?
<?php
error_reporting(E_ALL);
function filter_string_polyfill(string $string): string
{
$str = preg_replace('/\x00|<[^>]*>?/', '', $string);
return str_replace(["'", '"'], [''', '"'], $str);
}
$string = "óÓ";
echo filter_var($string,FILTER_SANITIZE_STRING).PHP_EOL;
echo htmlspecialchars($string).PHP_EOL;
echo strip_tags($string).PHP_EOL;
echo htmlspecialchars(strip_tags($string,ENT_QUOTES)).PHP_EOL;
echo filter_string_polyfill($string).PHP_EOL;
Thanks for a quick reply!
Yeah, I saw the comment with this proposed polyfill and I think it is a good one and we can use it instead of the current solution in Security::strip_tags(). Should I prepare a patch or could you do it, please? It would be nice to have it fixed on "1.9/develop". One note however - it is NOT an exact replacement! But I guess it is close enough.
Here is another example piece of code that shows difference (thankfully - no problems with ó/Ó anymore :)):
<?php
$ss = [
"óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g",
"only open <xxx",
"AAA<BBB>>CCC",
];
foreach ($ss as $s)
{
echo "input:\t\t\t$s\n";
echo str_repeat('-', 8*3 + strlen($s)) . "\n";
echo "FILTER_SANITIZE_STRING:\t" . filter_var($s, FILTER_SANITIZE_STRING) . "\n";
echo "strip_tags:\t\t" . strip_tags($s) . "\n";
echo "preg_replace:\t\t" . preg_replace('/\x00|<[^>]*>?/', '', $s) . "\n";
echo "\n";
}
The output is:
input: óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g
--------------------------------------------------------------------------
FILTER_SANITIZE_STRING: óÓ | TG | Abc Def | Ghi
strip_tags: óÓ | TG | Abc < tag > Def | Ghi < Jkl | tg
preg_replace: óÓ | TG | Abc Def | Ghi g
input: only open <xxx
--------------------------------------
FILTER_SANITIZE_STRING: only open
strip_tags: only open
preg_replace: only open
input: AAA<BBB>>CCC
------------------------------------
FILTER_SANITIZE_STRING: AAA>CCC
strip_tags: AAA>CCC
preg_replace: AAA>CCC
Note the difference in first example with FILTER_SANITIZE_STRING
and preg_replace
.
It is all really interesting indeed... I mean, what a mess this is ;) - I have made a little bit of digging in source code of PHP, here are some pointers, if you are interested: first I will just point to / cite some snippets from the implementation of various filters/functions with links to the source code of PHP 8.3.9, and then I will share some conclusions:
ext/filter/filter.c
{ "special_chars", FILTER_SANITIZE_SPECIAL_CHARS, php_filter_special_chars },
https://github.com/php/php-src/blob/PHP-8.3.9/ext/filter/sanitizing_filters.c#L222
void php_filter_special_chars(PHP_INPUT_FILTER_PARAM_DECL)
php_filter_strip(value, flags);
/* encodes ' " < > & \0 to numerical entities */
enc['\''] = enc['"'] = enc['<'] = enc['>'] = enc['&'] = enc[0] = 1;
/* if strip low is not set, then we encode them as &#xx; */
memset(enc, 1, 32);
if (flags & FILTER_FLAG_ENCODE_HIGH) {
memset(enc + 127, 1, sizeof(enc) - 127);
}
php_filter_encode_html(value, enc);
ext/filter/filter.c
{ "full_special_chars", FILTER_SANITIZE_FULL_SPECIAL_CHARS, php_filter_full_special_chars },
https://github.com/php/php-src/blob/PHP-8.3.9/ext/filter/sanitizing_filters.c#L243
void php_filter_full_special_chars(PHP_INPUT_FILTER_PARAM_DECL)
if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
quotes = ENT_QUOTES;
} else {
quotes = ENT_NOQUOTES;
}
buf = php_escape_html_entities_ex(
(unsigned char *) Z_STRVAL_P(value), Z_STRLEN_P(value), /* all */ 1, quotes,
/* charset_hint */ NULL, /* double_encode */ 0, /* quiet */ 0);
https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/html.c#L1103
PHPAPI zend_string *php_escape_html_entities_ex(const unsigned char *old, size_t oldlen, int all, int flags, const char *hint_charset, bool double_encode, bool quiet)
.. lot of code, including some substitutions based on some entity tables... ..
htmlspecialchars():
https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/html.c#L1345
PHP_FUNCTION(htmlspecialchars)
{
php_html_entities(INTERNAL_FUNCTION_PARAM_PASSTHRU, 0);
}
https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/html.c#L1316
static void php_html_entities(INTERNAL_FUNCTION_PARAMETERS, int all)
..
bool double_encode = 1;
..
replaced = php_escape_html_entities_ex(
(unsigned char*)ZSTR_VAL(str), ZSTR_LEN(str), all, (int) flags,
hint_charset ? ZSTR_VAL(hint_charset) : NULL, double_encode, /* quiet */ 0);
ext/filter/filter.c
{ "string", FILTER_SANITIZE_STRING, php_filter_string },
{ "stripped", FILTER_SANITIZE_STRING, php_filter_string },
https://github.com/php/php-src/blob/PHP-8.3.9/ext/filter/sanitizing_filters.c#L168
void php_filter_string(PHP_INPUT_FILTER_PARAM_DECL)
...
/* strip high/strip low ( see flags )*/
php_filter_strip(value, flags);
if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
enc['\''] = enc['"'] = 1;
}
if (flags & FILTER_FLAG_ENCODE_AMP) {
enc['&'] = 1;
}
if (flags & FILTER_FLAG_ENCODE_LOW) {
memset(enc, 1, 32);
}
if (flags & FILTER_FLAG_ENCODE_HIGH) {
memset(enc + 127, 1, sizeof(enc) - 127);
}
php_filter_encode_html(value, enc);
new_len = php_strip_tags_ex(Z_STRVAL_P(value), Z_STRLEN_P(value), NULL, 0, 1);
https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/string.c#L4799
/* [[state machine...]] */
PHPAPI size_t php_strip_tags_ex(char *rbuf, size_t len, const char *allow, size_t allow_len, bool allow_tag_spaces)
.. a lot of code removing <..> etc. ..
strip_tags():
https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/string.c#L4522
PHP_FUNCTION(strip_tags)
...
ZSTR_LEN(buf) = php_strip_tags_ex(ZSTR_VAL(buf), ZSTR_LEN(str), allowed_tags, allowed_tags_len, 0);
...
Few conclusions:
FILTER_SANITIZE_FULL_SPECIAL_CHARS
and FILTER_SANITIZE_SPECIAL_CHARS
is quite different, despite similar names: FULL version ends up in php_escape_html_entities_ex()
which is much more complicated than the version without FULLFILTER_SANITIZE_FULL_SPECIAL_CHARS
indeed runs, more or less, the same as htmlspecialchars()
- however note that FILTER_SANITIZE_FULL_SPECIAL_CHARS
calls php_escape_html_entities_ex() with the (PHP internal) parameter all
= 1
, and htmlspecialchars()
with all
= 0
! I guess that is the reason why the first one replaces ó with oacute, and the second one notFILTER_SANITIZE_STRING
and with strip_tags()
: both ends up calling php_strip_tags_ex()
, but with the different last parameter, called allow_tag_spaces
! I guess that is the reason why they behave differently...Or, better yet, let's use strip_tags()
+ preg_replace()
to replace FILTER_SANITIZE_STRING
: it gives the best results in my tests. Here is an updated example code and its output (I ommited implementing replacing quotes only for brevity, it should be in a final version in Fuel, of course):
<?php
$ss = [
"óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g",
"only open <xxx",
"AAA<BBB>>CCC",
"Some \"' <bizzare> string & to Sanitize < !$@%",
];
foreach ($ss as $s)
{
echo "input:\t\t\t\t$s\n";
echo str_repeat('-', 8*4 + strlen($s)) . "\n";
echo "FILTER_SANITIZE_STRING:\t\t" . filter_var($s, FILTER_SANITIZE_STRING) . "\n";
echo "strip_tags:\t\t\t" . strip_tags($s) . "\n";
echo "preg_replace:\t\t\t" . preg_replace('/\x00|<[^>]*>?/', '', $s) . "\n";
echo "strip_tags + preg_replace:\t" . preg_replace('/\x00|<[^>]*>?/', '', strip_tags($s)) . "\n";
echo "\n";
}
Output:
input: óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g
----------------------------------------------------------------------------------
FILTER_SANITIZE_STRING: óÓ | TG | Abc Def | Ghi
strip_tags: óÓ | TG | Abc < tag > Def | Ghi < Jkl | tg
preg_replace: óÓ | TG | Abc Def | Ghi g
strip_tags + preg_replace: óÓ | TG | Abc Def | Ghi
input: only open <xxx
----------------------------------------------
FILTER_SANITIZE_STRING: only open
strip_tags: only open
preg_replace: only open
strip_tags + preg_replace: only open
input: AAA<BBB>>CCC
--------------------------------------------
FILTER_SANITIZE_STRING: AAA>CCC
strip_tags: AAA>CCC
preg_replace: AAA>CCC
strip_tags + preg_replace: AAA>CCC
input: Some "' <bizzare> string & to Sanitize < !$@%
-----------------------------------------------------------------------------
FILTER_SANITIZE_STRING: Some "' string & to Sanitize
strip_tags: Some "' string & to Sanitize < !$@%
preg_replace: Some "' string & to Sanitize
strip_tags + preg_replace: Some "' string & to Sanitize
If you can create a pull request for it? You deserve the credits for this :grin: .
Security::strip_tags()
breaks Polish (and Czech and few other languages that uses "o" with acute accent, see eg. https://en.wikipedia.org/wiki/%C3%93) texts with the characters ó/Ó. The bug is related with this change: https://github.com/fuel/core/commit/9a3ee9de7a44a38a86d91f8f433bc32bc877bed1FILTER_SANITIZE_STRING
is deprecated in PHP 8, so it has been replaced withFILTER_SANITIZE_FULL_SPECIAL_CHARS
+strip_tags()
. The problem is that - for some reason - filter_var withFILTER_SANITIZE_FULL_SPECIAL_CHARS
flag replaces "ó" with "ó
" and "Ó" with "Ó
" which is undesirable in a method named "strip...something". It doesn't do this with other Polish characters (ĘĄŚŁŻŹĆŃęąśłżźćń
). Here is a test program that shows the problem (pure PHP - national characters should be encoded in UTF-8 in source code and default_charset set to "UTF-8"):The output of the program:
As you can see it is not true that
FILTER_SANITIZE_FULL_SPECIAL_CHARS
is the same as "htmlspecialchars()
withENT_QUOTES
set" as PHP documentation claims.So my proposal is to replace the
filter_var
flagFILTER_SANITIZE_FULL_SPECIAL_CHARS
withFILTER_SANITIZE_SPECIAL_CHARS
inSecurity::strip_tags()
.