fuel / core

Fuel PHP Framework - The core of the Fuel v1 framework
http://fuelphp.com
802 stars 336 forks source link

Security::strip_tags() breaks Polish (Czech, ...) texts with ó/Ó #2211

Closed bartlomiejb closed 6 hours ago

bartlomiejb commented 6 days ago

Security::strip_tags() breaks Polish (and Czech and few other languages that uses "o" with acute accent, see eg. https://en.wikipedia.org/wiki/%C3%93) texts with the characters ó/Ó. The bug is related with this change: https://github.com/fuel/core/commit/9a3ee9de7a44a38a86d91f8f433bc32bc877bed1 FILTER_SANITIZE_STRING is deprecated in PHP 8, so it has been replaced with FILTER_SANITIZE_FULL_SPECIAL_CHARS + strip_tags(). The problem is that - for some reason - filter_var with FILTER_SANITIZE_FULL_SPECIAL_CHARS flag replaces "ó" with "ó" and "Ó" with "Ó" which is undesirable in a method named "strip...something". It doesn't do this with other Polish characters (ĘĄŚŁŻŹĆŃęąśłżźćń). Here is a test program that shows the problem (pure PHP - national characters should be encoded in UTF-8 in source code and default_charset set to "UTF-8"):

<?php

/*
Fuel - Security::strip_tags():
OLD:
$value = filter_var($value, FILTER_SANITIZE_STRING);

NEW:
$value = filter_var(strip_tags($value), FILTER_SANITIZE_FULL_SPECIAL_CHARS);
*/

error_reporting(E_ALL);

echo 'FILTER_SANITIZE_STRING' . "\n";
echo 'o with acute: ' . filter_var('ó', FILTER_SANITIZE_STRING) . "\n";
echo 'O with acute: ' . filter_var('Ó', FILTER_SANITIZE_STRING) . "\n";
echo 'other Polish characters + "<": ' . filter_var('ĘĄŚŁŻŹĆŃęąśłżźćń<', FILTER_SANITIZE_STRING) . "\n\n";

echo 'FILTER_SANITIZE_FULL_SPECIAL_CHARS' . "\n";
echo 'o with acute: ' . filter_var('ó', FILTER_SANITIZE_FULL_SPECIAL_CHARS) . "\n";
echo 'O with acute: ' . filter_var('Ó', FILTER_SANITIZE_FULL_SPECIAL_CHARS) . "\n";
echo 'other Polish characters + "<": ' . filter_var('ĘĄŚŁŻŹĆŃęąśłżźćń<', FILTER_SANITIZE_FULL_SPECIAL_CHARS) . "\n\n";

echo 'FILTER_SANITIZE_SPECIAL_CHARS' . "\n";
echo 'o with acute: ' . filter_var('ó', FILTER_SANITIZE_SPECIAL_CHARS) . "\n";
echo 'O with acute: ' . filter_var('Ó', FILTER_SANITIZE_SPECIAL_CHARS) . "\n";
echo 'other Polish characters + "<": ' . filter_var('ĘĄŚŁŻŹĆŃęąśłżźćń<', FILTER_SANITIZE_SPECIAL_CHARS) . "\n\n";

/*
From documentation for FILTER_SANITIZE_FULL_SPECIAL_CHARS: (https://www.php.net/manual/en/filter.filters.sanitize.php):
"Equivalent to calling htmlspecialchars() with ENT_QUOTES set"
*/
echo 'htmlspecialchars() with ENT_QUOTES' . "\n";
echo 'o with acute: ' . htmlspecialchars('ó', ENT_QUOTES) . "\n";
echo 'O with acute: ' . htmlspecialchars('Ó', ENT_QUOTES) . "\n";
echo 'other Polish characters + "<": ' . htmlspecialchars('ĘĄŚŁŻŹĆŃęąśłżźćń<', ENT_QUOTES) . "\n\n";

echo 'htmlspecialchars() without flags' . "\n";
echo 'o with acute: ' . htmlspecialchars('ó') . "\n";
echo 'O with acute: ' . htmlspecialchars('Ó') . "\n";
echo 'other Polish characters + "<": ' . htmlspecialchars('ĘĄŚŁŻŹĆŃęąśłżźćń<') . "\n\n";

echo 'strip_tags()' . "\n";
echo 'o with acute: ' . strip_tags('ó') . "\n";
echo 'O with acute: ' . strip_tags('Ó') . "\n";
echo 'other Polish characters + "<": ' . strip_tags('ĘĄŚŁŻŹĆŃęąśłżźćń<') . "\n\n";

echo 'default_charset: ';
echo ini_get('default_charset') . "\n";

The output of the program:

FILTER_SANITIZE_STRING
PHP Deprecated:  Constant FILTER_SANITIZE_STRING is deprecated in /tmp/test_filter_flags.php on line 15

Deprecated: Constant FILTER_SANITIZE_STRING is deprecated in /tmp/test_filter_flags.php on line 15
o with acute: ó
PHP Deprecated:  Constant FILTER_SANITIZE_STRING is deprecated in /tmp/test_filter_flags.php on line 16

Deprecated: Constant FILTER_SANITIZE_STRING is deprecated in /tmp/test_filter_flags.php on line 16
O with acute: Ó
PHP Deprecated:  Constant FILTER_SANITIZE_STRING is deprecated in /tmp/test_filter_flags.php on line 17

Deprecated: Constant FILTER_SANITIZE_STRING is deprecated in /tmp/test_filter_flags.php on line 17
other Polish characters + "<": ĘĄŚŁŻŹĆŃęąśłżźćń

FILTER_SANITIZE_FULL_SPECIAL_CHARS
o with acute: &oacute;
O with acute: &Oacute;
other Polish characters + "<": ĘĄŚŁŻŹĆŃęąśłżźćń&lt;

FILTER_SANITIZE_SPECIAL_CHARS
o with acute: ó
O with acute: Ó
other Polish characters + "<": ĘĄŚŁŻŹĆŃęąśłżźćń&#60;

htmlspecialchars() with ENT_QUOTES
o with acute: ó
O with acute: Ó
other Polish characters + "<": ĘĄŚŁŻŹĆŃęąśłżźćń&lt;

htmlspecialchars() without flags
o with acute: ó
O with acute: Ó
other Polish characters + "<": ĘĄŚŁŻŹĆŃęąśłżźćń&lt;

strip_tags()
o with acute: ó
O with acute: Ó
other Polish characters + "<": ĘĄŚŁŻŹĆŃęąśłżźćń

default_charset: UTF-8

As you can see it is not true that FILTER_SANITIZE_FULL_SPECIAL_CHARS is the same as "htmlspecialchars() with ENT_QUOTES set" as PHP documentation claims.

So my proposal is to replace the filter_var flag FILTER_SANITIZE_FULL_SPECIAL_CHARS with FILTER_SANITIZE_SPECIAL_CHARS in Security::strip_tags().

WanWizard commented 5 days ago

Interesting observation.

I see that the PHP docs now contain a warning on the filter page, and the same, as a note, on the runtime config page: https://www.php.net/manual/en/filter.configuration.php. Those ini settings can not be set at runtime, so useless to fix this issue in the code.

I assume you've seen comment https://www.php.net/manual/en/filter.filters.sanitize.php#129098 too?

<?php
error_reporting(E_ALL);

function filter_string_polyfill(string $string): string
{
    $str = preg_replace('/\x00|<[^>]*>?/', '', $string);
    return str_replace(["'", '"'], ['&#39;', '&#34;'], $str);
}

$string = "óÓ";

echo filter_var($string,FILTER_SANITIZE_STRING).PHP_EOL;

echo htmlspecialchars($string).PHP_EOL;

echo strip_tags($string).PHP_EOL;

echo htmlspecialchars(strip_tags($string,ENT_QUOTES)).PHP_EOL;

echo filter_string_polyfill($string).PHP_EOL;
bartlomiejb commented 2 days ago

Thanks for a quick reply!

Yeah, I saw the comment with this proposed polyfill and I think it is a good one and we can use it instead of the current solution in Security::strip_tags(). Should I prepare a patch or could you do it, please? It would be nice to have it fixed on "1.9/develop". One note however - it is NOT an exact replacement! But I guess it is close enough.

Here is another example piece of code that shows difference (thankfully - no problems with ó/Ó anymore :)):

<?php

$ss = [
        "óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g",
        "only open <xxx",
        "AAA<BBB>>CCC",
];
foreach ($ss as $s)
{
        echo "input:\t\t\t$s\n";
        echo str_repeat('-', 8*3 + strlen($s)) . "\n";
        echo "FILTER_SANITIZE_STRING:\t" . filter_var($s, FILTER_SANITIZE_STRING) . "\n";
        echo "strip_tags:\t\t" . strip_tags($s) . "\n";
        echo "preg_replace:\t\t" . preg_replace('/\x00|<[^>]*>?/', '', $s) . "\n";
        echo "\n";
}

The output is:

input:                  óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g
--------------------------------------------------------------------------
FILTER_SANITIZE_STRING: óÓ | TG | Abc  Def | Ghi
strip_tags:             óÓ | TG | Abc < tag > Def | Ghi < Jkl | tg
preg_replace:           óÓ | TG | Abc  Def | Ghi g

input:                  only open <xxx
--------------------------------------
FILTER_SANITIZE_STRING: only open
strip_tags:             only open
preg_replace:           only open

input:                  AAA<BBB>>CCC
------------------------------------
FILTER_SANITIZE_STRING: AAA>CCC
strip_tags:             AAA>CCC
preg_replace:           AAA>CCC

Note the difference in first example with FILTER_SANITIZE_STRING and preg_replace.

It is all really interesting indeed... I mean, what a mess this is ;) - I have made a little bit of digging in source code of PHP, here are some pointers, if you are interested: first I will just point to / cite some snippets from the implementation of various filters/functions with links to the source code of PHP 8.3.9, and then I will share some conclusions:

ext/filter/filter.c
     { "special_chars",   FILTER_SANITIZE_SPECIAL_CHARS, php_filter_special_chars   },

https://github.com/php/php-src/blob/PHP-8.3.9/ext/filter/sanitizing_filters.c#L222

void php_filter_special_chars(PHP_INPUT_FILTER_PARAM_DECL)
    php_filter_strip(value, flags);

    /* encodes ' " < > & \0 to numerical entities */
    enc['\''] = enc['"'] = enc['<'] = enc['>'] = enc['&'] = enc[0] = 1;

    /* if strip low is not set, then we encode them as &#xx; */
    memset(enc, 1, 32);

    if (flags & FILTER_FLAG_ENCODE_HIGH) {
        memset(enc + 127, 1, sizeof(enc) - 127);
    }

    php_filter_encode_html(value, enc);
ext/filter/filter.c
     { "full_special_chars",   FILTER_SANITIZE_FULL_SPECIAL_CHARS, php_filter_full_special_chars   },

https://github.com/php/php-src/blob/PHP-8.3.9/ext/filter/sanitizing_filters.c#L243

void php_filter_full_special_chars(PHP_INPUT_FILTER_PARAM_DECL)
    if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
        quotes = ENT_QUOTES;
    } else {
        quotes = ENT_NOQUOTES;
    }
    buf = php_escape_html_entities_ex(
        (unsigned char *) Z_STRVAL_P(value), Z_STRLEN_P(value), /* all */ 1, quotes,
        /* charset_hint */ NULL, /* double_encode */ 0, /* quiet */ 0);

https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/html.c#L1103

PHPAPI zend_string *php_escape_html_entities_ex(const unsigned char *old, size_t oldlen, int all, int flags, const char *hint_charset, bool double_encode, bool quiet)
    .. lot of code, including some substitutions based on some entity tables... ..

htmlspecialchars(): https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/html.c#L1345

PHP_FUNCTION(htmlspecialchars)
{
    php_html_entities(INTERNAL_FUNCTION_PARAM_PASSTHRU, 0);
}

https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/html.c#L1316

static void php_html_entities(INTERNAL_FUNCTION_PARAMETERS, int all)
    ..
    bool double_encode = 1;
    ..
    replaced = php_escape_html_entities_ex(
        (unsigned char*)ZSTR_VAL(str), ZSTR_LEN(str), all, (int) flags,
        hint_charset ? ZSTR_VAL(hint_charset) : NULL, double_encode, /* quiet */ 0);
ext/filter/filter.c
     { "string",          FILTER_SANITIZE_STRING,        php_filter_string          },
     { "stripped",        FILTER_SANITIZE_STRING,        php_filter_string          },

https://github.com/php/php-src/blob/PHP-8.3.9/ext/filter/sanitizing_filters.c#L168

void php_filter_string(PHP_INPUT_FILTER_PARAM_DECL)
...
    /* strip high/strip low ( see flags )*/
    php_filter_strip(value, flags);

    if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
        enc['\''] = enc['"'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_AMP) {
        enc['&'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_LOW) {
        memset(enc, 1, 32);
    }
    if (flags & FILTER_FLAG_ENCODE_HIGH) {
        memset(enc + 127, 1, sizeof(enc) - 127);
    }

    php_filter_encode_html(value, enc);

    new_len = php_strip_tags_ex(Z_STRVAL_P(value), Z_STRLEN_P(value), NULL, 0, 1);

https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/string.c#L4799

/* [[state machine...]] */
PHPAPI size_t php_strip_tags_ex(char *rbuf, size_t len, const char *allow, size_t allow_len, bool allow_tag_spaces)
    .. a lot of code removing <..> etc. ..

strip_tags(): https://github.com/php/php-src/blob/PHP-8.3.9/ext/standard/string.c#L4522

PHP_FUNCTION(strip_tags)
...
    ZSTR_LEN(buf) = php_strip_tags_ex(ZSTR_VAL(buf), ZSTR_LEN(str), allowed_tags, allowed_tags_len, 0);
...

Few conclusions:

bartlomiejb commented 2 days ago

Or, better yet, let's use strip_tags() + preg_replace() to replace FILTER_SANITIZE_STRING: it gives the best results in my tests. Here is an updated example code and its output (I ommited implementing replacing quotes only for brevity, it should be in a final version in Fuel, of course):

<?php

$ss = [
        "óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g",
        "only open <xxx",
        "AAA<BBB>>CCC",
        "Some \"' <bizzare> string & to Sanitize < !$@%",
];
foreach ($ss as $s)
{
        echo "input:\t\t\t\t$s\n";
        echo str_repeat('-', 8*4 + strlen($s)) . "\n";
        echo "FILTER_SANITIZE_STRING:\t\t" . filter_var($s, FILTER_SANITIZE_STRING) . "\n";
        echo "strip_tags:\t\t\t" . strip_tags($s) . "\n";
        echo "preg_replace:\t\t\t" . preg_replace('/\x00|<[^>]*>?/', '', $s) . "\n";
        echo "strip_tags + preg_replace:\t" . preg_replace('/\x00|<[^>]*>?/', '', strip_tags($s)) . "\n";
        echo "\n";
}

Output:

input:                          óÓ | T<A>G | Abc < tag > Def | Ghi < Jkl | t<a>g
----------------------------------------------------------------------------------
FILTER_SANITIZE_STRING:         óÓ | TG | Abc  Def | Ghi
strip_tags:                     óÓ | TG | Abc < tag > Def | Ghi < Jkl | tg
preg_replace:                   óÓ | TG | Abc  Def | Ghi g
strip_tags + preg_replace:      óÓ | TG | Abc  Def | Ghi

input:                          only open <xxx
----------------------------------------------
FILTER_SANITIZE_STRING:         only open
strip_tags:                     only open
preg_replace:                   only open
strip_tags + preg_replace:      only open

input:                          AAA<BBB>>CCC
--------------------------------------------
FILTER_SANITIZE_STRING:         AAA>CCC
strip_tags:                     AAA>CCC
preg_replace:                   AAA>CCC
strip_tags + preg_replace:      AAA>CCC

input:                          Some "' <bizzare> string & to Sanitize < !$@%
-----------------------------------------------------------------------------
FILTER_SANITIZE_STRING:         Some &#34;&#39;  string & to Sanitize
strip_tags:                     Some "'  string & to Sanitize < !$@%
preg_replace:                   Some "'  string & to Sanitize
strip_tags + preg_replace:      Some "'  string & to Sanitize
WanWizard commented 2 days ago

If you can create a pull request for it? You deserve the credits for this :grin: .