auraphp / Aura.Filter

Validate and sanitize arrays and objects.
MIT License

Multibyte strings, various problems with the lack thereof #87

Closed mivanov93 closed 9 years ago

mivanov93 commented 9 years ago

I was looking at the code and found that the non-mb string functions are currently being used, for example:

Aura\Filter\Rule\Sanitize\Strlen
if (strlen($value) < $len) {
     $subject->$field = str_pad($value, $len, $pad_string, $pad_type);
}
if (strlen($value) > $len) {
     $subject->$field = substr($value, 0, $len);
}

Here we have strlen, substr, and str_pad, which are not mb-safe. This will result in various problems:

For example, take the length of a Cyrillic string: strlen("тесттест") = 16 instead of 8, so our validation will fail if we ask for strings of fewer than 15 characters.

An even bigger issue arises with substr, which may cut a Unicode character in half (offtopic: slashing symbols sounds funny though), leaving us with a stray byte in place of the character.
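
A quick illustration of both problems (a sketch, assuming the input is UTF-8):

<?php
$value = "тесттест";                        // 8 Cyrillic characters, 16 bytes in UTF-8

var_dump(strlen($value));                   // int(16): counts bytes
var_dump(mb_strlen($value, 'UTF-8'));       // int(8): counts characters

var_dump(substr($value, 0, 3));             // "т" plus a stray lead byte of the next character
var_dump(mb_substr($value, 0, 3, 'UTF-8')); // "тес"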

Shouldn't we instead switch to the mb functions?

mivanov93 commented 9 years ago

Just made some tests and pushed them to my fork repo. So should we add new filters/sanitizers for mb strings, or just change the old ones? E.g. keeping Strlen vs. adding MbStrlen.

pmjones commented 9 years ago

I think multibyte support would be good to have. It should probably be supported by default in the existing filters, instead of adding new ones. Questions:

  1. How much attention do we need to pay to mb_internal_encoding() ?
  2. Would it be wise to support a fallback to plain old strlen() etc in case mbstring is not installed?

Thanks for bringing this up.

pmjones commented 9 years ago

I have created a new branch mbstring to work on this. If you can base your PR against that new branch, and modify that PR to PSR-2 standards (e.g., brace on the next line for conditionals) and remove the whitespace changes, I can merge it for testing the feature.

pmjones commented 9 years ago

@mivanov93 ^^

pmjones commented 9 years ago

@mivanov93 I became impatient and built a PR from your branch on my own. There were a couple problems, or perhaps non-problems, in the alpha/alnum/etc filters; they are locale-dependent, so they should work as they are in 2.x if you set your locale properly. You can see the ongoing work in the mbstring branch, and in PR #89 .

mivanov93 commented 9 years ago

Hello, as I see it we have two main problems: first, the mbstring support; second, the regexes.

Regarding the mb support: mb_internal_encoding(), the way I see it, sets the default encoding. The problem is that if we don't explicitly set the encoding on every call, someone might accidentally set the default to something else and all the mb functions will then treat every string as if it were encoded that way.
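
A minimal sketch of that failure mode:

<?php
mb_internal_encoding('ISO-8859-1');    // some unrelated code changes the global default

$value = "тест";                       // UTF-8 bytes
var_dump(mb_strlen($value));           // int(8): the bytes are counted as ISO-8859-1 characters
var_dump(mb_strlen($value, 'UTF-8'));  // int(4): an explicit encoding ignores the global default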

Another thing of interest is mbstring.func_overload = 7, which replaces all the string functions with their multibyte counterparts, but it seems to be considered bad practice, so it's rarely used, if at all.

Now about the regexes: I found that instead of simple patterns we need giant regexes like [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]] instead of simply \w. Perhaps we should define some class to map the ASCII regex classes to their Unicode counterparts so we don't have to repeat such monstrosities throughout the code.

Another thing that confuses me is that there are mb_ regex functions, yet the normal preg functions seem to work as well: https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php https://stackoverflow.com/questions/1766485/are-the-php-preg-functions-multibyte-safe/ But we should still use the u flag or so. From the PCRE docs: "The current implementation of PCRE corresponds approximately with Perl 5.12, including support for UTF-8/16/32 encoded strings and Unicode general category properties. However, UTF-8/16/32 and Unicode support has to be explicitly enabled; it is not the default." So it seems to be version/config dependent as well.
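
The /u flag makes a visible difference even for something as simple as counting characters with the dot metacharacter (a small sketch):

<?php
$value = "тест";                           // 4 characters, 8 bytes in UTF-8

var_dump(preg_match_all('/./', $value));   // int(8): each byte is treated as a character
var_dump(preg_match_all('/./u', $value));  // int(4): the subject is treated as UTF-8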

A fallback to strlen seems hard, since different alphabets define different byte combinations as characters, hence the length will vary. But I guess we could introduce some polyfill functions.

As for the new branch, I will start working on it asap.

mivanov93 commented 9 years ago

About the locale-dependent functions: where are we supposed to set the locale? On a host that serves random websites, the value in php.ini may be completely irrelevant. Since the locale is a global setting, are you referring to https://secure.php.net/manual/en/function.setlocale.php when you talk about setting it?

In Aura I've only found a validator that checks whether a string is a valid value for setting the locale. But there is no per-filter locale setting or anything like that.

And isn't it bad to depend on an implicitly set locale? (A random piece of code unrelated to what we are doing might change it, and the warning on php.net about the locale being per-process seems scary too.)

Shouldn't we instead introduce an explicit way of selecting the locale per filter?

And also provide custom locale-dependent filters/sanitizers that can work with multiple locales for this kind of thing?

Take words, for example. I know that some languages may include symbols like the ones you've mentioned, e.g. ' or `, so it's hard to define what constitutes alphanumeric, alpha, words, etc. But perhaps our best bet would be to define a way to select the languages we are interested in.

For example, I'd like to validate words in Cyrillic (Bulgarian) and Latin (English), but I might easily add Spanish and other languages to that list if needed, so choosing just one locale is problematic.

pmjones commented 9 years ago

where are we supposed to set the locale?

I think that's the core question. How "expensive" is setlocale()? Perhaps we need a way for the SubjectFilter to call setlocale() before applying the filter, then setting the locale back to its original value when done. That would cover all the filters being applied to the subject. ValueFilter becomes a little less efficient: since values are filtered one-at-a-time, that means one set & reset for each value, but that might be a fair tradeoff.
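
Something like this, perhaps (a minimal sketch of the set-and-restore idea; the applyWithLocale() helper is hypothetical, not part of the library):

<?php
function applyWithLocale($locale, callable $apply)
{
    // setlocale(LC_CTYPE, '0') reads the current locale without changing it
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, $locale);
    try {
        return $apply();
    } finally {
        setlocale(LC_CTYPE, $previous);
    }
}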

Thoughts?

mivanov93 commented 9 years ago

Hm, but how exactly does the locale help us? We have

ctype_alnum
ctype_alpha
$subject->$field = preg_replace('/[^a-z0-9]/i', '', $subject->$field);
$subject->$field = preg_replace('/[^a-z]/i', '', $subject->$field);
$subject->$field = preg_replace('/\W/', '', $value);

From what I can see, only $subject->$field = preg_replace('/\W/', '', $value); is affected by the locale (since the PCRE docs say so). The rest is 100% not, as I've read that ctype_alnum and ctype_alpha recognize only Latin characters.

FIX, for anyone's reference: they probably are locale-dependent, even if it isn't explicitly documented.

So [^a-z] and [^a-z0-9] are not affected.
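
A quick check of the ctype behavior (a sketch; the result for non-ASCII bytes depends on the current LC_CTYPE):

<?php
setlocale(LC_CTYPE, 'C');
var_dump(ctype_alpha('test'));   // bool(true)
var_dump(ctype_alpha('тест'));   // bool(false): the UTF-8 bytes are not "alpha" in the C locale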

Shouldn't we instead make new functions that build a regex based on an array of locales? For example:

$this->filter->sanitize('foo')->to('word')->encoding('UTF-8')->locale(array('Fr', 'Us', 'Es'))->asSoftRule();

The main point is that we should distinguish between locale and encoding: one is about linguistic rules, the other is about the representation of symbols in bytes.

Now let's talk about something else. From what I've read, PHP is basically encoding-agnostic, that is, it looks at strings as byte sequences. But when filtering and operating on those strings, we need to know how they are encoded. So how should we go about it? For example, say we have a MySQL database using some random encoding X. We get strings encoded in X, but when validating them, how do we know that they are encoded as X? The mb_ functions use the default system encoding (set in php.ini etc., UTF-8 most of the time). So if we don't tell the mb functions otherwise, they will assume we are giving them UTF-8-encoded strings and not X-encoded strings. We could depend on mb_internal_encoding(), but what if we have to operate on two data sets, one encoded in X and the other in Y? And yes, it is 100% important, because string length, for example, varies even between UTF-8, UTF-16, and UTF-32.
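
For example, the same text read under the wrong encoding reports a different length (sketch):

<?php
$utf8  = 'тест';                                          // 4 characters, 8 bytes
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8'); // the same text, different byte layout

var_dump(mb_strlen($utf16, 'UTF-16LE'));   // int(4): correct when told the real encoding
var_dump(mb_strlen($utf16, 'UTF-8'));      // int(8): the bytes are misread as UTF-8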

So we should allow setting the encoding per string filter. Example:

$this->filter->validate('foo')->is('strlenMin')->encoding('UTF-8')->asSoftRule();

So to get things to work correctly, I think we also need to allow setting the encoding and locale per subject.

By the way, I've just pushed some code to the mbstring branch in my forked repo.

pmjones commented 9 years ago

I've read that ctype_alnum and ctype_alpha recognize only the latin characters

Per the docs on php.net http://php.net/manual/en/function.setlocale.php (search for LC_CTYPE), setting the locale does affect the ctype_* functions.

pmjones commented 9 years ago

Also, let's go with one topic at a time; reading and responding to multiple topics in a single post is a bit too much for me to deal with.

mivanov93 commented 9 years ago

Sure. For ctype it's a bit implicit, since it's nowhere stated exactly, but I agree it's probably the case. But even so, what about the regexes that replace using [a-z]?

I will check on the speed penalty of locales.

But now I am thinking about whether we should use them at all, since not all servers have all locales, so it may be better to instead emulate them as I've suggested.

pmjones commented 9 years ago

So we're covered for ctype, then, if we set the locale (and back again).

Next: the regexes. Perhaps the [a-z] etc. can be replaced by [:alpha:] and [:alnum:], per http://php.net/manual/en/regexp.reference.character-classes.php ?

mivanov93 commented 9 years ago

Seems so, but we could also use \p{L}, as described here: http://www.regular-expressions.info/unicode.html I wonder if there's any other difference between [:alpha:] and \p{L}. For example:

<?php
$a = "425в";                            // ASCII digits plus a Cyrillic letter
$k = preg_match("/[\p{L}]+/u", $a);     // Unicode "letter" property
$l = preg_match("/[[:alpha:]]+/u", $a); // POSIX character class
var_dump($k);                           // int(1)
var_dump($l);                           // int(1)

both work and are not affected by the locale.

Still, it might be nice to let people choose which alphabets they want to be compatible with (same as with the setlocale idea).

mivanov93 commented 9 years ago

Also, for the regexes I think we should use the mb regex functions as well, so the encoding can be selected. preg_match can manage with the /u flag for UTF-8, but I am not sure whether it can be used for other encodings with some other flags, etc.

pmjones commented 9 years ago

I like the \p{L} idea. That could replace both validate and sanitize for alpha, and it avoids the locale stuff altogether.

mivanov93 commented 9 years ago

The locale stuff is only viable for the Word validation/sanitization, since it's a question of language whether something can be a word or not (i.e. whether it can contain characters like ` or ').

So for now perhaps we should:

  1. Use mb functions for strlen, ..., and for the regexes too (TODO); maybe include some polyfill functions if mb_ is not installed (TODO; see the sketch after this list).
  2. Set the default encoding of said proxy functions to UTF-8 as I did in my fork.
  3. Allow people to select the encoding on a per-subject/validate/sanitize basis.
  4. Use setlocale for the Word validation/sanitization, and maybe allow people to set it per subject/validate/sanitize too. That way, if it's set for a whole subject, things should be fine (though the locale may still cause some performance issues).
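
A minimal sketch of the polyfill idea from item 1 (the strlen_poly() name is hypothetical, not part of Aura.Filter):

<?php
function strlen_poly($value, $encoding = 'UTF-8')
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($value, $encoding);
    }
    if (function_exists('iconv_strlen')) {
        return iconv_strlen($value, $encoding);
    }
    // last resort, UTF-8 only: count code points with PCRE
    return preg_match_all('/./us', $value);
}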

mivanov93 commented 9 years ago

By the way, the current use of preg_match may lead to security problems. Not sure about it, but still.

As stated on php.net: "This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE. Please read the section on Booleans for more information. Use the === operator for testing the return value of this function."

And we currently have stuff like return (bool) preg_match($expr, $value); instead of return preg_match($expr, $value); since if we want to check that something doesn't match, we need the result to be equal to 0 and not Boolean FALSE.

So migrating to mb_ereg_match should give us that added security benefit without making our code uglier.
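
For comparison, a sketch of the stricter preg_match check per the === advice above:

<?php
$result = preg_match($expr, $value);
if ($result === false) {
    // engine error: bad pattern, backtrack limit, malformed UTF-8 with /u, etc.
    throw new RuntimeException('preg_match failed: ' . preg_last_error());
}
return $result === 1;   // 1 = matched, 0 = no match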

pmjones commented 9 years ago

@mivanov93 Here's an update on #89:

I have modified your PR #90 somewhat so that the string encoding is automatically detected, converted to UTF-8 for the various strlen/substr/strpad operations, then converted back to the original encoding, along with some other "readability" modifications to the mbstrpad() method. That should avoid having to explicitly set encodings anywhere.
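
In rough shape the idea looks something like this (a sketch only; the mbsubstr() helper name and the details are illustrative, not the actual code in the PR):

<?php
function mbsubstr($value, $start, $length)
{
    $encoding = mb_detect_encoding($value, mb_detect_order(), true);
    if ($encoding === false) {
        return substr($value, $start, $length);   // fall back to byte semantics
    }
    $utf8   = mb_convert_encoding($value, 'UTF-8', $encoding);
    $result = mb_substr($utf8, $start, $length, 'UTF-8');
    return mb_convert_encoding($result, $encoding, 'UTF-8');
}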

I have converted the "alpha" and "alnum" rules from ctype_* to regular expressions and unicode letter/number characteristics (e.g. \p{L} and \p{Nd}. That should avoid any need to deal with setlocale().

Likewise, I have converted the "word" rules from \w and \W to Unicode letter/number properties.

@mivanov93 you may wish to add unicode characters back to the alpha/alnum/word tests as a result of this.

/me wipes brow

Provided you find those changes applicable to your use cases, what remains to be done?

mivanov93 commented 9 years ago

@pmjones Just pushed test + fix.

At first things didn't work, but all the current problems were resolved by adding /u to our preg_match calls, since we are using Unicode.

But this way of testing says nothing about the auto-detection itself, since our strings are saved as UTF-8 in the test files, so we have to test that later too.

By the way, I'm not sure whether \w and \W can be replaced without a locale, since some languages use symbols like ' in many words, English included.

mivanov93 commented 9 years ago

Just made and pushed another commit.

I created two UTF-16 encoded files using Geany, and used them with file_get_contents as test strings.

And it fails and says they are binary strings. https://travis-ci.org/mivanov93/Aura.Filter/builds/71575119

pmjones commented 9 years ago

Rebase your work against the mbstring branch, then make a PR against the mbstring branch, so I can bring your work over and try it out.

pmjones commented 9 years ago

New information: http://php.net/manual/en/function.mb-detect-order.php says "For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always." (/me shakes head)

Time to figure out something else. I switched to encoding detection after finding out that mb_internal_encoding() causes failures on 5.4 and 5.5 on Travis.

Other thoughts on how to automatically detect the encoding, instead of forcing users to detect it themselves and pass it?

UPDATE: Maybe the mb_internal_encoding() failure on 5.4 and 5.5 had to do with http://php.net/manual/en/function.mb-regex-encoding.php "Version 5.6.0: Default encoding is changed to UTF-8. It was EUC-JP Previously."

UPDATE 2: Hm, the failures were not on the strlen-related filters, but only on Sanitize\Alnum and Sanitize\Alpha. We're not actually doing detection in those rules, so we may need to add detection.

mivanov93 commented 9 years ago

I just added the auto-detection to Alnum and Alpha (the sanitize rules), but as you've mentioned, the auto-detection fails for UTF-16 (in this case UTF-16LE): https://travis-ci.org/auraphp/Aura.Filter/jobs/71637349

So perhaps we should allow the encoding to be set explicitly, and only try to auto-detect if it isn't? On a per-field and per-subject basis, perhaps?

And also implement a way to set a default encoding (thus disabling auto-detection if, for example, I know beforehand that all my strings will be UTF-8), so as to avoid the performance cost of non-stop auto-detection when it isn't needed at all.

mivanov93 commented 9 years ago

Something else I just thought about:

In the tests we will need both strings to be UTF-16LE (in the sanitize test), otherwise the test will fail regardless of auto-detection, since the tests compare the strings as raw byte sequences.

I just made it so, but I'm not sure whether the test files are correct UTF-16LE and don't contain anything else (I will generate them with PHP later just to be sure). Pushed to the fork repo.

mivanov93 commented 9 years ago

Another thing: how is the detection supposed to work? For instance, say we have a MySQL database with a UTF-8 collation and a UTF-8 connection, so MySQL assumes everything we put in is UTF-8. But then someone submits a string that is, for example, UTF-16LE. The filter deems it valid because it looks fine, but then we put that UTF-16LE string into our database and mess things up: MySQL won't know to convert it to UTF-8, and it will be stored as something random. The parts that are not legitimate UTF-8 characters (ASCII too, as it's a subset of UTF-8) will be truncated, as per https://bugs.mysql.com/bug.php?id=53005

So, with that in mind, isn't it mandatory to let us pre-select the encoding we want, and only convert to that encoding when sanitizing if we ask for it?

mivanov93 commented 9 years ago

As for the alternative when there is no mb extension, what about iconv?

mivanov93 commented 9 years ago

And alpha and alnum need to convert the encoding, since preg_match can only match UTF-8 (with the /u flag, which I've added in my commits) and thus can't be used for, say, UTF-16. Otherwise we would have to use mb_ereg to handle another encoding without converting.

The bad thing is that mb_ereg has no counterpart in iconv. But perhaps a polyfill could be made by converting to UTF-8 using iconv and then using a normal preg_match.
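
A sketch of that idea, assuming the source encoding is known:

<?php
$utf16 = iconv('UTF-8', 'UTF-16LE', 'тест');   // e.g. a UTF-16LE value like the test files above

// preg_match with /u only understands UTF-8, so convert first
$utf8 = iconv('UTF-16LE', 'UTF-8', $utf16);
var_dump(preg_match('/^\p{L}+$/u', $utf8));    // int(1)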

pmjones commented 9 years ago

Regarding UTF-16 etc. support: how necessary do we think that is? This library may not need to cover absolutely every encoding ever -- it may be enough to cover only the ones detectable by mb_detect_encoding(). While that means the library is imperfect in the sense that it does not handle every case, it would still handle many more cases (9-10 times as many?) than it used to, which makes it 9-10x better than it was. At worst, doing so enables the primary multibyte case (as I see it, UTF-8), and allows other cases to be solved later.

mivanov93 commented 9 years ago

Agreed. That should be good enough for now (auto-detection of most charsets).

But what about the problem I mentioned above: with auto-detection we might mess up a database or two, because PHP doesn't know which encoding you're using and will not auto-convert to UTF-8 in the SQL queries, therefore putting invalid bytes in there.

Shouldn't we implement a way for the user to get only valid UTF-8 alpha strings, for example?

So for now I think we should use mbstring (with a potential iconv fallback).

So, all in all, as a compromise, what about the following: keep the auto-detection but allow us to disable it and thus force UTF-8.

pmjones commented 9 years ago

After trying several experiments, auto-detection is proving exceptionally difficult to implement. My research leads me to believe that bulletproof detection is either (1) impossible, or (2) so "expensive" as to be not worth it. Even the built-in mb_detect_encoding() fails in unexpected ways.

I am beginning to think that the filter system should be restricted to UTF-8, with warnings and disclaimers to that effect. Perhaps another sanitizer (as yet unwritten) could be in charge of converting from a known charset to UTF-8 as a precursor to the string filters.

Thoughts?

mivanov93 commented 9 years ago

Agreed. For now we should default to UTF-8 (and ASCII, which is a valid subset of UTF-8), hence all ASCII strings will be valid UTF-8 and we will keep compatibility with old versions of the rules.

As for the conversion, agreed too, later it may be implemented, but only from known encodings.

So for now all we should need are preg_match and preg_replace (both with the u flag), mb_strlen, and an mb version of str_pad, which I think only depends on mb_strlen. The rest should be fine.
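
For reference, a minimal sketch of such a multibyte-aware pad (the str_pad_mb() name is hypothetical, and it assumes a single-byte pad string):

<?php
function str_pad_mb($input, $pad_length, $pad_string = ' ', $pad_type = STR_PAD_RIGHT, $encoding = 'UTF-8')
{
    // str_pad() counts bytes, so widen the target length by the number of
    // "extra" bytes contributed by the multibyte characters in the input
    $diff = strlen($input) - mb_strlen($input, $encoding);
    return str_pad($input, $pad_length + $diff, $pad_string, $pad_type);
}

echo str_pad_mb('тест', 8, '-');   // "тест----"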

pmjones commented 9 years ago

Very good. If you don't mind, I may try to switch from mb_* to iconv_* as iconv appears to be more strict about malformed UTF-8.

As a side note, this has been a very enjoyable collaboration for me. Thanks for that. :-)

mivanov93 commented 9 years ago

Iconv is fine too, though I saw some benchmarks from 2008 saying mb was much faster, and I don't know whether that's still the case. But if it's really stricter about validity, I guess we should use iconv as the main option and maybe fall back on mb in case iconv is missing, though I think most PHP installations include both.

It was fun working on this with you for me too. Glad we came to a solution. :)

mivanov93 commented 9 years ago

Hello, I've just pushed some fixes and now I think it's all perfect. By the way, from my tests: mb is the fastest, around 5-10 times faster than iconv; the polyfill (no iconv or mb) is around 2 times slower than iconv (rough estimates).

pmjones commented 9 years ago

Dude, thanks a ton for all this work. I just merged the running PR. What's your twitter handle? I'd like to credit you there as well.

mivanov93 commented 9 years ago

Thank you for your cooperation as well; now I can finally filter my mix of Cyrillic/Latin stuff. I don't have a Twitter account though, but thanks for the credit anyway :)