Sanitization - Githubissues

Respect / Validation

The most awesome validation engine ever created for PHP

https://respect-validation.readthedocs.io

MIT License

5.8k stars 774 forks source link

Sanitization #118

Closed therik closed 9 years ago

therik commented 11 years ago

Hello guys, I see some cool stuff going on here, but I didn't find any way to actually change the data. I guess the sole purpose of this is to return true/false + info "why false". I didn't dig deep, but I see no mark of sanitization of data. It's kind of shame, I was looking for an alternative for my own crappy hackish recursive array sanitizer, because it has horrible api. But your api would be very cool for sanitizing.

nickl- commented 11 years ago

@therik Glad you like what you see and kudos for sharing your suggestions. We all know those "crappy hackish recursive" spectacles you talk about but hopefully, if we only get one thing right, we manage to make the developers everyday tasks just a little easier.

Your initial assessment is correct that Respect/Validation provides the means to validate the data and produce a set of referenceable exception messages for you to correct. This could aid as basis for part of your requirement that being the identification of exceptions which leaves us to address the 2nd part of the requirement, what to do about that.

Perhaps if you provide us with a bit more information on how you go about deciding on the appropriate modification to apply to the data, it may help in devising a fluent strategy for defining the rules. We think best in interfaces which we find simplest to express our requirements with. I.o.w. based on what you've learnt about how the Validator works, try and express in source code how you see the implementation of a sanitizer can work. You are free, at this point to explore your fantasies so pay no attention on how it would be accomplished just how you would ideally like to use it. Make sense?

I would give you an example but I don't want my assumptions to cloud your vision, this is your opportunity to reach for the skies and dream big. You can browse other issues and you will see how we generally discuss solutions for ideas but feel free to ask if you are not sure how to proceed and I can sweep the archive for an example but there are really no rules, just write the code and we can adapt it to something that we can implement. Show us what you have in mind.

therik commented 11 years ago

Ok, I didn't think much yet, but I got this ideas for a simple string. Those are just ideas, they are inconsistent, if you implement one, another won't work. I will try to clarify it if necessary, but by default assume that those ideas are separate. If we make the order of methods in chain significant and add some filtering methods, we could write things like:

$boolean = v::string()->equals('foo')->trimSpaces()->validate('   foo   '); // returns TRUE

validate() still returns boolean, but ' foo ' is trimmed before it is compared with 'foo'. trimSpaces would simply change the value, instead of validating it. This wouldn't be a big change. The problem is, the order of operations in this case is from right to left, which is opposite of the direction of those arrows and direction of natural flow of language, but it makes sense because the validate() is on the right side of this expression and the result value is going to the left. To fix this (partialy), we could put the validate() to the first place and make the order of operations from left to right, like:

$bool = v::validate('    Hello world   ')->string()->trimSpaces()->strReplace('world', 'Betty')->equals('Hello Betty'); // true

You should still be able to pre-create the validator and fill the validate() later:

$is_asd = v::string()->trimSpaces()->equals('asd');
$bool = $is_asd->validate('   asd '); //true

Maybe we could add some kind of "dummy" method (like create() ) on the first position, to avoid the confusion about whether the validate() method is there or not, like:

$is_asd = v::create()->trimSpaces()->...

with the create() method on the first position, this would simply return the validator for later use.

This option (operations are done in left-to-right order) is imho a bit easier to follow, then the right-to-left method, but it might break a lot of compatibility. In any case, the order of operations has to be significant if the data changes during the validation.

All this was just glorified validation, so let's look at sanitization. (From now on, I'm going to use the second method in those examples, operations done from left to right, left-first, right-last).

Now when the chain can change values, just add another method that gives the resulting value instead of boolean value, let's call the method filter():

$trimmed = v::filter('   foo   ')->trimSpaces(); // returns 'foo'

Of course, sanitizing and validating operations should be freely mixable, but with filter(), it's not possible to return true/false from failed validation, instead we have to specify default value that is returned when some validation method fails, using another method:

$trimmed = v::filter('   foo   ')->default(null)->trimSpaces()->equals('foo'); // returns 'foo'
$trimmed = v::filter('   bar   ')->default(null)->trimSpaces()->equals('foo'); // returns null

Or better solution, as a second parameter for filter() method:

$trimmed = v::filter('   foo   ', null)->trimSpaces()->equals('foo'); // returns 'foo'
$trimmed = v::filter('   bar   ', null)->trimSpaces()->equals('foo'); // returns null

Bonus feature: use the default() method to define different default values based on what validation went wrong:

$HelloBetty = v::filter('   Hello world   ', 'Goodbye')
    ->default(null, v::string())
    ->trimSpaces()
    ->strReplace('world', 'Betty')
    ->equals('Hello Betty'); // returns "Hello Betty"

$HelloBetty = v::filter('   blah blah blah   ', 'Goodbye')
    ->default(null, v::string())
    ->trimSpaces()
    ->strReplace('world', 'Betty')
    ->equals('Hello Betty'); // returns "Goodbye", that's the main default value

$HelloBetty = v::filter( array() , 'Goodbye')
    ->default(null, v::string())
    ->trimSpaces()
    ->strReplace('world', 'Betty')
    ->equals('Hello Betty'); // Seems like it should return null

This last example is a bit tricky, result depends on whether default() breaks the chain and returns immediately or just change the value to default. If it breaks the chain, the result value would be null. But if it just changes the value to null and doesn't stop the chain, the null value is passed on, trimSpaces() and strReplace() would do nothing except maybe casting it to string (or throw exception?) and equals() would fail, so the overall return value would be 'Goodbye'. There can be third parameter in default() for this specific scenario, to explicitly specify the behavior:

...->default(<value> , <v::condition>, (bool) <continue = false> )-> ....

Well, that's all I could think of now. Once strings are ready, we can think about arrays, there's more fun in arrays. I can send you my validator, its only about 200 lines of recursive functions, but I have no tests for it and there's not a single comment. I should write some comments there first. Well and it's not really a validator, it can only sanitize, it always returns the result value or default value, with "default default" value being null.

nickl- commented 11 years ago

@therik Love it =) you rock!

Let me digest it before I comment but you're likely to hear from the others @Respect quite soon as well.

alganet commented 11 years ago

Putting the ->validate in the beginning of the chain is very tricky, 'cause there is no way to actually find when the chain ends so you can return true or false or when you should return $this to more chain nodes.

The Respect\Conversion came to my mind in this issue as well http://github.com/Respect/Conversion. It handle arrays pretty well, but is far from finished.

I also need some time to think of the filtering API, but I do have some convictions:

We should use the works "filter" and "sanitize" with the same meaning as php.net/filters.
We should reuse the current rules if possible.

Initially, I think of a naive solution. This is my suggestion:

<?php

$validator = v::string()->alnum('_')->noWhitespace();

$validator->validate('foo # bar '); //false
$filtered = $validator->filter('foo # bar '); // empty string ''
$sanitized = $validator->sanitize('foo # bar '); // string 'foobar'

Essentially, each current rule will have filter() and sanitize() provided by Filterable and Sanitizable interfaces when applicable. Some new rules for sanitizing only or filter only may be created.

There is no way to sanitize some things, like v::isFile(). When a rule doesn't implement one of the *able validation interfaces, it is ignored. The idea is that you could use the same object for validating (checking if it's ok), sanitizing (cleaning data) and filtering (letting only valid data pass).

therik commented 11 years ago

I see. Well there can be one get() method at the end to indicate the end of a chain. Or, there can be assign('data') to put the data in and validate(),filter() and sanitize() methods would take no arguments and just indicate the job and end of the chain. That way this would be possible:

$is_valid = v::assign('data')->noWhitespace()->validate(); //true

//with premade validator
$validator = v::string()->noWhitespace(); 
$is_valid = $validator->assign('   data   ')->validate(); //false
$filtered = $validator->assign('   otherData   ')->filter(); // 'otherData'

// and data reuse
$is_valid = $validator->validate(); //false,  '   otherData   '  is still assigned

I just realised that it would not be possible to mix filtering and validating in one chain (like stripping spaces and then comparing) if the action was determinated once for the whole chain. There can be a naming convention, like valNoSpaces() for validating and sanNoSpaces() for sanitizing or filtering, but that's a lot of broken compatibility. An alternative can be putting parts of the chain as a parameter to the validate() or sanitize(), something like this:

$validator = v::assign('   data  ')->sanitize(v::noSpaces())->validate(v::equals('data'));
$bool = $validator->isValid(); // true
$result = $validator->getResult(); // 'data'

But that's quite messy.

alganet commented 11 years ago

The idea is to have a single rule tree and three different interfaces operating them while enjoying some abstract code in between.

All the logic Rules\AllOf has to iterate over it's children would be recycled, but the Filterable interface when called on it's defined filter() method would carry on value substitution instead of validation. Here is a humble naive v::int() sample implementation:

<?php

namespace Respect\Validation\Rules;

use Respect\Validation\Validatable;
use Respect\Validation\Filterable;
use Respect\Validation\Sanitizable;

class Int extends AbstractRule implements Validatable, Filterable, Sanitizable
{
    public function validate($input)
    {
        return is_int($input); //simplified for example
    }

    public function filter($input)
    {
        // This could also be the AbstractRule default implementation
        return $this->validate($input) ? $input : null;
    }
    public function sanitize($input)
    {
        return (int) $input;
    }
}

We could have use ...Validator as v, ...Filter as f for exclusive builders and also a use ...Rule as r to imply that objects built on r can be sanitizers, filters and validators at once. I have no idea how to build them yet, but it is a common side effect of this refactoring and can be turned into a feature :) we'll learn how and improve to use it. Note however that the object is the same, but the flow is separate. The user has to call validate(), filter(), and sanitize separately (or, in steps, if built with v::assign()).

My suggestion is based on the current architecture, it's based on following and extending it considering the current limitations.

I really liked the v::assing('some input')->someChain()->validate(); flow. Perhaps v::with() to be more concise with the overall declaration. But I don't really care 'cause I know I suck at naming =D

The isValid() and getResult() are also nice simple additions that could provide a consistent alternative API and I like it.

Allso, let me thank you for your suggestions! It's really awesome to have someone providing so rich feedback. Reminds me of @nickl- when he got involved =D

therik commented 11 years ago

Well, you're welcome :) It's not really a feedback, since I didn't use it yet.

First, I have to admit that I don't really get the difference between sanitize and filter, since both are supposed to let good data pass and drop bad data. If I stick with your int() example, although the filter and sanitize methods are different, the int()->sanitize() would have the same functionaity as ctype_digit()->filter() for example. Maybe I'm wrong, but I see just two methods for each rule: "validate" and ( "sanitize" or "filter" ), whatever we name it.

Another thing that comes to my mind is how to express which of those two methods is used for each method in a chain: If there is v::with('data')-> -> -> ->validate() for validating and v::with()-> -> ->filter() for something else, the whole chain is going to work in validating mode or filtering mode, which might not be what you want. Example: You might want to make sure the data is string (validate), strip the data off of trailing spaces (sanitize), make sure there's no html element inside (validate), substitute {{name}} with some other string (sanitize), make sure it matches some regex(validate), then translate some characters into html entities(sanitize), and add slashes (sanitize). If you can only set the validate/sanitize behavior once for the whole chain, you will have to split the chain into many tiny chains. No doubt it will still be very useful, but you'd loose lot of functionality. Imho the biggest killer features for string sanitization would be the ability to freely mix sanitization rules with validation rules and the possiblity to set "default values" for validation rules, effectively turning validation rules into sort of a ternary operator.

therik commented 11 years ago

I just got an idea how to fit all the goodness in the api without changing the current rules. It's bit of a rape, but I'm just posting it here and you tell me how much of a rape it is.

We can define data input method with, data output methods (getValid, getResult...) and several execution control methods that would share some common characteristics:

Each of those should have very short name, since they are to be repeated a lot, one character would be the best ('v' for validate, d for default, s for sanitize, possibly 'f' for filter, etc..)
Each of those should be able to work in two modes:
- without any chain as a parameter, will affect the rest of the chain from now on until the end of the chain, or until another change is made further in the chain
- with another chain as a parameter, simmilar to ->not(v::chain())-> will affect only the chain in parentheses

if you call getResult or getValid, the chain would behave the same way, except in one case, it would return the resulting value and in the other, it would return true/false. So the getWhatever() would not affect the execution whatsoever, it would only choose what info is returned.

Rules would have several APIs:

"validation" would not change the data, it would only store true/false in some array, in case getValid is called at the end of a chain, this array's values would be ANDed together and returned. Same like the current behavior of the validator
"sanitization" would sanitize
"ternary" or "defaulting" Well, in the first concept, I though the validation rules would act as ternary (sort of, "if-not-valid-then-change-to-default" kind of ternary) in case they are used as sanitization rules, but that would require different sets of methods for validation and sanitization. Ternary is not a good name because it's not really same as php's ternary operator, but it's simmilar idea.

Which of those api would be used will be determined by those control methods.

Perhaps it's best explained on examples.

simplest three cases are:

// purely validating chain, assuming v() for validate:
v::with('data')->v()->string()->noSpaces()->getValid() // true
v::with('   data')->v()->string()->noSpaces()->getValid() // false, noSpaces fails
v::with( 123 )->v()->string()->noSpaces()->getValid() // false, string fails
// small note: since nothing else except validation is going on here, the original value remains unchanged and getResult, if used, would return the unchanged value

// purely sanitizing chain, assuming s() for sanitize
v::with('   data   ')->s()->string()->noSpaces()->getResult() // 'data', noSpaces stripps the spaces off
v::with(123)->s()->string()->noSpaces()->getResult() // '123', string() converts it to string
// small note: getValid(), if used, would return true, since no validation is going on here.

// purely "defaulting" chain, assuming d( $default = null ) for default
v::with('data')->d()->string()->noSpaces()->getResult() // 'data'
v::with('   data   ')->d()->string()->noSpaces()->getResult() // null
v::with('   data   ')->d(false)->string()->noSpaces()->getResult() // false
v::with('   data   ')->d('whatever')->string()->noSpaces()->getResult() // 'whatever'
// small note: again, getValid() would return true

The important thing is, those short-named methods, in this case, change the behavior of the following rules only, not preceeding rules. So you can make chain that -for example- validates an input after the input was sanitized:

v::with('data')->s()->noSpaces()->v()->string()->equals('data')->getValid(); // true
v::with('   data   ')->s()->noSpaces()->v()->string()->equals('data')->getValid(); // true, noSpaces() works in sanitization mode
v::with(123)->s()->noSpaces()->v()->string()->equals('data')->getValid(); // false, string() fails (assuming noSpaces() doesn't convert the data to string)
v::with('   123   ')->s()->noSpaces()->v()->string()->equals('data')->getValid(); // false, equals() fails
// note: if getResult was used, return values would be 'data', 'data', 123 or '123', '123' respectively.

Another useful thing would be changing the data to default value if it doesn't match, combined with sanitization:

v::with('data')->s()->noSpaces()->d()->string()->getResult(); // 'data'
v::with('   data   ')->s()->noSpaces()->d()->string()->getResult(); // 'data'
v::with(123)->s()->noSpaces()->d()->string()->getResult(); // null
v::with('123')->s()->noSpaces()->d()->string()->getResult(); // '123'
v::with('   123   ')->s()->noSpaces()->d()->string()->getResult(); // '123'
v::with(new stdClass)->s()->noSpaces()->d('foo')->string()->getResult(); // 'foo'

I think you get it, that's the basics.

You can pretty much change the behavior wherever in the chain you want, by putting ->*()-> in the place. In case you are in a situation where you have some really long chain using one operation, but sometimes you want just one or two rules to use another operation, that's when the alternative way comes into action:

v::with('data')->d()->string()->s(v::noSpaces())->equals('data')->getResult(); // 'data'
v::with(array())->d()->string()->s(v::noSpaces())->equals('data')->getResult(); // null, string() converts it to default
v::with('    data   ')->d()->string()->s(v::noSpaces())->equals('data')->getResult(); // 'data'; noSpaces() works in sanitize mode
v::with('bar')->d()->string()->s(v::noSpaces())->equals('data')->getResult(); // null, equals() converts it to default null, only the chain in s()'s parametter is sanitizing, if the s() contains a chain as a parametter

From now on, you can go wild: You can stick those behavior changing methods serialy, you can easily make an exception in middle of a long chain, you can even stack them inside each other - make small exception inside longer exception inside some really long chain, etc...

One last thought I am thinking about is to use public properties _result and _valid (or magic __get, to make them read-only), instead of getResult() and getValid(), theese might be updated after each rule and you can harvest it later.

That's pretty much it. So, is the api understandable, or is it way too much? :)

therik commented 11 years ago

There was no response for quite a while, so I tried to make some preview: https://github.com/therik/ValSan It's just the concept proposition, nothing serious. And there's no description, look at tests for how to use it. I'm thinking about how to implement arrays now.

nickl- commented 11 years ago

@therik do you mind forking and applying the changes to a pull request, it is very difficult to assess the impact with your current example.

therik commented 11 years ago

@nickl- I can, but I don't know your code, it will take some time for me to get oriented. I wanted to make preview of the api, try different ways, see the ease of use and stuff, without having to worry about too many things. I did it, sort of, but didn't sort arrays out well, yet. I was waiting for your comments, but you disappeared for a month or so :)

nickl- commented 11 years ago

@therik but you disappeared for a month or so

My bad, life called and this slipped past my immediate attention and focus, I apologize. This is a brilliant concept which I do find myself entertaining in the back of my head on frequent occasion.

I can, but I don't know your code, it will take some time for me to get oriented.

Is that not a fork of Respect\Validation in the project you submitted for review? It is not immediately clear, without a proper diff (which comes as added benefit if we do this via a fork/pull request) of the modifications you made, which makes commenting (another benefit of pull requests) and review rather difficult.

If it is a fork you can simply copy the content into an actual fork and submit a pull request which will make the differences evident. This seems like the easiest approach for everyone and don't worry if it doesn't follow our architecture, this will provide the platform by which to accomplish that as well. Alternatively please point me in the general direction where you want me to focus on.

therik commented 11 years ago

It's not a fork, it's rather a preview of the api, quickly hacked together just to see what happens. I looked in your code for a moment, then I quickly realised it might take me quite a while to figure out where to even start, so I just made part of the validating/sanitizing engine from scratch. It's to test different ideas, see what works, what doesn't work, and it's an exercise for me to get the idea about what's possible with flowing api as well. There are currently only very few rules, it doesn't give any error messages back, it's lacking all kind of features which I just dropped to make it simpler for a moment. Just look at tests to see how to use it. It's its only purpose, to show "how to use it". edit: I just looked at it and it doesn't seem to look as obvious to me as it did before, so 'Im gonna write some readme, so you don't have decipher those tests that much.

nickl- commented 11 years ago

@therik =) perhaps copy paste those functions here so we can look at it in context. Is that possible?

therik commented 11 years ago

@nickl- what functions do you mean?

nickl- commented 11 years ago

@therik I was referring to the engine you wrote from scratch, hoping to increase exposure and simplify revue of your suggestions. Is there anyway you can expose only your changes, or the highlights at least, without requiring everyone to dig through a whole project of source code of which some have nothing to do with the sanitization, from what I can tell?

I stil have to find a moment to look at this closer so I may be mistaken. Is the implementation the exact same thing as you explained in your post dated 27 Feb? Ideally a pull request or patch would make things easier, if that is not possible simply providing the code blocks in a post will at least help simplify the access and effort required. I should get some time by weeks end, hopefully, so we can put this under the magnifying glass in whatever form or shape. =)

Whatever happens, the only way to get it committed is via a pull request, there is no way around that I am afraid. Unless you have a better idea?

therik commented 11 years ago

It's not forked, so if I make pull request, you would be looking at diffs of two different projects. That makes no sense to me.

There are no changes to highlight, because everything is a change, its written from scratch. And it's just the sanitization/validation stuff, there's nothing else in there.

The code might be messy, but don't get discouraged, the code isn't important. It's just quick trial to see the api working. It was never supposed to be finished product. I want to implement this in your engine later, but I wanted to see how the api would work, before I start digging into it for real. You can see the important stuff in readme.

Now I see it actually caused more confusion than clarification. Think about it as "living example", because it really is just an example, that somehow works.

By the way, can we use some faster way of comunication here, like IRC for example?

nickl- commented 11 years ago

@therik It's not forked, so if I make pull request, you would be looking at diffs of two different projects. That makes no sense to me.

Nope none to me neither, would hardly solve the problem of identifying only the change, would you agree.

By the way, can we use some faster way of comunication here, like IRC for example?

I am currently pressed for time, hence the request to simplify this review but you are always welcome to visit us at #php-respect you should find someone there on occasion. The panda bot crashed well my HDD crashed and with it the bot and I do still have to get that back too... I just don't seem to get time for everything. My bad =(

I do see a use for this, we must just find opportunity, I hope you can bear with us. Please! Please

mf commented 11 years ago

I think I am beating a dead horse here, but for what it is worth I am against letting Respect/Validation actually modify any data. This has nothing to do with the original intent of the package - which is chained validation requests.

Data mutation has nothing to do with this project, in my own (admittedly) biased opinion.

therik commented 11 years ago

@mf I think it would be better to do it as separate project, there might be many people who don't like modifications in validation.

mf commented 11 years ago

Agreed. You can fork the library and take it into modification realm. I suggest closing the issue meanwhile. Thanks!

alganet commented 11 years ago

Well, let's explore the idea of a data sanitization/filter only library aligned (although not coupled) enough with Validation so we can use both in the same chain =)

About the chain changes in Respect\Validation, we should discuss that here as well. There are some pretty good API ideas here!

nickl- commented 11 years ago

This was on my agenda too, we can't do the one without the other (sanitization without validation) but you may certainly want to use the one without the other (validation without sanitization). The design of Respect/Validation lends itself to the addition of other rules, on paper at least.

We most certainly need to identify a deficit first and we can already do that, the question now is if we can also modify the value in chain. The simplest way to prove this is patching Respect/Validation with a proof of concept.

The main reason I am looking to see a patch, aside from the obvious time saving factors, is to see how we incorporate the normalization as additional rules alongside of the validation rules. To make the data altering a separate project is purely a cosmetic decision as @alganet points out it bears effect on Respect/Validation and since it is not something we've tried before, without a proof of concept that works we won't know what the impact is. Lets first see what happens when including "modifying rules" into the chain and then we talk logistics.

I suggest we take break off from this thread only once we are ready to entertain the implementation requirements, do you agree @mk? In the meantime we should focus an adding this functionality with as little change as possible to the current validation implementation. Hence the need for a small and concise patch or iow, make it work by only adding a new rule. Can we do that or do I have high hopes? =)

nickl- commented 11 years ago

Without getting fancy, lets solve the default value use case. Would that help speed this along?

Simply put:

If validation fails, assign supplied value and pass validation.

dbx834 commented 11 years ago

Hi all,

I had been working along the same lines and a while ago created a jQuery(-AJAX)-PHP data sanitizer built on top of the Respect/Validator. It is incredibly-rickety in terms of stability but it works, and works well as long as it works. When it crashes, no one knows why :P.

You can also sanitize data client-side itself if you wish to. Any data that can be described by the respect/validator as bad or good can be also sanitized. You can validate/sanitize the same data server side using the same function calls and methods. Other data (like text inside a file) can also be validated with ease. You can (rudimentarily but very easily) stitch a sanitizer and a validator that switches between two functions one after another (validate data than sanitize something than validate again).

These codes have been used internally by me and my colleagues. Seeing so many of you interested in such a system, I will make them open-source once I develop them more (for which I need time).

The most limiting thing is that our sanitizer is slow for larger data and complex text in structures (eg. XML). To counter-act that, I had been working on a task-queue system.

An example for forms:

Suppose there is a form and it has some text inputs. You want to sanitize the data that would be passed to the server. Now, the special thing about these methods is that the data is sent before the form is submitted. The data is queued to be sanitized preemptively.

For some input, user enters something.
Leaves the input textbox
AJAX hooks catch the data on focus-out.
Data posted to a handler on the server
Also posted are - sanitizing parameters (defined in a class in the input), and a unique ID (defined on the fly by JS)
The handler creates a queue for the various sanitizations (this runs in the background)
The user submits the form.
A quick string comparison checks data posted through AJAX is same as given (this is not needed, but nonetheless) This is all that is done with the submitted data.
We already have the sanitized data from the queued tasks and it is ready for further processing

What happens if user goes to the same input and types in something else?

We have the unique ID, remember?
Sends this id along with data again on focus out.

Unfortunately, I could not complete this as I was kept with many other projects. I admit this is an over-kill. Over sighted by my ambition, success with coming up with last-minute hacks, and frankly amazed at how far I had ventured, I did not stop when I should have had and ended up with this. However, such features would be incredibly useful for big data, for data-streams and for complex documents.

To those working on sanitizer functions, please bear in mind that sanitizers can be slow as the volume of data increases.

-Best