Crell / Serde

Robust Serde (serialization/deserialization) library for PHP 8.
Other
296 stars 14 forks source link

Order of PostLoad methods is unexpected #65

Open hschletz opened 3 months ago

hschletz commented 3 months ago

According to the documentation, "any PostLoad methods will be invoked with no arguments in lexical order." This is not true:

class MyClass
{
    private string $prop1;
    private string $prop2;

    #[PostLoad()]
    protected function a()
    {
        throw new \Exception('A');
    }

    #[PostLoad()]
    protected function b()
    {
        throw new \Exception('B');
    }
}

This will result in Exception "A", but if the methods get swapped, the result is Exception "B". It looks like the actual order is determined by the result of ReflectionClass::getMethods() (tested on PHP 8.3.9), where the order is not officially documented and may change any time.

Sometimes it is desirable to have multiple PostLoad methods executed in a particular order, or at least have limited control so that we can guarantee that a particular method is invoked before another. Having order depend on the order of declaration would already be bad because the order of declaration generally should not matter. Guaranteeing a particular order may become impossible when some PostLoad methods come from a trait or parent class.

The idea of lexical ordering, even if it worked, is also ugly: that would force us to name methods accordingly, possibly breaking code when a method is renamed, or resulting in names like aaa_first() and zzz_last().

Would it be possible to add an optional "priority" parameter to the PostLoad attribute? That would give developers enough control over the invocation order, while not restricting how methods should be named or organized.

If this is not possible, the result of ReflectionClass::getMethods() should be sorted do give some predictability and match the documentation.

Crell commented 3 months ago

Hrm. "Lexical ordering" actually has two slightly different meanings. In this case it means "the order of the lexer", as in, the order in the file. (There's a chance I'm wrong about the wording, I'll have to double check. But "order in the file" is the intended and correct behavior.) Or, more precisely, it's in the order the methods are returned by reflection, which is up to PHP and should always (I think) be in file-line order, so it is deterministic.

Adding a priority option would... I think be possible. I'm just wary of adding too much complexity. In my event dispatcher, I have both priority numbers and topological ordering, which seems like overkill here but I could see someone asking for in the future.

... Though... if the ordering is done at build time in ClassSettings, there would be no runtime impact (assuming caching). And... that might even allow for more interesting ordering if we instantiate an Ordered Collection (what Tukio uses) directly in the attribute. I am not crazy about that idea but... I don't think it would be that difficult? Hm.

I will have to ponder this further. For now at least, "order returned by reflection, which should be the order in the file" is the logic, and it is deterministic, so you can structure the code accordingly.

hschletz commented 3 months ago

OK, I misunderstood that as "alphabetical order". The current implementation makes sense too, but should not be relied upon for the reasons outlined above (undocumented behavior, composition from multiple sources, significance of order that should not have significance). Relying on the order would make the code extremely fragile, so IMHO the documentation should point out not to do that.

It's not a critical issue. If various actions need to be performed in a particular order, they can be handled reliably in a single method. Composition can be achieved by calling various methods from there. Wiring up multiple methods by attributes (in a way that does not rely on specific code organization and PHP's implementation details) would just be a nice addition that would remove the need for a glue method.

Crell commented 3 months ago

I'm curious what use case you have that makes multiple PostLoad methods make sense in the first place. The only use cases I could think of when making it were extra validation and cache warming, both of which are pretty simple and easy to order; or they can just be put into one method. When would you have a complex enough situation to warrant enough separate methods that custom ordering is even needed?

hschletz commented 3 months ago

The use case I have in mind is form validation, composed from multiple classes/traits, with dependencies between various validation steps. There are basically 2 categories of validation which should be handled differently:

The first category involves general sanity checks. Serde can handle some of these. For example, if the form markup contains <input type="text" name="foo">, there must be a "foo" key in the form data, which can be ensured by marking the field as required. There are more complex checks like validating a token for CSRF protection, which must be implemented by custom logic.

These sanity checks should never fail in practice, but must be done anyway because invoming request data cannot be trusted. Failure does not have to be handled nicely - simply throwing an exception is typically enough.

Once all sanity checks have passed, field data has to be checked for incorrect input which can happen by user action, like leaving a required field blank. Typical handling involves collecting all errors and re-showing the form with validation messages near the corresponding field.

A single-method validation class may look like this:

#[ClassSettings(requireValues: true)]
class FormData
{
    private string $csrfToken; // only for sanity check, not useful for later evaluation

    public string $field1; // user input
    public string $field2; // user input

    #[PostLoad]
    protected function validate()
    {
        if (!CsrfValidator::isValid($this->csrfToken)) {
            throw new SanityCheckFailedException('Invalid CSRF token');
        }

        $errors = [];
        if ($this->field1 == '') {
            $errors['field1'][] = 'Input required';
        }
        if (Database::recordExists($this->field1)) {
            $errors['field1'][] = 'Duplicate value';
        }
        if ($this->field2 == '') {
            $errors['field2'][] = 'Input required';
        }
        if ($errors) {
            // Calling code should catch this and evaluate attached errors
            throw new InvalidUserInputException($errors);
        }
    }
}

Some validations like the CSRF check have to be done in many forms. To avoid repetition, it could be moved to a trait:

trait CsrfValidation
{
    private string $csrfToken;

    #[PostLoad]
    protected function validateCsrfToken()
    {
        if (!CsrfValidator::isValid($this->csrfToken)) {
            throw new SanityCheckFailedException('Invalid CSRF token');
        }
    }
}

#[ClassSettings(requireValues: true)]
class FormData
{
    use CsrfValidation;

    public string $field1;
    public string $field2;

    #[PostLoad]
    protected function validate()
    {
        $errors = [];
        // form-specific validation like above
    }
}

To keep validation logic intact, validateCsrfToken() must be run before validate(). Otherwise there would be incorrect error handling in case of an invalid token AND invalid user input. This is difficult (or even impossible) to guarantee without explicit specification, and the order is not obvious when methods are defined in multiple classes/traits. Topologicial ordering would be a nice tool for this problem.

This can be worked around by removing the PostLoad attribute from validateCsrfToken() and calling it manually from validate(). So no critical blocker here, just one extra step.

It should be noted that validation inside the data class is only feasible for simple checks. The examples contain static mathod calls for simplicity. For more complex and reusable logic, validation should probably be handled by injectable dependencies, which would be difficult to implement for data-only objects. So I will have to resort external validation anyway.

Bottom line: Personally, I will have to solve the problem in a different way, so the issue is currently not relevant for me. I wanted to point out the perceived discrepancy between documemented and observed behavior, which turned out to be a misunderstanding. As for the use case, the example should have made it clear. I think that in-class validation with explicit ordering may still be the simplest option (from a consumer's perspective) als long as no DI is involved, but it's not the only solution.

Crell commented 3 months ago

Interesting! Processing incoming form data is one of the use cases I had envisioned for Serde, but I hadn't thought that far ahead to how to dovetail validation together.

That said, my general stance is that since, as you note, many kinds of validation require a service of some kind, it doesn't make sense to do in a data object. Rather, my intent was to say "just go use a validation library", like Symfony/Validator, or if I got bored I'd write one myself based on AttributeUtils. (I haven't gotten that bored yet.) One of the main reasons is that currently, PostLoad methods have only limited ability to interact with the caller. For validation, you generally don't want an exception but some well-structured validation error messages that can be formatted nicely. That's what a tool like Symfony/Validator supports.

To really bake that into Serde would require restructuring the whole error handling process to essentially be a Result type monad, which would be a significant API break for users right now. Likely it would also require a separate #[Validator] set of methods. However, that would still have the same dependency problem for non-simple cases.

So really, I think there's an upper-bound on the kind of validation that a deserializer should try to handle. It should really only be structural issues, like "value A is larger than value B". For more elaborate validation, I do think an external validation tool is more appropriate.

That said, thinking on it, it likely would be possible to incorporate sorting into the attributes. The question is whether to only support priority numbers (which could likely be done internally) or also support topological (which would require an additional dependency for OrderedCollection). I'm also not sure if it would make more sense in Serde, or to bake down into AttributeUtils.

So I'll add ordering to my mental todo list for whenever I find an opportunity, but it doesn't sound like that would fully address the validation case anyway.