azavea / grout-2018-fellowship

This is a (possibly temporary) issue-tracking repo for the Grout fellowship in 2018; there's not anticipated to be much, if any, code here

Validate Records against their schema before creating them #44

Closed jeancochrane closed 6 years ago

jeancochrane commented 6 years ago

In a twist of fate, it turns out that there's no built-in validation to make sure that a new Record matches its schema in the REST API. In DRIVER, the schema editor takes care of schema validation, which is why we hadn't noticed it before.

The RecordSchema model has a validate_json method that seems primed to perform this kind of validation:

def validate_json(self, json_dict):
    """Validates a JSON-like dictionary against this object's schema
    :param json_dict: Python dict representing json to be validated against self.schema
    :return: None if validation succeeds; jsonschema.exceptions.ValidationError if failure
             (or jsonschema.exceptions.SchemaError if the schema is invalid)
    """
    return jsonschema.validate(json_dict, self.schema)

We should build this validation into the Record viewset and make sure that it's tested.
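A framework-free sketch of the intended flow (the names `create_record` and the local `ValidationError` here are hypothetical stand-ins for illustration, not actual Grout or DRF APIs): validate the payload against the schema before saving, and surface a failure to the client.

```python
class ValidationError(Exception):
    """Stand-in for the error a DRF serializer would surface as a 400."""

def validate_json(json_dict, schema):
    # Stand-in for RecordSchema.validate_json; the real method delegates
    # to jsonschema.validate(json_dict, schema), which raises on failure.
    # Here we only check 'required' to keep the sketch self-contained.
    for key in schema.get("required", []):
        if key not in json_dict:
            raise ValidationError("'%s' is a required property" % key)

def create_record(data, schema):
    # The Record viewset/serializer would run this check before saving.
    validate_json(data, schema)
    return {"saved": True, "data": data}

schema = {"type": "object", "required": ["name"]}
create_record({"name": "example"}, schema)  # saves fine
try:
    create_record({"foo": "bar"}, schema)   # rejected before save
except ValidationError as err:
    print(err)
```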

jeancochrane commented 6 years ago

While the tests are passing, I'm still able to add Records with the wrong schema by posting them through the API interface.

Steps to reproduce:

  1. Spin up a development server of Grout and navigate to /api/records
  2. Post a new Record with data that does not match the schema (I used {"foo": "bar"})
  3. Confirm that Grout saves the Record

Not sure where this is coming from yet, since I've confirmed that A) the tests test this functionality and pass and B) the Record.clean_data method is indeed getting called.

jeancochrane commented 6 years ago

I think I'm closing in on the bug. My suspicion is that the problem here isn't with the validation step in the Record.clean_data method, but rather in the way that schemas are being defined by the schema editor.

By default, two things seem to be true when a schema is created in the schema editor:

  1. None of the form (related content) fields are marked as required in the schema definition object.
  2. additionalProperties is set to true by default on the schema object (per the JSONSchema spec).

As I understand it, these two properties mean that if an incoming Record has a wildly different schema from its RecordSchema, it may in fact validate. So for example if you have a schema with the following (simplified) definition:

{
  "properties": {
    "posterDetails": {
      "type": "object",
      "properties": {
        "url": { "type": "string" },
        "name": { "type": "string" }
      }
    }
  }
}

The following Record will validate, since A) additional properties not defined in the schema above are permitted by default and B) the missing posterDetails form (related content) is not required:

{ "foo": "bar" }

If this is correct, then my instinct is to find where the schema template is being created in https://github.com/azavea/grout-schema-editor/blob/develop/app/scripts/schemas/schemas-service.js and alter the two properties above. However, I'm still pretty new to JSONSchema, so I'm curious what more experienced folks think.
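To make the two defaults concrete, here's a tiny hand-rolled illustration of just the `required` and `additionalProperties` keywords (not a real JSONSchema validator, and the `check` function is only for this demo): with the defaults, the bogus Record sails through; with both properties tightened, it fails twice.

```python
def check(instance, schema):
    """Tiny illustration of two JSONSchema keywords: 'required' and
    'additionalProperties'. Not a real validator -- just enough to show
    why the lax defaults accept almost anything."""
    props = schema.get("properties", {})
    errors = []
    # 'required' defaults to an empty list: nothing is mandatory unless listed.
    for key in schema.get("required", []):
        if key not in instance:
            errors.append("missing required property: %s" % key)
    # 'additionalProperties' defaults to true: unknown keys are allowed.
    if schema.get("additionalProperties", True) is False:
        for key in instance:
            if key not in props:
                errors.append("additional property not allowed: %s" % key)
    return errors

lax = {"properties": {"posterDetails": {"type": "object"}}}
strict = {
    "properties": {"posterDetails": {"type": "object"}},
    "required": ["posterDetails"],
    "additionalProperties": False,
}

print(check({"foo": "bar"}, lax))     # [] -- the bogus Record validates
print(check({"foo": "bar"}, strict))  # both checks now fail
```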

jeancochrane commented 6 years ago

As a side note -- I believe the relevant Grout test was passing because it defined the schema correctly:

        cls.schema = {
            "$schema": "http://json-schema.org/draft-04/schema#",
            "title": "Item",
            "description": "An item",
            "type": "object",
            "properties": {
                "id": {
                    "description": "The unique identifier for a product",
                    "type": "integer"
                },
                "name": {
                    "description": "Name of the product",
                    "type": "string"
                }
            },
            "required": ["id", "name"]
        }
Failing Record:

        data = {
            'schema': self.record_schema.uuid,
            'occurred_from': timezone.now(),
            'occurred_to': timezone.now(),
            'geom': 'POINT(0 0)',
            'archived': False,
            'data': {
                'foo': 'bar',
                'baz': 'qux',
            },
        }

ddohler commented 6 years ago

I think this is exactly right, nice sleuthing! When I read point 1 in your diagnosis above, I had an "Ah hah!" moment, because I don't remember that ever being included but it makes perfect sense that it would need to be. I think if we want to fix this we may need to update the Record editing code so that it includes empty objects for blank forms. Alternatively, perhaps schemas-service.js needs to introspect the incoming schema and only add a form field to "required" if it contains required fields itself.

jeancochrane commented 6 years ago

Both of those solutions seem like good approaches to me! Is there any downside to implementing them both? Perhaps it's redundant, but I think it makes sense for empty forms to show up with an empty object, and also for forms that contain required fields to themselves be required.

RE: additionalProperties -- @flibbertigibbet raised a concern that making additionalProperties: false the default setting could raise inappropriate validation errors on Records whose schema had changed:

My suspicion is that unexpected extra fields may be allowed to validate, in case the schema version updates but the record does not, so that new/dropped fields won't produce errors [...] I could be wrong; I'm curious what derek might have to think on that one.

This seems like a reasonable concern to me, although when I try to think of concrete examples I can only come up with cases where it actually seems like a benefit to raise validation errors here. As I see it, if a Record's schema has changed, then it shouldn't be allowed to validate with old fields when the user edits it -- presumably that's why they changed the schema in the first place. I don't think this is the best possible behavior, and I still think full versioning for Records would help preserve the history of Records when a schema changes, but in the meantime I'm inclined to think the extra strictness of additionalProperties: false would be a good thing. Curious to hear what you think!

flibbertigibbet commented 6 years ago

This analysis sounds correct. I think a requirement for the DRIVER records was that existing records should always continue to validate, which may have been why server-side validation was left lax, but I could be wrong on that. I think @ddohler has some good ideas above on how to clean up the existing behavior.

If checking additionalProperties works for the demo implementation project, I think that would be a helpful addition to this library.

ddohler commented 6 years ago

re: validating not working as expected: I hear what you're saying about it making sense to always have empty forms; from the perspective of someone reading the JSON, it would certainly make life easier because you could figure out exactly what forms to expect without having to refer to the record's schema. I typed a big long thing in response to this, but at the end of it all, I realized one thing that made me want to do the second option (requiring forms that have required fields, but not generating blank forms): the current functionality stores an object that ends up being really close to what the user originally entered. I think that's a valuable characteristic for a form-entry application to have, and I'm a bit wary that breaking the symmetry between fields (which would still be omitted when null) and forms (which would be included-but-blank when null) might be the start of a slippery slope where we lose that ability to recover the user's original input easily.

re: additionalProperties: I don't have a lot to add except that this may be application-specific. When a user edits an old Record, are they presented with the form for the latest schema? When a user saves that Record, does its Schema reference get automatically updated to the latest schema? I think that either answering yes to both or no to both would be pretty reasonable (yes/no and no/yes seem a bit more complicated to understand), but I'm not sure under what situations users might expect one or the other to happen. Do any of the Grout components enforce any of these behaviors? If we're in the yes-yes case, then I think that's a good argument for keeping validation lax since the versioning issue is going to come up a lot, but if we're in the no-no case then we can probably outlaw additional properties because Records will always stick with their original Schema so we shouldn't run into versioning problems.

Another thought I had about this is that since it is configurable on a per-schema basis, perhaps we could allow the user to toggle it and put it under an "advanced" section? That might be more work than we can tackle this week, but it could be useful functionality to have.

Whichever way we end up going on additionalProperties, I think that dealing with the related content types will be a big improvement, because then we'll at least know that the Record's fields have some sort of overlap with the Schema's.

I feel like my thoughts on this are a bit scattered; I've tried to organize them as best I can, so I hope this is helpful!

jeancochrane commented 6 years ago

re: validating not working as expected: [...] the current functionality stores an object that ends up being really close to what the user originally entered. I think that's a valuable characteristic for a form-entry application to have...

That's a good point, and I think I agree. Sounds like the path forward here is to add some introspection logic when a schema is created, and if there are any required fields in a form, then make the form required, too -- but to leave the Record creation logic as-is so that forms don't get created if the user hasn't entered any data.
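For the record, the introspection step could look something like this (sketched in Python for illustration; the real change would live in schemas-service.js, and the function name here is made up): walk the top-level form properties and mark a form required whenever its subschema lists required fields.

```python
import copy

def require_forms_with_required_fields(schema):
    """Return a copy of a schema in which any form (top-level property)
    whose subschema lists required fields is itself marked required.
    Illustration only -- the real logic belongs in the schema editor."""
    out = copy.deepcopy(schema)
    required = set(out.get("required", []))
    for name, subschema in out.get("properties", {}).items():
        if subschema.get("required"):
            required.add(name)
    if required:
        out["required"] = sorted(required)
    return out

schema = {
    "properties": {
        "posterDetails": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
        "notes": {"type": "object", "properties": {}},
    }
}
print(require_forms_with_required_fields(schema)["required"])  # ['posterDetails']
```

Record creation stays as-is, so empty forms are still omitted from the saved data.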

re: additionalProperties: I don't have a lot to add except that this may be application-specific. When a user edits an old Record, are they presented with the form for the latest schema? When a user saves that Record, does its Schema reference get automatically updated to the latest schema? [...] If we're in the yes-yes case, then I think that's a good argument for keeping validation lax since the versioning issue is going to come up a lot, but if we're in the no-no case then we can probably outlaw additional properties...

This is a really interesting question! I didn't know the answer off the top of my head, so I ran a little test. Turns out the answers are:

  1. Yes -- when a user edits an old Record, they're presented with the form for the latest schema.
  2. No -- when they save that Record, its schema reference is not automatically updated to the latest schema.

So we're in an interesting yes-no middle ground. I agree with you that this is likely a behavior that different applications will want to manage differently, and eventually it'd be cool to allow the user to toggle it in the config. But for now, it makes the most sense to me to change this so that we're in a yes-yes case and leave validation lax. Keeping all else in the app constant, changing to no-no would mean that the user couldn't update Records to a new schema -- we'd have to expose new views to allow that to happen. That would be cool to do eventually, but it seems like a bigger lift right now.

So, the changes I'm proposing coming out of this thread are:

  1. Make forms required if any of their fields are required
  2. Update the schema to the latest version when saving an edited Record, and leave validation as-is

Does that sound reasonable to everyone?

ddohler commented 6 years ago

I think it does! Updating to the latest schema on edit should be fine because any fields that have been deleted will be accepted by additionalProperties being true, while any fields that have been updated or added will be taken care of by presenting the newest form.

There might be some edge cases when populating the form, with respect to fields getting changed to incompatible types, and I'm not 100% sure whether fields that have been deleted will be passed through or whether we are regenerating a completely new object (which would cause deleted fields to be removed on edit), but those are both potential problems with the status quo as well, and I don't think either of these two changes would make them worse.

jeancochrane commented 6 years ago

The clear next step for this issue is encapsulated in https://github.com/azavea/grout-schema-editor/issues/24. We may also want to do more thinking about schema validation in Grout, but there's not a clear direction for that yet. Closing for now.