VH-Lab / DID-matlab

Data Interface Database
Other
1 stars 1 forks source link

validation schema checking needed #33

Closed stevevanhooser closed 1 year ago

stevevanhooser commented 2 years ago

(We will discuss today, see https://github.com/VH-Lab/DID-matlab/blob/validation/docs/development/did-schema4.md)

altmany commented 1 year ago

I pushed a new version of did.database.m to the SQL branch, which implements schema validation (lines 428 and 909-1235), with the following caveats:

  1. 'DID_' and 'DID.' prefixes are ignored when comparing (case-insensitive) classnames. i.e. "demoA", "did_demoA" and "DID.DEMOA" are considered equivalent and will pass the classname validation rule. You can modify this behavior by setting IGNORE_DID_CLASS_PREFIX=false in did.database.m line 983.
  2. The queryable field is currently ignored. This will be added at a later stage.
  3. When validating super-classes, their own super-classes are not checked. The reason was that it was not clear when checking super-classes whether to validate it against the original (child) document (to validate the base fields) or against its super-class document (which defines the super-class's parents). The former seems more important, so the latter validation is skipped.
  4. The current code does not optimize/cache validation, so it is possible that the same super-class might be revalidated several times. Performance-wise the effect is minor at best, so I felt it was not worth the effort.
stevevanhooser commented 1 year ago

Thanks Yair

I'm checking these changes, I get an error when I run did.test.test_did_db_documents

It looks like did.database.all_doc_ids needs to take a branch as an argument and use the current branch if none is given.

I'll keep testing when this is fixed.

Thanks Steve

The full error trail is

Error in did.implementations.sqlitedb/do_get_doc_ids (line 230)
                      '   AND branch_id="' branch_id '"'];

Error in did.database/all_doc_ids (line 350)
            doc_ids = database_obj.do_get_doc_ids();

Error in did.database/validate_docs (line 911)
            all_ids = database_obj.all_doc_ids();

Error in did.database/add_docs (line 428)
            database_obj.validate_docs(document_objs);

Error in did.test.test_did_db_documents (line 29)
db.add_docs(docs);
altmany commented 1 year ago

I forgot to commit sqlitedb.m, it should be good now. You still get an error when you run did.test.test_did_db_documents, but this error is expected (because you forgot to define the document_version field in the base schema):

>> did.test.test_did_db_documents
Validating demoA doc 41268cb359de9b63_c0b7910d4270c0ef
Validating demoA doc 41268cb359de9b63_c0b7910d4270c0ef (superclass base)

Error using did.database/validate_doc_vs_schema
Dissimilar sub-fields defined/found for base field in demoA doc 41268cb359de9b63_c0b7910d4270c0ef (superclass base)
("datestamp,document_version,id,name,session_id" <=> "datestamp,id,name,session_id")

Error in did.database/validate_doc_vs_schema (line 1043)
                                database_obj.validate_doc_vs_schema(docProps, schemaStruct2, all_ids);

Error in did.database/validate_docs (line 942)
                database_obj.validate_doc_vs_schema(docProps, schemaStruct, all_ids);

Error in did.database/add_docs (line 428)
            database_obj.validate_docs(document_objs);

Error in did.test.test_did_db_documents (line 29)
db.add_docs(docs); 
altmany commented 1 year ago

Note that _all_docids returns all the doc_ids defined in the DB, not just the docs in the current/specified branch.

stevevanhooser commented 1 year ago

Thanks Yair

I noticed that the new validation schema were not yet merged into sql.

Before I merged, I got the following error. I don't get it after merging but I think it means that it's possible that validation in the function might not be defined under some scenarios.

Before I merged, with did.test.test_did_db_documents I get:

Error in did.database/get_document_schema (line 961)
                        error('DID:Database:ValidationFileMissing','Validation file "%s" not found',validation);

Error in did.database/validate_docs (line 939)
                schemaStruct = database_obj.get_document_schema(schema_filename);

Error in did.database/add_docs (line 428)
            database_obj.validate_docs(document_objs);

Error in did.test.test_did_db_documents (line 29)
db.add_docs(docs);

After merging, I'm getting no errors on our existing tests, which is great!

We will write some advanced testing over the next few days to try to intentionally make some errors and make sure that they have good error messages.

Thanks Steve

stevevanhooser commented 1 year ago

Hi Avi -

I pulled the new changes that Yair made into the branch valtest

I see a new test function

did.test.test_did_db_documents_invalid

But it only seems to test for one way for the documents to fail by messing up many fields in one document. It doesn't test many ways the documents might fail separately, like the did.test.test_did_db_queries method. Can you make it like that, where there are many modes that might be tested, and we can test that each one fails?

Right now, we can't be sure that the integer detection algorithm is the only part of the validator that is working. Does that make sense?

Thanks Steve

aviL221 commented 1 year ago

@stevevanhooser Hi Steve,

I am unsure about what output to expect from some of the invalidation tests. When I remove the fields document_properties.document_class, document_properties.document_class.definition, document_properties.document_class.validation, document_properties.document_class.property_list_name, and document_properties.document_class.class_version, it does not recognize the documents as invalid. However, when I remove the field document_properties.document_class.class_name, the subsequent documents are invalidated. This feels somewhat inconsistent - which of these outcomes (or both, or neither) is correct?

% no error (passes validation):
[b,msg] = did.test.test_did_db_documents_invalid('remover','document_class');
[b,msg] = did.test.test_did_db_documents_invalid('remover','definition');
[b,msg] = did.test.test_did_db_documents_invalid('remover','validation');
[b,msg] = did.test.test_did_db_documents_invalid('remover','property_list_name');
[b,msg] = did.test.test_did_db_documents_invalid('remover','class_version');
% produces error (does not pass validation):
[b,msg] = did.test.test_did_db_documents_invalid('remover','class_name');

Thanks, Avi

stevevanhooser commented 1 year ago

Hi Avi -

This is a good list. Do all the other fields act as expected, like removing the base field (should cause an error) or removing the ID of a base field or leaving it empty, etc?

Thanks Steve

aviL221 commented 1 year ago

So far I have 29 tests and other than the ones I pointed out, they acted as I expected (they were caught as invalid):

% result of all tests:
[test_results,test_messages] = run_test_invalid();

Removing the base field is invalidated, as expected:

% this test passes - the modified document(s) is(are) invalidated:
[b1,msg1] = did.test.test_did_db_documents_invalid('remover','base');

Removing the ID of a base field is invalidated, as expected:

% this test passes - the modified document(s) is(are) invalidated:
[b2,msg2] = did.test.test_did_db_documents_invalid('remover','id');

Leaving the ID empty is invalidated, as expected:

% these tests pass - the modified documents are invalidated:
[b3,msg3] = did.test.test_did_db_documents_invalid('value_modifier','blank int');
[b4,msg4] = did.test.test_did_db_documents_invalid('value_modifier','blank str');
stevevanhooser commented 1 year ago

Cool. Do you think we have a complete list to send now (the tests mentioned above?)

aviL221 commented 1 year ago

I think it's a complete list of all possible field removals - however I think there are still some field modifications that I haven't implemented yet

stevevanhooser commented 1 year ago

Hi Yair -

The validation seems to work perfectly except for a couple of cases.

My tech Avi writes: "When I remove the fields document_properties.document_class, document_properties.document_class.definition, document_properties.document_class.validation, document_properties.document_class.property_list_name, and document_properties.document_class.class_version, it does not recognize the documents as invalid. However, when I remove the field document_properties.document_class.class_name, the subsequent documents are invalidated.

% no error (passes validation) but should be an error:
[b,msg] = did.test.test_did_db_documents_invalid('remover','document_class');
[b,msg] = did.test.test_did_db_documents_invalid('remover','definition');
[b,msg] = did.test.test_did_db_documents_invalid('remover','validation');
[b,msg] = did.test.test_did_db_documents_invalid('remover','property_list_name');
[b,msg] = did.test.test_did_db_documents_invalid('remover','class_version');
% produces error (does not pass validation):
[b,msg] = did.test.test_did_db_documents_invalid('remover','class_name');

document_class is a standard structure that exists in all documents, so it should be checked always, it is outside of what is specified in the schema. Does that make sense?

Thanks Steve

altmany commented 1 year ago

I think that you misinterpreted the error message and cause. This is what I get on my end:

>> [b,msg] = did.test.test_did_db_documents_invalid('remover','document_class');
Validating demoB doc 41268d0332bcba80_c0c70ff7315c9b43
Validating demoB doc 41268d0332bcba80_c0c70ff7315c9b43 (superclass base)
Validating demoB doc 41268d0332bcba80_c0c70ff7315c9b43 (superclass demoA)
Invalid non-numeric sub-field demoB.value found in demoB doc 41268d0332bcba80_c0c70ff7315c9b43
Error due to one of the following modifiers: 
value_modifier:'sham'
id_modifier:'sham'
datestamp_modifier:'sham'
session_id_modifier:'sham'
dependency_modifier:'sham'
remover:'document_class'

The important line is: Invalid non-numeric sub-field demoB.value found in demoB This error is due to the fact that in did.test.documents.make_doc_tree_invalid.m line 99 you set a string value into the field demoB.value, which is defined in the schema as "integer":

    docs{end+1} = did.document('demoB','demoB.value',int2str(counter),...

The Invalid non-numeric sub-field error is the only one that is reported by database.m - The rest of the messages below it are programmatic (in `did.test.test_did_db_documents_invalid.m lines 46-53) and are misleading because they have no relation to any removed fields.

aviL221 commented 1 year ago

Hi Yair, thanks for pointing this out - it seems I didn't update the remote repo before posting my issue.

Could you please try again with this updated list:

% no error (documents pass validation) despite being modified:
[b,msg] = did.test.test_did_db_documents_invalid('remover','document_class'); 
[b,msg] = did.test.test_did_db_documents_invalid('remover','definition');
[b,msg] = did.test.test_did_db_documents_invalid('remover','validation');
[b,msg] = did.test.test_did_db_documents_invalid('remover','property_list_name');
[b,msg] = did.test.test_did_db_documents_invalid('remover','class_version');
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid definition'); %replaces the definition field in document_class with a nonsense string
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid property list name'); %replaces property_list_name with a nonsense string
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','new class version number'); %sets class_version to 2 instead of 1
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','class version string'); %sets class_version to a nonsense string
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid base name'); %sets name field in base class to a nonsense string
stevevanhooser commented 1 year ago

Hi Avi - same question here, which branch is this code on? Thanks

aviL221 commented 1 year ago

This should be the branch "valtest"

stevevanhooser commented 1 year ago

This is now merged into sql.

Yair, the errors are

% no error (documents pass validation) despite being modified:
[b,msg] = did.test.test_did_db_documents_invalid('remover','document_class'); 
[b,msg] = did.test.test_did_db_documents_invalid('remover','definition');
[b,msg] = did.test.test_did_db_documents_invalid('remover','validation');
[b,msg] = did.test.test_did_db_documents_invalid('remover','property_list_name');
[b,msg] = did.test.test_did_db_documents_invalid('remover','class_version');
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid definition'); %replaces the definition field in document_class with a nonsense string
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid property list name'); %replaces property_list_name with a nonsense string
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','new class version number'); %sets class_version to 2 instead of 1
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','class version string'); %sets class_version to a nonsense string
[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid base name'); %sets name field in base class to a nonsense string

Thanks Steve and Avi

altmany commented 1 year ago

The fact that you remove or modify certain document fields does not (by itself) make the document invalid. It depends on the document's schema - if the schema does not require the deleted field etc. there is no validation problem.

So I don't understand what the problem is. You need to be more verbose in what you ask me, because I am not there next to you and I cannot read your minds [at least yet...]. For example, you need to say something like:

"in file abcd.json we define rule XYZ and yet although in file efgh.m line 123 we remove this field from document Doc_123, the document passes validation when we add it to the database."

Without such verbose explanation you force me to go on a fishing expedition to try to figure out what you mean and what the problem is (if there is in fact a problem). This is both very inefficient and quite frustrating.

stevevanhooser commented 1 year ago

Hi Yair -

I'm sorry it's frustrating. I think I see the issue.

The document_class field in the document JSON code is not explicitly in the schema, but it is part of every document and is part of an implied schema.

Let's follow an example. Let's look at an example document type, demoA. Here is the schema:

DID-matlab/example_schema/demo_schema1/database_schema/demoA.schema.json

{
    "classname":    "demoA",
    "superclasses": [ "base" ],
    "depends_on": [ ],
    "file": [ ],
    "demoA": [
        {
            "name": "value",
            "type": "integer",
            "default_value": 1,
            "parameters": [-100000,100000,0],
            "queryable": 1,
            "documentation": "A value field for the demoA class."
        }
    ]
}

And here is an example document:

{
  "document_class": {
    "superclasses": {"definition": "$DIDDOCUMENT_EX1/base.json"},
    "class_version": 1,
    "definition": "$DIDDOCUMENT_EX1/demoA.json",
    "property_list_name": "demoA",
    "class_name": "demoA",
    "validation": "$DIDSCHEMA_EX1/demoA.schema.json"
  },
  "demoA": {"value": 1},
  "base": {
    "datestamp": "2023-03-05T17:25:30.102Z",
    "name": "",
    "session_id": "",
    "id": "41268d0d73bbd6f9_c0ddad560858299e"
  }
}

Now suppose someone wants to pass off the following JSON code as a valid document, let's call it X:

{
  "document_class": {
    "superclasses": {"definition": "$DIDDOCUMENT_EX1/base.json"},
    "class_version": 1,
    "definition": "$DIDDOCUMENT_EX1/demoA.json",
  },
  "demoA": {"value": 1},
  "base": {
    "datestamp": "2023-03-05T17:25:30.102Z",
    "name": "",
    "session_id": "",
    "id": "41268d0d73bbd6f9_c0ddad560858299e"
  }
}

X is not a valid document. It's missing the fields document_class.property_name_list and document_class.class_name with proper values that are built when the document demoA is built. But, document_class is not specified on a field by field basis in the schema, but rather its values depend on what occurs when a properly constructed demoA is built. If one made an example document with d = did.document('demoA');, it would have a properly constructed document_class field.

It's a bit challenging to check this. Probably the easiest thing is to actually build a blank document and just check that it matches.

Does this make sense? document_class has meta requirements, which is why it is confusing.

Best Steve

altmany commented 1 year ago

The latest committed version in the sql branch now reports an error for:

[b,msg] = did.test.test_did_db_documents_invalid('remover','document_class'); 
[b,msg] = did.test.test_did_db_documents_invalid('remover','property_list_name');
[b,msg] = did.test.test_did_db_documents_invalid('remover','class_version');

...and also when removing the document's document_class.class_name field (or setting it to empty)

However, no error is reported for these (the reason for each is stated in a comment):

[b,msg] = did.test.test_did_db_documents_invalid('remover','validation');
%A document with no defined validation file is still valid
%(however, an error is reported if a file is specified but not found)

[b,msg] = did.test.test_did_db_documents_invalid('remover','definition'); 
%A document with no definition is still valid
%The document_class.definition field is not used anywhere in the code

[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid definition');
%A document with a bad definition is still valid
%The document_class.definition field is not used anywhere in the code

[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid property list name'); 
%I have no idea how a valid value is defined, so we only check if the field is missing or empty.
%The document_class.property_list_name field is not used anywhere in the code

[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','new class version number');
%I have no idea how a valid value is defined, so we only check if the field is missing or empty
%The document_class.class_version field is not used anywhere in the code

[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','class version string');
%I have no idea how a valid value is defined, so we only check if the field is missing or empty
%The document_class.class_version field is not used anywhere in the code

[b,msg] = did.test.test_did_db_documents_invalid('other_modifier','invalid base name');
%I have no idea how a valid value is defined, so the document_properties.base.name field is not currently checked
%The document_properties.base.name field is not used anywhere in the code

Let me know if/how you wish me to process each of these.

stevevanhooser commented 1 year ago

Hi Yair and Avi -

Right now we have no more tests to put forward, things are working well as far as we can tell! Thanks

Steve