Current thinking on validation

dladd commented 8 years ago

In a meeting at the ABI on Tuesday, there was some discussion about how we might implement validation as the next step in the use-cases doc.

The team here was in favour of (A) setting up a Validator object, similar to how we deal with the Parser in the changeset under review in #114 . Consistent with this, it was also suggested that we should probably move our serialisation routines into a separate Serialiser object (we might want to call this Printer or something else instead), rather than doing serialisation from the objects themselves.

For example, a user wishing to check if a model is valid, output it if so or output errors if not might do:

    libcellml::Validator validator;
    libcellml::Serialiser serialiser(libcellml::Format::XML);
    validator.validate(model);
    if (validator.hasErrors()) {
        for (size_t i = 0; i < validator.errorCount(); ++i) {
            std::cout << validator.getError(i)->getDescription();
        }
    } else {
        std::cout << serialiser.serialise(model);
    }

An alternative (B) might be to do validation on entities as we currently do for serialisation, for example:

    model->validate();
    if (model->hasErrors()) {
        for (size_t i = 0; i < model->errorCount(); ++i) {
            std::cout << model->getError(i)->getDescription();
        }
    } else { 
        std::cout << model->serialise(libcellml::Format::XML);
    }

While B is a bit more concise, the thinking is that it is worth the extra line or two to make our reading, writing, and validation error logging consistent. And Parser may need to pick up errors that do not necessarily translate into an Entity (e.g. unrecognised or mismatched XML tags).

With the validator, there is also the question of when/if to automatically clear errors. For instance, consider if we call the following validation on two very simple models:

    libcellml::ModelPtr m1 = std::make_shared<libcellml::Model>();
    libcellml::ModelPtr m2 = std::make_shared<libcellml::Model>();
    libcellml::Validator validator;
    validator.validate(m1);
    validator.validate(m2);
    m1->setName("model1");
    validator.validate(m1);

validator might now contain the following errors regarding the requirement for valid models to have names:

(1) no errors if we clear validation errors every time validate() is called.
(2) errors from m1 and m2 if we allow validation errors to persist until the user calls validator.clearErrors()
(3) an error from m2 only if we remove any previous errors pertaining to the object being validated from the validator and add back any new errors from the updated object.

I don't have a strong preference but would probably go with 3.

jonc125 commented 8 years ago

I'm very much in favour of (A). For consistency doing the same for serialization would make sense, but I wouldn't make this a high priority!

Regarding clearing errors, I don't really see any benefit to doing something more complex than (1). Surely the most common case is that you want to report errors straight after validating a model? If you do want to validate several models and keep all errors, you can create multiple Validator instances to do so.

hsorby commented 8 years ago

I'm in favour of (A), separate enablers for the model. For me it keeps the model classes clean and encapsulated better, allows for more consistency if the parser is definitely an external object, and other people can use the same form if they need to add anything bespoke.

agarny commented 8 years ago

I am in favour of (A) and (1). (Agreed with Jonathan regarding (1).)

nickerso commented 8 years ago

A and 1 for me too.

hsorby commented 8 years ago

I am in favour of option 2. I wouldn't want the errors to be cleared until I said so.

jonc125 commented 8 years ago

Depends what you mean by "said so" - for me, asking to validate a different model means you no longer care about errors from a previous model!

hsorby commented 8 years ago

Yes in the main I would agree but we would be restricting the ways in which the library can be used. By requiring a developer to explicitly clear the errors and leaving errors to accumulate you are not restricting them to single passes. Also, a lot of other libraries work this way where errors are not reset until instructed.

nickerso commented 8 years ago

While I still believe errors should be cleared whenever validate() is called, perhaps a better example for the alternate is not model-level validation but thinking of the case in the future where a user might want to validate a series of components manually?

But I still can't see why you'd want to have the added overhead of tracking errors from multiple sources in the same validator instance?

hsorby commented 8 years ago

So then let's just have a call to clear errors at the start of the validate method and we are in a fine position going forward. Sorted.

dladd commented 8 years ago

In the case of the Parser, we decided to only parse from the model level and save parsing from other entities for later (if ever). However, am I correct in thinking we don't necessarily want to do the same for the Validator? I would think it would be quite useful for a user to, for instance, check if a component they just modified or constructed is valid before adding it to a model.

hsorby commented 8 years ago

For me that is a second level priority. Let's not concern ourselves with that at this time, that functionality can easily be added once the validator can validate models.

nickerso commented 8 years ago

I agree, getting validation of a model is the top priority.

dladd commented 8 years ago

Fair enough- I suppose it is difficult to define what a 'valid' state is for an entity outside of its model context anyway. For instance, there is no way to determine if a variable initial_value as a variable reference is valid without knowing what the other variables in the parent component are.

dladd commented 8 years ago

How far do we want to go with MathML validation at this stage? I think we should probably try to validate the math string by parsing it with libxml2 + the W3C MathML DTD. That will also give us a something to test the MathML printer/parser with when we move on to integrating a CAS.

nickerso commented 8 years ago

yep - that seems reasonable. It is also needed to be able to easily iterate over ci elements and cellml:units attributes to check them.

dladd commented 8 years ago

In a quick discussion at the ABI last week, it was brought up that it would be useful for the Validator to be able to point to the particular Specification rule a given validation error violates. As an example, let's consider an unnamed model, which violates rule 4.2.1:

Every model element MUST contain a name attribute in the CellML namespace. The value of the name attribute MUST be a valid CellML identifier, and SHALL be interpreted as a unique identifier for the CellML infoset.

I think it makes sense to add another attribute to our Error class- say mSpecificationRule. This would allow us to specify "why" the error occurred with mSpecificationRule pointing to the rule above, "what" went wrong with mDescription i.e.,

Model does not have a valid name attribute.

and "where" the error occurred by allowing the the user to do error->getModel(). mSpecificationRule could then be:

(A) A string, most likely "4.2.1" in this case
(B) A member of an enum, perhaps SpecificationRule::ModelName, which could then be mapped to a string for reporting.

Personally, I don't think this needs to be more complicated than (A) unless we still expect the Spec rule ordering to shift substantially.

agarny commented 8 years ago

(A) would indeed be nice, but I would find it even more useful if libCellML could also provide us with the URL to the full specification rule (unless we could safely determine that URL from mSpecificationRule?).

dladd commented 8 years ago

I would find it even more useful if libCellML could also provide us with the URL

Just had a quick chat with @nickerso and it sounds like a unique identifier (B) might be the way to go. Then we can have separate methods to convert that identifier to the section title, a URL, or the full rule with a map/dictionary.

Before we start adding URLs, we probably want to wait until the Specification's resting place is agreed upon (there is apparently some discussion ongoing on whether to keep it as a GoogleDoc for ease of contribution). But we can start with the mSpecificationRule and a method to get the section title. Other maps and methods can be added later.

dladd commented 8 years ago

Another minor issue in terms of implementing our Validator (and eventually the Printer) as a separate class is that we don't currently have a Unit object in the library. We just have add/remove/countUnit() methods in the Units class.

I can see why this was done- we don't want to confuse a user with the differences between Unit and Units and the manipulations we need to do to a given Unit are relatively limited. But we need to be able to getUnit if we want to validate/print outside of the Units class.

If we want to keep Unit private but still do all the validation/printing in the separate enablers, we could take a similar approach to how we deal with private XML classes in the Parser:

(1) Make a private Unit class, which we #include in validator.cpp. Methods that use Unit will go in the private implementation of the Validator/Printer.

Other possibilities are:

(2) Create a public Unit object
(3) Do validation/printing of each Unit in the Units class.

I would personally be fine with (3) as the most expedient option but (1) is probably the most "correct" way to sort this out.

nickerso commented 8 years ago

(3) is not only the most expedient, but also the "correct" way in my view of libCellML.

dladd commented 8 years ago

Ah OK maybe I was overthinking this then. Is just a case of the more straightforward approach being the correct one?

agarny commented 8 years ago

Just had a quick chat with @nickerso and it sounds like a unique identifier (B) might be the way to go. Then we can have separate methods to convert that identifier to the section title, a URL, or the full rule with a map/dictionary.

I actually agree with that. My bad for originally suggesting an 'improved' version of (A).

Otherwise, agreed on (3) too.

dladd commented 8 years ago

Building on the decision to go with (3)(validation of each Unit from within the Units class), we'll need to get access to the unit-validating method (Units::getUnitValidationErrors()) from the Validator. But we also probably don't want to publicly expose that method- would this be a reasonable situation to use friend?

agarny commented 8 years ago

When I originally saw the friend keyword, I immediately thought "no way" and decided to re-read the latest messages and... I get the feeling that option (3) (i.e. do validation/printing of each Unit in the Units class) goes against what we had originally agreed and what has already been implemented in the Validator class so far.

Indeed, from what I remember and see in the Validator class, we do all the validation at the Validator class level and not at the Model, Component, etc. class levels. So, why should we be doing things differently for units? In other words, I would expect units validation to be done in the Validator class and not in the Units class.

... or am I missing something?...

dladd commented 8 years ago

That was my original thinking behind suggesting (1):

(1) Make a private Unit class, which we #include in validator.cpp. Methods that use Unit will go in the private implementation of the Validator/Printer.

I was trying to make it so we could have all our validation operations within the Validator. The issue was that we need to get access to each Unit in a given Units during validation but we don't have a Unit object- just a struct Unit in Units that contains the necessary attributes (prefix, exponent, etc.).

My thinking with (3) is that (1) might overcomplicate things for what is ultimately one or two methods. But for (3) we also need to get access to a Units method from the Validator so friend is a workaround to avoid publicly exposing an internal method.

Of course, this is also open to any further suggestions :-)

agarny commented 8 years ago

Well, (3) is not an option for me anymore since it effectively breaks our agreed approach with regards to the Validator class.

So, maybe the solution consists of making the Validator class a friend (!!) of the Units class, and have in the Units class some private methods that will allow the Validator class to retrieve anything it needs about a given Unit?

nickerso commented 8 years ago

Just thinking about this, I think this is the same issue that is going to crop up when we want to actually do anything with Units so it might be worth getting a "better" solution than befriending classes. But not going quite so far as adding a Unit class to the public API.

How about adding a method like this:

Units::getUnit(int index, std::string& name, double& prefix, double& exponent, double& multiplier, double& offset) const;

or maybe:

const std::string& Units::getUnitName(int index);
double Units::getUnitPrefix(int index);
double Units::getUnitExponent(int index);
...

or both to match the addUnit methods.

agarny commented 8 years ago

@nickerso, I indeed had something like that in mind, but... then the question is whether we want libCellML users to be able to access those methods? I thought the answer was no and that it was the reason the Unit structure is declared in the units.cpp and not in units.h.

So, if we don't mind libCellML users to have access to those methods, then your suggestion looks fine to me. On the other hand, if we don't want libCellML users to have access to them, then we would have no other choice but to make Validator a friend of Units.

dladd commented 8 years ago

I think @nickerso's getUnit() and getUnitAttribute() suggestions are the most logical so far. And they might actually be useful for a user wanting to, for instance, check the units reference in order to remove a specific unit.

I think I personally have a slight preference for the getUnitAttribute() method as I see that being more useful in practice but getUnit() matches up better with addUnit() so I am happy either way.

nickerso commented 8 years ago

I'm now leaning toward users have access to that information, as I think this is the "cleanest" way to expose this information. And if we want to keep the math handling external to libCellML, then we'll likely need access to that information without being able to use friend.

agarny commented 8 years ago

+1 for Units::getUnit() being public.

dladd commented 8 years ago

The current plan is to use the MathML DTD to separately validate our opaque math string. However, when parsing a valid CellML model, we may end up with a situation where the cellml namespace has been declared in a parent element (model, component). Then the math string is not self-contained and we would get validation errors for any cellml:units attributes.

I had a chat with @nickerso and a possible solution might be to just say that we expect a valid math string to be independently valid. During parsing, We would copy any namespaces declared in parent elements into <math>.

In addition to helping out with validation, this "math as a valid subdocument" requirement will probably also be useful if we want to hang on to units attributes when parsing our math string into a CAS.

agarny commented 8 years ago

I had a chat with @nickerso and a possible solution might be to just say that we expect a valid math string to be independently valid. During parsing, We would copy any namespaces declared in parent elements into <math>.

I am confused, are we not removing non-MathML elements from the math string anymore (see here)?

If not, then copying the namespaces into <math> should work and should indeed be useful for anything that needs access to units attributes in the future.

In addition to helping out with validation, this "math as a valid subdocument" requirement will probably also be useful if we want to hang on to units attributes when parsing our math string into a CAS.

I clearly missed it the first time round, but what does CAS stand for exactly?

nickerso commented 8 years ago

The plan is still to remove non-MathML element to validate the MathML itself, but we also need to validate the units. But to validate the units we need to keep track of the namespace of the units attributes - information that is lost if not contained in the math string.

CAS = computational algebra system.

dladd commented 8 years ago

Edit: this comment is essentially a simultaneous duplicate of @nickerso 's above

are we not removing non-MathML elements from the math string anymore?

That's still the plan but in order to get to our cellml:units attributes, we parse the document as normal XML and have the cellml namespace declared. Otherwise, libxml throws an undefined namespace error and ignores those attributes.

what does CAS stand for exactly?

Sorry that's a Computer Algebra System (see #95) for symbolic manipulation. Current thinking is still leaning toward SymEngine.

agarny commented 8 years ago

Ok, thanks for the clarification and the meaning of CAS.

cellml / libcellml

Current thinking on validation #118