documaster / noark-extraction-validator

GNU Affero General Public License v3.0

Customized schemas with optional parameter #25

Closed: solfeggietto closed this issue 7 years ago

solfeggietto commented 7 years ago

For debugging and enhanced documentation, an option to use customized Noark 5 schemas is needed.

New parameter: -custom-schema-location /path/to/custom/Noark5-schemas/directory

The user shall ALWAYS use the default schemas to document compliance, or the lack of it, with the standard schemas. The custom schemas are intended for debugging and for documenting the expected errors that the customized schemas have been adapted to validate against. The changes in the resulting report will then document to what extent the changes in the customized schemas target those errors.

ivaylomitrev commented 7 years ago

I am writing this comment in an attempt to clarify (for myself and future users) the requested functionality and, possibly, introduce a minor modification to its specification. In previous discussions it has been requested that the new parameter -custom-schema-location replace (some or all of the) Noark XSD schemas during the validation process if they are found in the directory pointed to by the parameter.

Currently, the tool validates the provided XML files against two different sets of XSD schemas:

  1. The schemas distributed with the extraction package (incompatibilities with these are reported as warnings with schema type PACKAGE)
  2. The Noark 5 schemas (incompatibilities with these are reported as errors with schema type NOARK)

As a result of the new feature, the schemas described in 2. may (partially or fully) be replaced by the custom ones leading to the following changes in the validation procedure where the Noark schemas are concerned:

  1. The schemas distributed with the extraction package (incompatibilities with these are reported as warnings with schema type PACKAGE)
  2. A mixture of the Noark 5 schemas and the referenced custom schemas (incompatibilities with these are reported as errors with schema type NOARK or schema type CUSTOM)

My biggest concern with the above is that a user unfamiliar with the feature may easily be deceived or confused by a validation report produced by the tool. A validation report produced by a run with custom schemas would be inherently unreliable, but this would not be immediately recognizable. The only differences would be that the -custom-schema-location parameter is listed in the execution information and that some of the errors reported in the tests validating the structure of the XML files are tagged as CUSTOM (instead of NOARK). In my opinion it would only be a matter of time until a user requests a clarification of the functionality or a validation report is mistakenly considered trustworthy.

In order to prevent the above I suggest that the custom schemas not override the Noark schemas during the validation process. I suggest that the custom schemas simply provide another layer of validation producing the following validation procedure:

  1. The schemas distributed with the extraction package (incompatibilities with these are reported as warnings with schema type PACKAGE)
  2. The Noark 5 schemas (incompatibilities with these are reported as errors with schema type NOARK)
  3. (Optional) The custom schemas provided via the -custom-schema-location parameter (incompatibilities with these are reported as errors with schema type CUSTOM). If not all of the required schemas are found in the specified directory, the Noark schemas can be used as a fallback.

The suggested procedure would eliminate the possibility of confusion and would produce a validation report that provides compliance information related to the bundled schemas, the Noark schemas, and the custom schemas in a single place.
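To make the proposal concrete, here is a minimal sketch of the layering and the fallback rule. It is not the tool's actual code: the function names and the two schema file names are assumptions chosen for illustration, and the real list of required schemas is whatever the tool defines.

```python
from pathlib import Path

# Illustrative file names only; the tool defines the real list of required schemas.
REQUIRED_SCHEMAS = ("arkivstruktur.xsd", "metadatakatalog.xsd")


def list_xsd_files(directory):
    """Return the names of the .xsd files directly inside `directory`."""
    return {p.name for p in Path(directory).glob("*.xsd")}


def pick_custom_layer_sources(custom_available, required=REQUIRED_SCHEMAS):
    """Map each required schema to the set it is taken from in the CUSTOM layer.

    A schema present in the custom directory is used as-is; anything missing
    falls back to the corresponding Noark 5 schema.
    """
    return {
        name: ("CUSTOM" if name in custom_available else "NOARK")
        for name in required
    }


def validation_layers(custom_available=None):
    """Return (schema type, severity) pairs in the order the layers are applied."""
    layers = [("PACKAGE", "warning"), ("NOARK", "error")]
    if custom_available is not None:  # -custom-schema-location was given
        layers.append(("CUSTOM", "error"))
    return layers
```

With a real directory, the input would come from something like `list_xsd_files("/path/to/custom/Noark5-schemas/directory")`. The point of the sketch is that the NOARK layer is always present and is never altered by the custom directory.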

@solfeggietto, @douzounov would you agree with my reasoning or would you prefer that the Noark schemas get replaced by the custom schemas?

solfeggietto commented 7 years ago

Using CUSTOM as a 3rd set of XSD schemas is an excellent solution to this issue.

Remarks on the Excel report (and the corresponding sections of the XML and/or PDF reports):

Execution: A description of the CUSTOM XSD schemas should be imported and displayed here (for example from an optional text file "description.txt" in that schema folder).
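A rough sketch of how such an optional description could be picked up, assuming the file is plain UTF-8 text. The function name is hypothetical and the file name is only the example suggested above, so neither is a confirmed part of the tool.

```python
from pathlib import Path


def read_custom_description(schema_dir, filename="description.txt"):
    """Return the description text from the custom schema folder, or None.

    The file is optional: a missing folder, a missing file, or an empty file
    all yield None so the report can simply omit the description.
    """
    path = Path(schema_dir) / filename
    if not path.is_file():
        return None
    text = path.read_text(encoding="utf-8").strip()
    return text or None
```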

Summary: Accumulated numbers for Information, Warnings and Errors should be shown in total as well as individually for NOARK, PACKAGE and CUSTOM (if this parameter is used). In this way it will be clearly shown and documented whether a CUSTOM correction (for example date vs datetime and avskrivningsmaate) has resolved the 20 000 errors that were targeted by those changes in the CUSTOM XSD schemas. Some sort of categorization should be done on the individual integrity tabs (like P4) to show how many errors are detected for the different types and what kind of error each is (an additional issue describing those details may be added).
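The summary described here could be tallied roughly as follows. This is only a sketch: the shape of a finding (a dict with `schema_type` and `severity` keys) and the lowercase severity names are assumptions, not the tool's report model.

```python
from collections import Counter

SEVERITIES = ("information", "warning", "error")
SCHEMA_TYPES = ("PACKAGE", "NOARK", "CUSTOM")


def summarize(findings, schema_types=SCHEMA_TYPES):
    """Count findings per schema type and severity, plus a TOTAL row."""
    counts = Counter((f["schema_type"], f["severity"]) for f in findings)
    summary = {
        t: {s: counts[(t, s)] for s in SEVERITIES}
        for t in schema_types
    }
    summary["TOTAL"] = {
        s: sum(summary[t][s] for t in schema_types)
        for s in SEVERITIES
    }
    return summary
```

Comparing the NOARK and CUSTOM error counts side by side is what would show whether a targeted change in the custom schemas removed the errors it was meant to address.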

A diff of the 3 sets of XSD schemas may be implemented later. The README section may suggest how best to use CUSTOM (for example, strongly recommending that the NOARK XSD schemas be used as the template for any set of CUSTOM XSD schemas).

ivaylomitrev commented 7 years ago

The requirements have been implemented and will be included in release 0.2.0.