NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0
15 stars 11 forks source link

Validate should not perform attribute value checks outside of schematron #540

Open jordanpadams opened 1 year ago

jordanpadams commented 1 year ago

🐛 Describe the bug

Current software does an explicit check for in lid reference, but schematron should be handling this, not validate source code.

I contact you as we have encountered a problem when running different versions of the PDS Validate Tool with the ExoMars2016 SPICE PDS4 Bundle (here the compressed bundle tar file).

The situation is that when we use the following command for running the validate 2.0.6, we get a clean run with no errors (validate result is attached): validate-2.0.6/bin/validate --rule pds4.bundle -t em16_spice/ -m 1B00 > em16_spice_206.validate

However, when we use the same command with the validate version 2.2.0, we get the following errors: validate-2.2.0/bin/validate --rule pds4.bundle -t em16_spice/ -m 1B00 > em16_spice_220.validate

 FAIL: file:/Users/aescalante/spice/missions/exomars/archives/npb_20220701/em16_spice_005/em16_spice/spice_kernels/mk/em16_v002.xml
  ERROR  [error.validation.invalid_field_value]   Unexpected carriage returns in tag 'lid_reference' with value 'urn:esa:psa:em16_spice:spice_kernels:ck_em16_tgo_hga_scm_20160315_20161101_s20190703_v01.bc'

This is indicating that there must have been an update in the PDS4 IM that led to treat split 'lid_reference' with lines to be treated as an error and that this was implemented in the Validate Tool from version 2.0.6 to 2.2.0.

🕵️ Expected behavior

This should and does throw an error with the latest version of the PDS4 IM, but we should not code this into validate source

📚 Version of Software Used

🩺 Test Data / Additional context

🏞Screenshots

🖥 System Info


🦄 Related requirements

⚙️ Engineering Details

msbentley commented 1 year ago

I agree that this check should be in Schematron in general, so that it can be caught by tools other than validate. But validate is of course always adding in checks that are not in Schematron (e.g. looking at the data), and incrementally improves over time, such that when we re-run validate on old data we may catch errors not previously discovered.

In this particular case (invalid characters in LIDs) the standards have always been very clear, and I assume the intent of validate is to enforce the standards? So regardless of where and how the check is coded, I would hope that this critical check is always performed, regardless of IM version.

More importantly, users have an expectation that well-formed data products follow the standards and can be read by coding tools according to those standards.

bsemenov commented 1 year ago

PDS4 is built on IM, not good intentions. All tools must honor each specific IM version as it was implemented, at least in all aspects defined by schema and schematron. Not having DOIs and lots of other really needed things in earlier IM versions can be considered bugs too. We fixed them but do not make tools enforce later fixes for products done with earlier IM versions. This is no different.

jordanpadams commented 1 year ago

@msbentley @bsemenov I actually entirely agree with both of you. I think all of these instances should be considered bugs (e.g. actual bugs in the schematron, missing metadata, etc.), and validate should catch them so they can be clearly documented for the user community. However, I also understand when you are an archivist, you want to be able to run validate against your data to see how valid it is for that point in time.

I see a future enhancement to validate and the overall PDS4 system here where we have 2 validation runs and reports:

  1. the one the archivist executes/documents to ensure their archive is as valid as it can be with the appropriate snapshot in time of the information model (for the archivist)
  2. the one the system executes/documents to more clearly describe to a user how this archive differs from the current information model (for the end user)

The former is something @bsemenov is asking us to support with validate, and the latter would be something our system would need to support, and will most likely need to be tracked through the Registry.