NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0
15 stars 11 forks source link

Validate is not correctly reading online schemas/schematrons when they are not provided via command-line #879

Open rgdeen opened 2 months ago

rgdeen commented 2 months ago

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

I have been doing bundle validation of my VGR PDART delivery using a local copy of all the DDs. Just before final delivery, I decided to try it with the delivered version of the DDs (i.e. let validate pick up the DD's from the web). Much to my surprise, it caught an error in that has been there since the beginning!

Looking into it, I discovered that my local DD copies had, for the PDS core only, PDS4_PDS_1G00.xsd.xml and PDS4_PDS_1G00.sch.xml ... i.e., a spurious .xml extension. This means the PDS core DD was not available in my local directory.

Jordan said he thought it was supposed to default back to the web if a DD was not found, but that's apparently not the case, as the bad should have been flagged in this case. I suspect it simply had no PDS core xsd/sch files to work with!

I'm surprised this didn't cause other issues... but there should at least be an info message printed if a LDD is referenced but is not available. Perhaps it does... I seem to remember errors along those lines in the past... but if so that behavior does not seem to extend to the PDS core DD.

๐Ÿ•ต๏ธ Expected behavior

I expected it to notify me if the core xsd/sch was unavailable. At least a warning.

๐Ÿ“œ To Reproduce

Command line used was:

project/m20/rgd/pds4/validate/validate-2.1.4/bin//validate -target bundle --report-file bundle.valrpt --schematron /home/rgd/work/rav1ciun/validate/dd//*.sch --schema /home/rgd/work/rav1ciun/validate/dd//*.xsd -R pds4.bundle

for the local version, with the --schematron and --schema parameters removed for the use-online-DD version.

Here is another concrete example with latest version of validate (see test data for the data used):

$ validate-3.5.0-SNAPSHOT/bin/validate --target collection_data_seed.xml --schema ~/test/PDS4_IMG_1G00_1810.xsd --schematron ~/test/PDS4_IMG_1G00_1810.sch
...
Summary:

  1 product(s)
  0 error(s)
  0 warning(s)

  Product Validation Summary:
    1          product(s) passed
    0          product(s) failed
    0          product(s) skipped
    1          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

But it fails as expected when running without those LDDs:

$ validate-3.5.0-SNAPSHOT/bin/validate --target collection_data_seed.xml
...
Product Level Validation Results

  FAIL: file:/Users/jpadams/proj/pds/pdsen/workspace/validate/src/test/resources/github408/valid/data/collection_data_seed.xml
      ERROR  [error.label.schematron]   line 134, 26: The attribute pds:collection_type must be equal to one of the following values 'Browse', 'Calibration', 'Context', 'Data', 'Document', 'Geometry', 'Miscellaneous', 'SPICE Kernel', 'XML Schema'.
        1 product validation(s) completed

Summary:

  1 product(s)
  1 error(s)
  0 warning(s)

  Product Validation Summary:
    0          product(s) passed
    1          product(s) failed
    0          product(s) skipped
    1          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

  Message Types:
    1            error.label.schematron

๐Ÿ–ฅ Environment Info

RHE Linux:

mipldevlinux13% uname -a                                                                                                                                                                 Linux mipldevlinux13 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Thu Feb 1 03:51:05 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
mipldevlinux13% more /etc/redhat-release 
Red Hat Enterprise Linux release 8.9 (Ootpa)

๐Ÿ“š Version of Software Used

Validate 2.1.4

mipldevlinux13% /project/m20/rgd/pds4/validate/validate-2.1.4/bin//validate --version

gov.nasa.pds:validate
Version 2.1.4
Release Date: 2021-12-23 16:39:55

Copyright 2019, by the California Institute of Technology ("Caltech").
All rights reserved.

๐Ÿฉบ Test Data / Additional context

test.tar.gz

๐Ÿฆ„ Related requirements

โš™๏ธ Engineering Details

jordanpadams commented 2 months ago

@rgdeen I was able to replicate. We will add to the backlog.

As a note, we highly recommend using the online schemas and the latest version of validate wherever possible to ensure accurate validation of products.

al-niessner commented 2 months ago

@jordanpadams

Is this a problem?

  1. the version being used is ancient 2.1.4. We are on 3.
  2. your example of 3.5 shows it working as the user requested so is this not just have user upgrade
  3. if a fix is expected then it must be to the older code 2.1.4

Is the expectation that if the file(s) given by the user are not found, it should be a warning? It is already caught here:

    } catch (TransformerException te) {
      // should never get here
      handler.addProblem(new ValidationProblem(
          new ProblemDefinition(ExceptionType.FATAL, ProblemType.SCHEMATRON_ERROR, te.getLocalizedMessage()),
          url));        }
  }

Note: if the schematron exist and valid it just never happens.

rgdeen commented 2 months ago

@al-niessner actually the 3.5 example fails as I initially reported. The issue is that the validation for collection_type is in the PDS core sch. So when validate is run with --schematron it loads the given sch's but, since there is no PDS4_PDS_1G00.sch in the --schematron list (or in my case, a mis-named one), it does not notice this and (apparently) loads NO sch for the PDS core. So the invalid label passes... because no rules are loaded saying it's bad. This may be unique to the core DD, I don't know. Perhaps because it's the default namespace it doesn't "notice" there's no sch for it? In any case it should at least be a warning if an expected sch or xsd is just flat out missing.

jordanpadams commented 2 months ago

@al-niessner, @rgdeen is correct. my repeat of this issue with the latest version of validate shows that it is still not doing this properly. Per my example above, if I give it only the IMG schemas/schematrons via command-line, it validates successfully:

validate-3.5.0-SNAPSHOT/bin/validate --target collection_data_seed.xml --schema ~/test/PDS4_IMG_1G00_1810.xsd --schematron ~/test/PDS4_IMG_1G00_1810.sch

But it shouldn't, the output should be the same as:

$ validate-3.5.0-SNAPSHOT/bin/validate --target collection_data_seed.xml

For example this is what should happen (this is not what happens now):

$ validate-3.5.0-SNAPSHOT/bin/validate --target collection_data_seed.xml --schema ~/test/PDS4_IMG_1G00_1810.xsd --schematron ~/test/PDS4_IMG_1G00_1810.sch
...
Product Level Validation Results

  FAIL: file:/Users/jpadams/proj/pds/pdsen/workspace/validate/src/test/resources/github408/valid/data/collection_data_seed.xml
      ERROR  [error.label.schematron]   line 134, 26: The attribute pds:collection_type must be equal to one of the following values 'Browse', 'Calibration', 'Context', 'Data', 'Document', 'Geometry', 'Miscellaneous', 'SPICE Kernel', 'XML Schema'.
        1 product validation(s) completed

Summary:

  1 product(s)
  1 error(s)
  0 warning(s)

  Product Validation Summary:
    0          product(s) passed
    1          product(s) failed
    0          product(s) skipped
    1          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

  Message Types:
    1            error.label.schematron
al-niessner commented 2 months ago

@jordanpadams @rgdeen

Thanks. I misunderstood the bit at the top and thought that it was saying 3.5 was doing it right. So the question is, why is 1G overriding 1B not living next to it. If my memory is correct, it revolves around people trying to override the schema with one of the their own for the next batch of improvements. Let me dig into it now that I understand it better.

jordanpadams commented 2 months ago

@al-niessner correct. We sometimes want to โ€œoverwriteโ€ with newer versions of a schema or schematron

al-niessner commented 2 months ago

@jordanpadams

Okay, it is back to here: https://github.com/NASA-PDS/validate/blob/cbefe2f1e0798df6f41d21c1e964daa547d1b764/src/main/java/gov/nasa/pds/tools/label/LabelValidator.java#L554-L593

The variable useLabelSchematron comes from the command line --force which is marked as deprecated but clearly says that using schema or schematron flags turns off items in label. #599 seemed to fix the schema side. It looks like we are now doing the same for schematron.

If we remove userLabelSchematron, then it will always load what is in the label. If developing a new schematron that is not deployed, then will get a cannot find error. If point at old, it will run and presumably fail for reasons updating schematron. Either way, false negative because failing to CLI design not label, schema, or schematron.

Do we add new CLI flag that says merge online and CLI schematrons with CLI preference? Other than this test case does this make sense for any real world use case? I guess fixing false positives in the schema/schematron would benefit from the merge. Fixing false negatives cannot use the merge because the ones in the label will always generate the false positive until the fix is published.

Can we get rid of CLI flags for schema and schematron and force the label to have the correct references? If the desire is to use a test schema/schematron then go from https://schematron to file:///fixed_schematron. This way validate always uses the label and all confusion in validate is gone?

jordanpadams commented 2 months ago

Do we add new CLI flag that says merge online and CLI schematrons with CLI preference?

I think the default functionality for --schematrons should be to merge. If someone gives me some schematrons via command-line, great, we will use those. Otherwise, go online and download the ones we don't have.

Can we get rid of CLI flags for schema and schematron and force the label to have the correct references?

We can't make this assumption for many reasons, e.g.

  1. this would be a significant change to some existing pipelines. using the online version is ideal, but some pipelines don't care and still read them from a local cache
  2. testing labels off a pipeline where you want to validate, but have not released the necessary dictionaries yet and don't want to futz with the pipeline templates you are using to switch from file:/// to https://, especially after some operational I&T cycles
  3. trying to test upgrading existing labels to the latest IM - we plan on actually doing this for all labels at some point in order to be able to tell a user how products may not be compatible with the current version of the IM.
jordanpadams commented 2 months ago

@al-niessner โ˜๏ธ

rgdeen commented 2 months ago

I think there are two ways to go: merge with local overriding, or just warn for missing ones. Probably merge is easier for end users as long as you pull the right IM major version. Warning is okay though if you don't want to do the merge. Wonder if merge should be an option, with warning if not?

jordanpadams commented 2 months ago

@rgdeen we talked about this some more, and this is actually a huge can of worms trying to unravel this. We should throw some sort of warning when the PDS core schematron is not provided, but there are lots of curveballs that have come with trying to do this check that I think is going to be 1 step forward 2 steps back.

In the end, to ensure your validation is correct, we recommend using what is in the label.

jordanpadams commented 2 months ago

If this isn't critical. I am going to close this as wontfix since there is a known workaround (use what is in the label).

rgdeen commented 2 months ago

Hmmm... well, is it "critical"? Here's the scenario, you decide. The VGR LDD only just got released. Until release happens, you have no choice but to use local dictionaries. I thought I had a complete copy of all the LDDs in my local directory (I did, just misnamed the core ones). So I went all the way through development of the data products thinking validation was passing. I'm ready to hand off to the node (to myself ;-) ) and so I ran validate without the options. Only at that point did it notice the errors from the core. Fortunately, the errors were minor and easily fixed with a Velocity rerun, but I could easily see having some major structural issues that are uncaught by the data provider. Who in this case really couldn't run with the web versions, as the VGR DD was not released until the project was basically done (because we kept updating it with problems).

So it kind of violates the principle of least surprise when running validate, that a simple error like misnaming the LDD (which happened because the Mac insists on adding .xml to everything when downloading and I forgot to tell it not to in this case) means you haven't really validated like you thought you had. Validate is so picky otherwise, should it be that fragile, really?

Here's another interesting tidbit. If I leave out DISP from the dictionary set, I get an error:

FAIL: file:/tmp/V1NA_5566630_RAW.xml ERROR [error.label.schema] line 162, 30: cvc-complex-type.2.4.c: The ma tching wildcard is strict, but no declaration can be found for element 'disp:Dis play_Settings'.

However... if I leave out IMG I get no such error... I think because MARS2020 and MSSS_CAM_MH have an xs:element entry referring to img:SomeClass. The first error could lead people to believe that validate does tell you if an LDD is missing (I sure thought so)... but it's not just core but other cases also where it doesn't.

So how can someone during data product development ever be sure they got all the LDD's downloaded? There's no way to tell.

I'll leave it up to y'all to decide if it's "critical" or not... I really don't know how bad the worms are. But it did have non-trivial effects in this case.

jordanpadams commented 2 months ago

@rgdeen understood. we can try to support including the core in every validation, but the difficulty comes in the corner cases. We could enable mapping based upon filename, but that assumes the filenames are all the same, which has only become consistent in the last few years. Otherwise, reading the schematron and schemas on there own does not give a clear understanding of the namespace they pertain to. The namespace is in there, but so are potentially other namespaces. Which namespace is this one?

e.g.

  <sch:ns uri="http://pds.nasa.gov/pds4/pds/v1" prefix="pds"/>
  <sch:ns uri="http://pds.nasa.gov/pds4/cart/v1" prefix="cart"/>
  <sch:ns uri="http://pds.nasa.gov/pds4/geom/v1" prefix="geom"/>

Which namespace pertains to this one?

Option 1: We could refactor the way --schemas and --schematrons works by requiring a namespace mapping to the files you want to include, e.g.

--schema /path/to/my/schema::http://pds.nasa.gov/pds4/pds/v1
--schematron /path/to/my/schematron::http://pds.nasa.gov/pds4/pds/v1

But this blows up a lot of existing pipelines.

Option 2: Assume the filename prefixes (e.g. PDS4_IMG_) match what are in the schema and schematron, if no match is found, we throw a WARNING that this LDD was not found in a label. If the filename prefixes do not match, not sure how to handle that. We would probably need a new flag for --schema-override and --schematron-override to just ignore the schemas/schematrons in the label and use what is given via command-line.

jordanpadams commented 2 months ago

@rgdeen We are moving this to the icebox for the time being. there are some significant technical challenges in the validate code involved in making this change, and we have bigger fish to fry. We will revisit this at a later time. Apologies for the inconvenience.

rgdeen commented 2 months ago

oh interesting... the mapping of "img:" to "http://pds.nasa.gov/pds4/img/v1" is actually in the product label, not the schema. I guess in principle that means a product label could map a LDD to something non-canonical (e.g. "image:" instead of "img:") but hopefully everyone would reject that. If it were even noticed though, and I guess that's the crux of the problem. We really shouldn't condone LDD's that don't match the naming conventions, so if someone used "image:" for img or called it MY_DICT.xsd that should trigger a warning. Which leads to option 2 above, assume the filenames have the namespace in them because if they don't that's a bad thing.

One thing to consider: this should be just a warning. So if there's a name mismatch or something, it'll do exactly what it does now... continue to validate, just issue an additional warning, which could be ignored. The point is to make sure people don't inadvertently omit a ldd (or core dd), not to "fix" the problem by falling back to another DD.

Anyway, icebox is fine until you figure out a good way to deal with it.

jordanpadams commented 2 months ago

thanks @rgdeen

One thing to consider: this should be just a warning. So if there's a name mismatch or something, it'll do exactly what it does now... continue to validate, just issue an additional warning, which could be ignored. The point is to make sure people don't inadvertently omit a ldd (or core dd), not to "fix" the problem by falling back to another DD.

We discussed this as well. Heading into next release, we will work with the SWG to prioritize issues, and we can discuss some design options for this one if it bubbles up as a priority. In the end, hopefully the issue will eventually be caught when a data provider starts validating with released LDDs. Obviously not ideal when you are trying to test a pipeline, but any issues like you identified with your labels above should eventually get caught.