NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

As a user, I want to check that all Internal References are valid references to other PDS4 products within the current validating bundle #308

Closed mit3ch closed 3 years ago

mit3ch commented 3 years ago

Motivation

...so that I can ensure referential integrity of references within the bundle to other products within the same parent bundle

Additional Details

Need to confirm that every LID or LIDVID referenced in an Internal_Reference class exists.

Concatenate every LID/LIDVID in the Identification_Area from every label in a bundle. Then, for each XML label, verify that each LID/LIDVID outside the Identification_Area is included in the concatenated list. If not, give a warning that there is a missing product. Ideally, also check against all registered LIDs and LIDVIDs.
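The two-pass check described above could be sketched roughly as follows. This is only an illustration, not Richard's script and not how validate implements it. The function names are hypothetical, labels are passed in as a mapping of file name to XML text, and namespaces are ignored by comparing local element names only.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}logical_identifier'."""
    return tag.rsplit("}", 1)[-1]


def declared_ids(root):
    """Return the LID and LIDVID declared in a label's Identification_Area."""
    lid = version = None
    for area in root.iter():
        if local_name(area.tag) != "Identification_Area":
            continue
        for child in area:
            name = local_name(child.tag)
            if name == "logical_identifier":
                lid = (child.text or "").strip()
            elif name == "version_id":
                version = (child.text or "").strip()
    ids = set()
    if lid:
        ids.add(lid)
        if version:
            ids.add(f"{lid}::{version}")
    return ids


def internal_references(root):
    """Return every lid_reference / lidvid_reference inside an Internal_Reference."""
    refs = []
    for elem in root.iter():
        if local_name(elem.tag) != "Internal_Reference":
            continue
        for child in elem:
            if local_name(child.tag) in ("lid_reference", "lidvid_reference"):
                refs.append((child.text or "").strip())
    return refs


def missing_references(labels):
    """labels maps a label name to its XML text (hypothetical input shape).

    Pass 1 collects every declared LID/LIDVID in the bundle; pass 2 reports
    each (label name, reference) pair whose reference was never declared.
    """
    roots = {name: ET.fromstring(text) for name, text in labels.items()}
    known = set()
    for root in roots.values():
        known |= declared_ids(root)
    return [
        (name, ref)
        for name, root in roots.items()
        for ref in internal_references(root)
        if ref not in known
    ]
```

Each pair returned by `missing_references` would correspond to one "missing product" warning. Checking against registered LIDs and LIDVIDs is not covered by this sketch.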

Check with Richard Chen, he has a python script that does this checking.

Acceptance Criteria

Given a product that contains one or more Internal_References to product LID/LIDVIDs within the same parent bundle
When I perform validation of the bundle
Then I expect to validate that all LIDs/LIDVIDs to products within the bundle are valid references

jordanpadams commented 3 years ago

thanks @mit3ch thought we had a ticket for this already, but apparently not. definitely on our radar, but may be a little more complicated than Richard's script because it has to encompass the entire PDS4 archive. may have to wait until the Registries are installed and all PDS4 data is ingested

mit3ch commented 3 years ago

Hi Jordan,

Can we do a phased introduction? For now have it check against LIDs/LIDVIDs in the bundle and give a warning if the referenced product is not in the bundle. Later we can expand the functionality to check against the registry. Richard's tool provides the short term functionality. It identified missing products in otherwise validated bundles. He sent me his code, but that doesn't help anyone else.

Mitch


mit3ch commented 3 years ago

 @mit3ch This is definitely a reasonable request. I will discuss with our engineer leading this dev to get an idea of how much effort he thinks this will be in order to weight that into the prioritization.

mit3ch commented 3 years ago

Hi Jordan,

I was writing a heads up email to let you know about the last minute issue, but GitHub got there first.

I think we should have phased introduction with an intermediate stage now and full capability once the registry is up. For the short term solution, just check against the products in the bundle and give a warning if there isn't a match. Richard's tool provides this functionality; it identified missing products in several bundles which passed validation. He's given me his code, but that doesn't help anyone else. We can add checking against the registry later.

Mitch

msbentley commented 3 years ago

Would such a warning/check fire only when the validation context was set to bundle? (most of the time I am validating product deliveries, so I wouldn't want warnings that referenced products were not found, simply because they were in a separate delivery, or had previously been delivered etc.)

jordanpadams commented 3 years ago

@msbentley great question. i wasn't thinking this would apply only to bundles, but maybe that would make more sense. we can maybe bring this to SWG for more clarification.

qchaupds commented 3 years ago

@jordanpadams Is there a good representative bundle in our test resources? There's no test resources provided for this ticket.

jordanpadams commented 3 years ago

here is some test data. I will send you the path on our servers.

But there are only a few products in there that contain references. Here is a snippet from one of the examples:

pds4-compil-comet-v1.0/pds4-compil-comet-v1.0/polarimetry/data/dbcp.xml

  <Reference_List>
    <Internal_Reference>
      <lidvid_reference>urn:nasa:pds:compil-comet:polarimetry:filters::1.0</lidvid_reference>
      <reference_type>data_to_document</reference_type>
    </Internal_Reference>
  </Reference_List>

You really just need to take any test bundle we have out there, and add a reference similar to this to a LIDVID of another product in the bundle.

If the LIDVID does not exist anywhere in that bundle, we should throw an error.

mit3ch commented 3 years ago

Jordan,

It should throw a warning, not an error. Products in one bundle routinely reference a product in another bundle. We just want to give the provider/user a heads up that there is a reference to a LID not in the bundle. A more complicated test would parse the LID to determine whether it indicates the product is part of the bundle (the first segment after urn:nasa:pds is the bundle base), throwing an error for a failure in that case and a warning otherwise. However, giving a warning in all cases should be sufficient.
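For illustration, the "more complicated test" could look something like the sketch below. The function names are hypothetical. It assumes a LID has the form urn:nasa:pds:<bundle>[:<collection>[:<product>]] with an optional ::version suffix, and it is meant to be called only for references already known to be missing from the bundle.

```python
def bundle_segment(lid_or_lidvid):
    """Return the bundle segment of a LID/LIDVID, or None if it is too short.

    Assumes the fourth colon-separated field (the one right after
    'urn:nasa:pds') names the bundle.
    """
    lid = lid_or_lidvid.split("::", 1)[0]  # drop any '::version' suffix
    parts = lid.split(":")
    if len(parts) < 4:
        return None
    return parts[3]


def severity_for_missing(reference, bundle_lid):
    """Pick a severity for a reference that was not found in the bundle.

    ERROR if the reference claims to live in the bundle being validated,
    WARNING if it points at some other bundle (or cannot be parsed).
    """
    ref_bundle = bundle_segment(reference)
    if ref_bundle is not None and ref_bundle == bundle_segment(bundle_lid):
        return "ERROR"
    return "WARNING"
```

With the simpler policy of warning in all cases, `severity_for_missing` would just return "WARNING" unconditionally.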

Thanks,

Mitch

jordanpadams commented 3 years ago

copy that @mit3ch . we should be able to handle that logic.

@qchaupds I think it shouldn't be too complicated to make this happen. the way I see this is we should do this as part of the referential integrity checking we already do with pds4.bundle validation. We should maintain some sort of object/data structure (or maybe we already have one) that contains any references within a product, and checks those as well. we can talk more about this offline if we want to provide some more clarification here.

qchaupds commented 3 years ago

We have good success so far.

Running validate against a bundle we know has issues.

% validate -R pds4.bundle -r report_github308_bundle_invalid.json -s json -t src/test/resources/github308/invalid/bundle_kaguya_derived.xml >& t2

There are 3 warnings for 2 labels regarding a reference pointing to a non-existent logical identifier.

{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 110 % egrep "label|not found" report_github308_bundle_invalid.json

  "label": "file:/home/qchau/sandbox/validate/src/test/resources/github308/invalid/bundle_kaguya_derived.xml",
      "message": "A LID reference urn:nasa:pds:kaguya_grs_spectra:document:kgrs_calibrated_spectra is referencing a logical identifier for a product not found in this bundle."
      "message": "A LID reference urn:nasa:pds:kaguya_grs_spectra:document:kgrs_ephemerides_doc is referencing a logical identifier for a product not found in this bundle."
  "label": "file:/home/qchau/sandbox/validate/src/test/resources/github308/invalid/data_spectra/kgrs_calibrated_spectra_per1.xml",
      "message": "A LID reference urn:nasa:pds:kaguya_grs_spectra:document:kgrs_calibrated_spectra is referencing a logical identifier for a product not found in this bundle."
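As an alternative to egrep, the same filtering could be sketched in Python. The top-level layout of the JSON report is not shown in this thread, so the hypothetical helper below makes no assumption about it: it walks the whole document and relies only on the "label", "messages" and "message" keys seen in the report excerpts here.

```python
def find_messages(node, needle, label=None):
    """Yield (label, message) pairs for report messages containing `needle`.

    Walks any nesting of dicts and lists, remembering the nearest enclosing
    "label" value so each message can be attributed to its label.
    """
    if isinstance(node, dict):
        label = node.get("label", label)
        for key, value in node.items():
            if key == "messages" and isinstance(value, list):
                for entry in value:
                    if isinstance(entry, dict) and needle in entry.get("message", ""):
                        yield label, entry["message"]
            else:
                yield from find_messages(value, needle, label)
    elif isinstance(node, list):
        for item in node:
            yield from find_messages(item, needle, label)
```

Loading report_github308_bundle_invalid.json with `json.load` and calling `list(find_messages(report, "not found"))` should then give one pair per matching message, grouped naturally by label.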

The reference urn:nasa:pds:kaguya_grs_spectra:document:kgrs_calibrated_spectra does not occur anywhere as a logical_identifier:

{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate/src/test/resources/github308/invalid 119 % grep -rn "urn:nasa:pds:kaguya_grs_spectra:document:kgrs_calibrated_spectra" . | grep logical_identifier

The reference urn:nasa:pds:kaguya_grs_spectra:document:kgrs_ephemerides_doc does not occur anywhere as a logical identifier:

{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate/src/test/resources/github308/invalid 120 % grep -rn "urn:nasa:pds:kaguya_grs_spectra:document:kgrs_ephemerides_doc" . | grep logical_identifier

There is a label src/test/resources/github308/invalid/VALID_odf07155_msgr_11.xml but its logical identifier urn:nasa:pds:mess-rs-raw:data.odf:mess_rs_07155_156_60s_odf does not belong to the "urn:nasa:pds:kaguya_grs_spectra" bundle so the warning is not raised.

{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate/src/test/resources/github308/invalid 124 % grep logical_identifier /home/qchau/sandbox/validate/src/test/resources/github308/invalid/VALID_odf07155_msgr_11.xml

urn:nasa:pds:mess-rs-raw:data.odf:mess_rs_07155_156_60s_odf

However, the label does get a warning for not belonging to any collection, which is expected.

{
  "status": "PASS",
  "label": "file:/home/qchau/sandbox/validate/src/test/resources/github308/invalid/VALID_odf07155_msgr_11.xml",
  "messages": [
    {
      "severity": "WARNING",
      "type": "warning.integrity.unreferenced_member",
      "message": "Identifier 'urn:nasa:pds:mess-rs-raw:data.odf:mess_rs_07155_156_60s_odf::1.0' is not a member of any collection within the given target"
    }

rchenatjpl commented 3 years ago

@qchaupds @jordanpadams val308b.zip

In the attached, validate should catch that the browse product's reference to a LID in this bundle doesn't exist. Eventually and maybe ideally, validate should catch that the data product's reference to a LID outside this bundle doesn't exist. Search for "xxx" in the .xml files. Validate now catches neither, though it does erroneously catch something related to validate#69

jordanpadams commented 3 years ago

thanks @rchenatjpl . I created a new ticket for the bug you found here: https://github.com/NASA-PDS/validate/issues/432

per your comment about catching LIDs outside this bundle, that is in our plans for next build once we have the data ingested into the registry