NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

Validate appears to not perform integrity checks for bundles/pass incorrect <Internal_Reference>s for collections #959

Open mace-space opened 4 months ago

mace-space commented 4 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

Related to #432 (was asked to open another ticket)

I ran validate using --rule pds4.bundle but no referential checks were performed (even though with that option it should check references):

 Summary:

   31739 product(s)
   100000 error(s)
   42256 warning(s)

   Product Validation Summary:
     30664      product(s) passed
     1075       product(s) failed
     0          product(s) skipped
     31739      product(s) total

   Referential Integrity Check Summary:
     0          check(s) passed
     0          check(s) failed
     0          check(s) skipped
     0          check(s) total

(Note the max error threshold has been exceeded).
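A run in which the referential integrity stage never executed can be told apart from one where every check passed by reading the "check(s) total" count in the report. The snippet below is a minimal sketch, assuming the report text follows the same layout as the summary quoted above; `referential_check_total` is a hypothetical helper name and is not part of Validate.

```python
import re


def referential_check_total(report_text):
    """Return the 'check(s) total' count from the Referential Integrity
    Check Summary of a report, or None if that summary is absent.

    Assumption: the report uses the plain-text layout quoted in this issue.
    """
    parts = report_text.split("Referential Integrity Check Summary:", 1)
    if len(parts) < 2:
        return None
    # Only search the text after the summary heading, so the product
    # totals earlier in the report cannot match by accident.
    match = re.search(r"(\d+)\s+check\(s\) total", parts[1])
    return int(match.group(1)) if match else None


def referential_checks_were_skipped(report_text):
    """True when the summary exists but reports zero checks in total."""
    return referential_check_total(report_text) == 0
```

Applied to the bundle report in this issue, the second function would flag the run, since the summary is present but counts zero checks.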

I also tried running it on the specific collection where I had spotted LID errors:

% validate --rule pds4.collection --report-file rav1ciun_validate_browse_collection.log --verbose 2 --target ./wenkert_pdart16_vgr_rav1ciun/browse

Here's an example browse label from that collection:

   1 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
   2
   3 <?xml-model href="https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.sch"
   4     schematypens="http://purl.oclc.org/dsdl/schematron"?>
   5
   6 <Product_Browse xmlns="http://pds.nasa.gov/pds4/pds/v1"
   7     xmlns:pds="http://pds.nasa.gov/pds4/pds/v1"
   8     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   9     xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1 https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1G00.xsd">
  10   <Identification_Area>
  11     <logical_identifier>urn:nasa:pds:wenkert_pdart16_vgr_rav1ciun:browse_qedr:vgr_1201-mamqtv-001010-data-001010.001.png</logical_identifier>
  12     <version_id>1.0</version_id>
  13     <title>RAV1CIUN DATA Browse Product - vgr_1201-mamqtv-001010-data-001010.001.png</title>
  14     <information_model_version>1.16.0.0</information_model_version>
  15     <product_class>Product_Browse</product_class>
  16   </Identification_Area>
  17   <Reference_List>
  18     <Internal_Reference>
  19       <lid_reference>urn:nasa:pds:wenkert_pdart16_vgr_rav1ciun:browse_qedr:vgr_1201-mamqtv-001010-data-001010.001</lid_reference>
  20       <reference_type>browse_to_data</reference_type>
  21       <comment>This is a reference to the full resolution data file corresponding to this browse image.</comment>
  22     </Internal_Reference>
  23   </Reference_List>
  24   <File_Area_Browse>
  25     <File>
  26       <file_name>VGR_1201-MAMQTV-001010-DATA-001010.001.png</file_name>
  27       <local_identifier>BROWSE_FILE</local_identifier>
  28       <creation_date_time>2023-08-18</creation_date_time>
  29     </File>
  30     <Encoded_Image>
  31       <local_identifier>BROWSE_IMAGE</local_identifier>
  32       <offset unit="byte">0</offset>
  33       <encoding_standard_id>PNG</encoding_standard_id>
  34     </Encoded_Image>
  35   </File_Area_Browse>
  36 </Product_Browse>

Line 19 points to an incorrect LID, but Validate does not report any of these:

   Referential Integrity Check Summary:
     30582      check(s) passed
     1          check(s) failed
     0          check(s) skipped
     30583      check(s) total

It passed all of the browse labels (the one failure refers to a .DS_Store file).

So, unlike the -R pds4.bundle option, the -R pds4.collection option does report referential integrity checks. However, it is not catching incorrect LIDs.

The LID urn:nasa:pds:wenkert_pdart16_vgr_rav1ciun:browse_qedr:vgr_1201-mamqtv-001010-data-001010.001 does not exist (the browse LIDs have .png suffixes). In any case, the label should not be referencing its own browse_qedr collection at all, but rather the data_qedr collection.

🕵️ Expected behavior

Validate should flag an error for nonexistent LIDs.

📜 To Reproduce

1. % validate --rule pds4.bundle --report-file rav1ciun_validate_v3.5.1.log --verbose 2 --target ./wenkert_pdart16_vgr_rav1ciun
2. % validate --rule pds4.collection --report-file rav1ciun_browse_validate_v3.5.1.log --verbose 2 --target ./wenkert_pdart16_vgr_rav1ciun/browse

🖥 Environment Info

- Validate v3.5.1
- MacOS 10.15.7
- Java 11.0.15:

   % java --version
   openjdk 11.0.15 2022-04-19
   OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
   OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)

📚 Version of Software Used

Validate v3.5.1

🩺 Test Data / Additional context

The bundle tar.gz is too large to attach here. Shall I share it via Dropbox, or would you need just a sample?

Bundle validate log: https://github.com/user-attachments/files/16168478/rav1ciun_validate_v3.5.1_log.zip
rav1ciun_validate_v3.5.1_browse_collection.log: https://github.com/user-attachments/files/16168481/rav1ciun_validate_v3.5.1_browse_collection.log

🦄 Related requirements

No response

⚙️ Engineering Details

No response

🎉 Integration & Test

No response
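
The expected behavior described in this report (flagging any lid_reference that matches no logical_identifier in the target) can be sketched outside of Validate as follows. This is only an illustration under stated assumptions, not how Validate implements its referential integrity checks: `find_dangling_lid_references` is a hypothetical helper, it considers only `lid_reference` elements inside `Internal_Reference` (not `lidvid_reference`), and it treats a reference as valid only if some label in the supplied set declares exactly that LID.

```python
import xml.etree.ElementTree as ET

# Namespace of the PDS4 core dictionary, as used in the label quoted in this issue.
PDS_NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}


def find_dangling_lid_references(labels):
    """Report lid_reference values that no supplied label declares.

    labels: dict mapping a file name to the XML text of a PDS4 label.
    Returns a sorted list of (file name, referenced LID) pairs.
    Hypothetical helper for illustration; not part of Validate.
    """
    roots = {name: ET.fromstring(text) for name, text in labels.items()}

    # First pass: collect every LID declared in an Identification_Area.
    known_lids = set()
    for root in roots.values():
        lid = root.findtext(
            "pds:Identification_Area/pds:logical_identifier",
            namespaces=PDS_NS,
        )
        if lid:
            known_lids.add(lid.strip())

    # Second pass: every Internal_Reference must point at a declared LID.
    dangling = []
    for name, root in roots.items():
        refs = root.iterfind(".//pds:Internal_Reference/pds:lid_reference", PDS_NS)
        for ref in refs:
            target = (ref.text or "").strip()
            if target not in known_lids:
                dangling.append((name, target))
    return sorted(dangling)
```

Given the browse label quoted above together with the other labels in the bundle, a check of this kind would report the reference on line 19, because no product declares the LID without the .png suffix.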

jordanpadams commented 4 months ago

@mace-space is this data available online somewhere?

jordanpadams commented 4 months ago

Also, I took a look at the log files, and at first glance it looks like we probably didn't catch some of these errors for a few reasons:

1. As soon as validate encounters an issue with reading the schema, schematron, or label, it stops reading the file and fails. Our ability to read the data relies on the schemas and schematrons to work.
2. One of the collections failed to load, so I think it may have just crashed trying to figure out what to do after that. But this is odd. I wonder if there is some sort of LID mismatch between what is in the bundle.xml vs. what is in the collection labels?

jordanpadams commented 4 months ago

If sending the whole data set is not reasonable, even a small subset of the data would help, with:

- bundle label
- collection labels/inventories
- the label that should be failing
- the label the failing file is referencing

mace-space commented 4 months ago

Thanks for looking into this @jordanpadams. Here's a subset of data: https://github.com/user-attachments/files/16191417/wenkert_pdart16_vgr_rav1ciun_partial.zip

jordanpadams commented 3 months ago

@mace-space it does not look like the ZIP file fully uploaded prior to submitting your comment. Would you mind trying to upload again?

mace-space commented 3 months ago

Sorry for the delay, I've been on vacation. Here's the subset of the bundle: https://github.com/user-attachments/files/16417271/wenkert_pdart16_vgr_rav1ciun_partial.zip