NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0
16 stars 11 forks source link

validate does not flag *.tab files (fixed width character tables) used for collection inventory tables (delimited tables) #390

Open radiosci opened 3 years ago

radiosci commented 3 years ago

🐛 Describe the bug

A Collection product with an Inventory file (subclass of Table_Delimited) has one 63-byte record and two 64-byte records. It is given a file name ending with *.tab (reserved for tables with fixed length records). Validate passes the product.

📜 To Reproduce

Steps to reproduce the behavior:

  1. Go to folder VALIDATE-TAB in the attached validate-tab.tar.gz file
  2. Note that collection_calib_freq.tab has one 63 byte record, two 64-byte records
  3. Type "validate collection_calib_freq.xml"
  4. Output will show that validate has passed the product

🕵️ Expected behavior

Validate should report that *.tab files must have fixed-length fields

📚 Version of Software Used

validate 2.0.7

🩺 Test Data / Additional context

VALIDATE-TAB has the bad file that passed. VALIDATE-CSV has a CSV version that also passed. Both folders have copies of their respective report files.

validate-tab.tar.gz validate-csv.tar.gz

🏞Screenshots

None. See report files in respective folders.

🖥 System Info


🦄 Related requirements

⚙️ Engineering Details

msbentley commented 2 years ago

Out of interest, could someone point me to the docs that only allow Table_Character in a .tab file? I would certainly consider it best practise, but I don't recall seeing this documented. Or are we only talking about collection inventories here? Just to be sure that the fix doesn't affect existing product. Thanks!

jordanpadams commented 2 years ago

@msbentley agreed. unless I am missing something filenames are all recommendations in the PDS (besides for bundles / collections).

jordanpadams commented 2 years ago

@radiosci is there another error you are pushing for there? I don't think a file suffix is required for bundles/collection inventories?

radiosci commented 2 years ago

SR 6C.1.6 reserves *.tab for fixed-width tabular data. Inventory generally has variable width data in field 2. I am not saying that Inventory must have a specific file name extension. The example I provided has variable width data in field 2, a violation of 6C.1.6; but Validate did not report the error.

msbentley commented 2 years ago

Great, thanks @radiosci I hadn't paid enough attention to this part of the SR!

rchenatjpl commented 2 years ago

@jordanpadams Hi, where's the version of validate that I should test? I downloaded from https://github.com/NASA-PDS/validate/releases/tag/v2.2.1-SNAPSHOT and it's passing both of Dick's .xml files. % validate-2.2.1-SNAPSHOT/bin/validate -V gov.nasa.pds:validate Version 2.2.1-SNAPSHOT Release Date: 2022-04-07 16:27:49 % % % validate-2.2.1-SNAPSHOT/bin/validate -t VALIDATE-CSV/collection_calib_freq.xml

PDS Validate Tool Report
Configuration:
   Version                       2.2.1-SNAPSHOT
   Date                          2022-05-16T18:40:13Z
Parameters:
   Targets                       [file:/Users/rchen/Desktop/VALIDATE-CSV/collection_calib_freq.xml]
   Severity Level                WARNING
   Recurse Directories           true
   File Filters Used             [*.xml, *.XML]
   Data Content Validation       on
   Product Level Validation      on
   Max Errors                    100000
   Registered Contexts File      /Users/rchen/Desktop/validate-2.2.1-SNAPSHOT/resources/registered_context_products.json
Product Level Validation Results
  PASS: file:/Users/rchen/Desktop/VALIDATE-CSV/collection_calib_freq.xml
        1 product validation(s) completed
Summary:
  0 error(s)
  0 warning(s)
  Product Validation Summary:
    1          product(s) passed
    0          product(s) failed
    0          product(s) skipped
  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
End of Report
Completed execution in 3142 ms

% % % validate-2.2.1-SNAPSHOT/bin/validate -t VALIDATE-TAB/collection_calib_freq.xml

PDS Validate Tool Report
Configuration:
   Version                       2.2.1-SNAPSHOT
   Date                          2022-05-16T18:41:22Z
Parameters:
   Targets                       [file:/Users/rchen/Desktop/VALIDATE-TAB/collection_calib_freq.xml]
   Severity Level                WARNING
   Recurse Directories           true
   File Filters Used             [*.xml, *.XML]
   Data Content Validation       on
   Product Level Validation      on
   Max Errors                    100000
   Registered Contexts File      /Users/rchen/Desktop/validate-2.2.1-SNAPSHOT/resources/registered_context_products.json
Product Level Validation Results
  PASS: file:/Users/rchen/Desktop/VALIDATE-TAB/collection_calib_freq.xml
        1 product validation(s) completed
Summary:
  0 error(s)
  0 warning(s)
  Product Validation Summary:
    1          product(s) passed
    0          product(s) failed
    0          product(s) skipped
  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
End of Report

% % % diff -r VALI*

Only in VALIDATE-CSV: collection_calib_freq.csv
Only in VALIDATE-TAB: collection_calib_freq.tab
diff -r VALIDATE-CSV/collection_calib_freq.xml VALIDATE-TAB/collection_calib_freq.xml
49c49
<             <file_name>collection_calib_freq.csv</file_name>
---
>             <file_name>collection_calib_freq.tab</file_name>
Only in VALIDATE-TAB: report-tab.txt
Only in VALIDATE-CSV: report_csv.txt
viviant100 commented 2 years ago

@rchenatjpl Yes, v2.2.1-SNAPSHOT is the one to for B12.1.

jordanpadams commented 2 years ago

@rchenatjpl see https://github.com/NASA-PDS/validate/issues/491#issuecomment-1128036206

jpl-jengelke commented 2 years ago

@rchenatjpl Yes, v2.2.1-SNAPSHOT is the one to for B12.1.

Actually, testing was against 2.2.0

jordanpadams commented 2 years ago

@rchenatjpl @jpl-jengelke I finally dug into this ticket some more and see the confusion here. per @radiosci above, SR 6C.1.6 states that tab is a reserved file name extension for fixed-width tabular data file. validate should be raising an error for that because a collection is a delimited table, not a fixed width character table.

@radiosci I would like for this to be brought to the DDWG for clarification. for both the csv and tab reserved file extension, these should be more explicitly called out in the SR per Table_Character (and its extensions) Table_Delimited (and its extensions like Inventory tables). I am concerned about implementing a check for this when a lot of these file naming constraints have turned out to be recommendations in the SR, versus explicit requirements for validate.

viviant100 commented 2 years ago

Actually, testing was against 2.2.0

The latest test report (build12.1testProcs_v5.docx) @gxtchen sent out documents using Validate v2.2.1-SNAPSHOT for testing (Section 1.2 Scope). Please update the test report if that's not the case.

jpl-jengelke commented 2 years ago

Perhaps if this is revisited, the error should be something like, "Collections require character delimited data," or "Collections do not support fixed-width data in *.tab files". Nonetheless, the test data did not support the ticket, and the bug could not be verified in non-collection data.

radiosci commented 2 years ago

Jordan Padams: I don't understand what needs to be clarified by DDWG. There its no explicit requirement that Inventory use the .csv file name extension (SR 9C); but it must have delimited records. It is possible for a delimited table also to have fixed-width fields, in which case, .tab would be allowed. The example submitted was an Inventory with a .tab file name extension and variable length fields/records. Since .tab is reserved for "fixed-width tabular data" (SR Table 6C-1), it cannot have variable length fields. Validate appears to be ignoring the Table 6C-1 requirement once the product is determined to be Inventory. There are some additional requirements that could be imposed such as (1) Inventory file name extension must be .csv, and (2) .tab file must have fixed-length fields (including any record suffix bytes, which would be counted as the final field whether specified or not). Neither seems necessary in order to meet the Table 6C-1 requirement. Dick Simpson/radiosci

jordanpadams commented 2 years ago

@radiosci totally agree to what SR Table 6C-1 says, however, the Inventory definition in the label itself says nothing about the fixed width nature of the table. All other fixed width tables in PDS4 require an explicit record length be defined, and Inventory tables (an extension of Table_Delimited) only allows a max record length be defined. We cannot assume fixed width without explicit record length given.

so by it's current definition, I would claim an Inventory table cannot be a fixed width table, which then would mean Inventory tables cannot be *.tab. This is where I think we need input from the DDWG, because I don't think many people in the PDS would necessarily agree with that file naming restriction, especially when it comes to collection products.

radiosci commented 2 years ago

the Inventory definition in the label itself says nothing about the fixed width nature of the table. Correct; Inventory is (in this version of the IM) a subclass of Table_Delimited, a Parsable_Byte_Stream. Table_Base requirements cannot be applied. I would claim an Inventory table cannot be a fixed width table, which then would mean Inventory tables cannot be .tab.
I disagree; the
.tab file name extension says only that the file has fixed-width tabular structure. *.tab is not reserved for Table_Base. It is possible for a file to be both Table_Delimited and fixed-width tabular.

jordanpadams commented 2 years ago

@radiosci copy. well due to this disagreement, clarification will be needed through the DDWG and CCB to enforce this check in validate.

radiosci commented 2 years ago

Submitted CCB-347 to Jira requesting DDWG clarification.

jstone-psi commented 2 years ago

I found while reviewing this for CCB that this is also covered in 2B.2.2.2 of the SR:

2B.2.2.2 Subdirectories

Beneath the root directory in this example structure, the top-level subdirectories have a one-toone correspondence with the bundle’s collections. Each collection subdirectory has a collection label with a name of the form “collection[*] and having an extension of either “.xml” or “.lblx”. and a collection inventory table with a name of the form “collection[].csv”. (Sometimes “collection[_]_inventory.csv” is used.) See Section 9C.1 for details on construction of collection inventory tables.

The naming requirements here are even more stringent than what @radiosci requested. But it's in a location that nobody expects.

radiosci commented 2 years ago

SR 2B.2 (and its subsections) apply only in certain cases and are "suggested".