NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

The PDF verification / VeraPDF component of Validate seems to error on Windows paths #1008

Closed eischaefer closed 1 month ago

eischaefer commented 1 month ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

Validation of the attached PDF (see To Reproduce) with Validate 3.5.2 gives the error:

ERROR  [error.validation.internal_error]   Error occurred while processing PDF file content for example.pdf:  Unable to read with VeraPDF standard reader. Illegal char <:> at index 2: /C:/path/to/\example.pdf

The PDF is indeed not PDS4-compatible, but the online demo of VeraPDF reports a very different set of issues, none of which mention an illegal character or an inability to read the content. The character ":" is also plausibly at "index 2" in the path (counting from zero), which suggests to me that parsing the path itself is the root cause.

I'm not sure whether the invalid /\ before the filename or the likewise invalid (albeit POSIX-like) leading / is relevant, but both are absent from the 3.2.0 output, which gives the expected error:

ERROR  [error.pdf.file.not_pdfa_compliant]   Validation failed for flavour PDF/A-1b.  Detailed error output can be found at C:\path\to\example.pdf.1b.error.csv
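The wording of the 3.5.2 error matches the message format of java.nio.file.InvalidPathException, which supports the guess that the path string is being rejected before any PDF content is read. The sketch below only rebuilds that message from the path in the report. It does not call Validate or VeraPDF, the class name IllegalCharMessageSketch is made up for illustration, and the assumption that the real error originates in java.nio path parsing on Windows is unconfirmed.

```java
import java.nio.file.InvalidPathException;

// Hypothetical illustration only; not code from Validate or VeraPDF.
class IllegalCharMessageSketch {

    // Builds the message InvalidPathException would carry if the first ':'
    // in the input were reported as the illegal character.
    static String message(String input) {
        int index = input.indexOf(':'); // zero-based position of the colon
        return new InvalidPathException(input, "Illegal char <:>", index).getMessage();
    }

    public static void main(String[] args) {
        // The path exactly as it appears in the reported error.
        System.out.println(message("/C:/path/to/\\example.pdf"));
    }
}
```

With zero-based counting the colon in /C:/path/to/\example.pdf sits at index 2, so the rebuilt text lines up with the tail of the reported error.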

🕵️ Expected behavior

I expected Validate to read the PDF content and report issues similar to the online demo VeraPDF.

📜 To Reproduce

  1. Download example.pdf and example.xml from here. Note: Download link updated in edit.
  2. Run validate --target C:\path\to\example.xml with Validate 3.5.2. Note: Typo corrected in edit. (Original target was erroneously .pdf, not .xml.)

🖥 Environment Info

📚 Version of Software Used

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

jordanpadams commented 1 month ago

@eischaefer we will add this to the list

al-niessner commented 1 month ago

@eischaefer @jordanpadams

It will take me a while (days) to stand up a Windows platform again. However, I can immediately see what is wrong, and it can be fixed on the command line (I hope). Instead of:

validate --target C:\path\to\example.pdf

use

validate --target file:///C:/path/to/example.pdf (yes 3 slashes after file:)

I know the documentation does not say this, but it is written for *nix, where a path correctly becomes file:///path/to/example.pdf. The Java libraries do not handle Windows paths as well. Beyond updating the documentation for Windows, I am not sure we can make any substantial changes to Validate to make it work reliably.

I will continue standing up a Windows platform in case the suggestion does not work.
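For readers unsure about the three slashes, here is a minimal sketch of the shape of the conversion described above. It is plain string handling written for illustration, not how Validate or the Java libraries perform it. The class and method names are hypothetical, and it assumes an absolute path with no characters that would need percent-encoding (spaces, for example).

```java
// Hypothetical helper for illustration; assumes an absolute path without
// characters that need percent-encoding.
class FileUriSketch {

    static String toFileUri(String path) {
        // Normalise Windows separators to the forward slashes a URI expects.
        String forward = path.replace('\\', '/');
        // A *nix path already starts with '/', so "file://" plus the path gives
        // three slashes. A Windows drive path needs the extra '/' added.
        return forward.startsWith("/") ? "file://" + forward : "file:///" + forward;
    }

    public static void main(String[] args) {
        System.out.println(toFileUri("C:\\path\\to\\example.pdf"));
        System.out.println(toFileUri("/path/to/example.pdf"));
    }
}
```

Both forms end up with an empty authority between the second and third slash, which is why the Windows drive letter appears only after file:///.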

eischaefer commented 1 month ago

@al-niessner , thank you very much for your effort!

I apologize, but I gave the wrong command in the first comment. The correct command is: validate --target C:\path\to\example.xml where target points to the .xml, not the .pdf.

Styling that with a URI, as you suggested: validate --target file:///C:/path/to/example.xml gives the same error as before.

Note that that error references the .pdf, not the .xml, so it seems to me that:

  1. Whether a URI or regular Windows path is passed for target, Validate correctly interprets that path.
  2. However, when Validate combines <file_name> from the .xml's content with the passed target to resolve the complete path to the .pdf, the result is invariably of the form /C:/path/to/\example.pdf, which VeraPDF does not understand.
    • Note that the final interpreted form (mostly) uses /'s and starts with a / even when a regular Windows path (C:\path\to\example.xml) is passed, so some URI-like conversion must be occurring internally. Thankfully, that suggests that passing a regular path (on Windows or *nix) might be OK as long as the internal path conversion is fixed.
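To make the hypothesis above concrete, the sketch below shows one way a string of the form /C:/path/to/\example.pdf could arise, and one way to avoid it. This is a guess at the mechanism, not Validate's actual code. The class and method names are hypothetical, and URI.create is assumed to receive names without spaces or other characters that need escaping.

```java
import java.net.URI;

// Hypothetical reconstruction for illustration; not taken from Validate.
class SiblingPathSketch {

    // Joins the URL-style parent path of the label with the file name using a
    // platform separator. With '\\' this reproduces the mixed-slash form.
    static String naiveJoin(String targetUri, String fileName, char separator) {
        String path = URI.create(targetUri).getPath();                // "/C:/path/to/example.xml"
        String parent = path.substring(0, path.lastIndexOf('/') + 1); // "/C:/path/to/"
        return parent + separator + fileName;
    }

    // Resolves the file name against the label's URI instead, so only
    // forward slashes are ever involved.
    static String resolveSibling(String targetUri, String fileName) {
        return URI.create(targetUri).resolve(fileName).getPath();
    }

    public static void main(String[] args) {
        String target = "file:///C:/path/to/example.xml";
        System.out.println(naiveJoin(target, "example.pdf", '\\'));
        System.out.println(resolveSibling(target, "example.pdf"));
    }
}
```

Even the resolved form keeps the leading / in front of the drive letter, so a real fix would presumably still need to turn the URI into a platform path (for example via java.nio.file.Paths.get(URI) on Windows) before handing it to a reader that expects a local file name.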

Incidentally, note that this exact same command (without resorting to a URI) works flawlessly in Validate 3.2.0 (as I noted in my original post), in case that's of help when debugging.

al-niessner commented 1 month ago

@eischaefer

Thanks for the update; it makes more sense now. I am almost done setting up my Windows platform and will debug it.

That this problem exists is not a surprise. We had several other URI/URL problems on Windows that required more pedantic handling of URLs, and it is likely that one or more code paths were missed during those updates, since I work with limited sets of test data at a time. I cannot download your data to debug this issue, and I will need both the XML and PDF in question. The link at the top of the ticket tries to open a validate issue called example.pdf.

eischaefer commented 1 month ago

@al-niessner , I have updated To Reproduce in the first comment with correct instructions. Please let me know if you need anything else from me.

al-niessner commented 1 month ago

Thanks. I have them both.

al-niessner commented 1 month ago

When run on linux get this as output (base line of expectation for windows platform):

PDS Validate Tool Report

Configuration:
   Version     3.6.0-SNAPSHOT
   Date        2024-10-09T21:13:27Z

Parameters:
   Targets                      [file:/home/niessner/Projects/PDS/validate/src/test/resources/github1008/example.xml]
   Severity Level               WARNING
   Recurse Directories          true
   File Filters Used            [*.xml, *.XML]
   Data Content Validation      on
   Product Level Validation     on
   Max Errors                   100000
   Registered Contexts File     /home/niessner/Projects/PDS/validate/target/classes/util/registered_context_products.json

Product Level Validation Results

  FAIL: file:/home/niessner/Projects/PDS/validate/src/test/resources/github1008/example.xml
      ERROR  [error.pdf.file.not_pdfa_compliant]   Validation failed for flavour PDF/A-1b in file example.pdf.
        1 product validation(s) completed

Summary:

  1 product(s)
  1 error(s)
  0 warning(s)

  Product Validation Summary:
    0          product(s) passed
    1          product(s) failed
    0          product(s) skipped
    1          product(s) total

  Referential Integrity Check Summary:
    0          check(s) passed
    0          check(s) failed
    0          check(s) skipped
    0          check(s) total

  Message Types:
    1            error.pdf.file.not_pdfa_compliant

End of Report
Completed execution in 11332 ms

The report is not exactly detailed as to why, but it fails because the PDF is not PDF/A-1b compliant rather than because of an internal error. Moving to the Windows platform for more testing. If the full details of the non-compliance are desired, use --pdf-error-dir.

eischaefer commented 1 month ago

When run on linux get this as output (base line of expectation for windows platform):

Yep! As stated in my original comment, this is what I would hope to see on Windows for this file and exactly what is reported on Windows for this file with Validate 3.2.

Incidentally, my actual command is much more complicated than the example provided (and includes --pdf-error-dir, etc.), but I intentionally provided a minimal reproducible example.

Thanks again for your help!