archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

Problem: Ghostscript command generates pdf/a files which fail veraPDF validation #1220

Open aspence23 opened 4 years ago

aspence23 commented 4 years ago

Expected behaviour: PDF/A files generated by the Ghostscript command during the normalisation microservice pass validation checks in veraPDF.

Current behaviour: A number of PDF/A files generated by the Ghostscript command in Archivematica fail validation in veraPDF and are not considered compliant with the PDF/A specification. Colour space issues are the main problem.

Steps to reproduce: Ingest a PDF of version 1.4, 1.5 or 1.6. Choose to normalise it for preservation. Verify the normalised version of the file generated by Archivematica using the veraPDF tool.
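The verification step could be scripted roughly as follows. This is a minimal sketch, not part of Archivematica: it assumes the `verapdf` command-line tool is on the PATH, that its `--flavour` and `--format mrr` options are available in the installed version, and that the report marks each result with a `validationReport` element carrying an `isCompliant` attribute. The function names are hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET


def is_compliant(report_xml: str) -> bool:
    """Return True only if the report contains at least one validationReport
    element and every one of them has isCompliant="true"."""
    try:
        root = ET.fromstring(report_xml)
    except ET.ParseError:
        # Empty or malformed output is treated as a failed validation.
        return False
    reports = [el for el in root.iter() if el.tag.endswith("validationReport")]
    return bool(reports) and all(el.get("isCompliant") == "true" for el in reports)


def validate_with_verapdf(pdf_path: str, flavour: str = "1b") -> bool:
    """Run veraPDF on one file (assumed CLI options) and interpret its report."""
    result = subprocess.run(
        ["verapdf", "--flavour", flavour, "--format", "mrr", pdf_path],
        capture_output=True,
        text=True,
        check=False,  # do not raise if veraPDF exits non-zero
    )
    return is_compliant(result.stdout)
```

Calling `validate_with_verapdf("normalised.pdf")` on a preservation derivative would then return `False` for the files described in this issue; the file name here is only a placeholder.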

Your environment (version of Archivematica, operating system, other relevant details): Version 1.9.2


sromkey commented 4 years ago

Some context from the user forum: https://groups.google.com/d/msg/archivematica/AQv5Ltj0_2s/f19eZou9AwAJ

sromkey commented 4 years ago

Another (larger) issue is PDF/A validation within Archivematica. Using veraPDF would be a positive step (possibly facilitated by MediaConch?).

gnosisgithub commented 4 years ago

Questions: Does this affect all PDFs, or only ones with colour? So if a PDF does not designate a colour space (RGB or CMYK), then the PDF/A does not conform?

fitnycdigitalinitiatives commented 3 years ago

Just came across this issue because we were noticing some problems with the PDF/A files normalized by Archivematica and did some reading up on it. To summarize, if I have this correctly: Ghostscript out of the box doesn't have a valid ICC profile for creating PDF/A files and requires that a valid one be added manually. I'm not sure how this could be integrated into Archivematica, but I found this script that takes care of that specific Ghostscript shortcoming (adding the ICC profile) and can easily be used to create PDF/A files that are more likely to be valid. I've just used it to manually normalize a whole batch of PDFs.
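For illustration, a Ghostscript invocation that supplies an ICC profile might be assembled like this. It is a sketch under assumptions, not the script mentioned above and not Archivematica's FPR command: the paths are placeholders, the `PDFA_def.ps` file is assumed to be a copy of the one shipped with Ghostscript, edited so that its `/ICCProfile` entry points at the chosen profile, and `build_pdfa_command` is a hypothetical helper.

```python
from typing import List


def build_pdfa_command(
    input_pdf: str,
    output_pdf: str,
    pdfa_def_ps: str,
    icc_profile: str,
    permit_read: bool = True,
) -> List[str]:
    """Build an argument list for a PDF/A-1 conversion with pdfwrite.

    pdfa_def_ps is assumed to reference icc_profile internally; the profile
    path is passed here only to allow Ghostscript to read it under SAFER.
    """
    cmd = [
        "gs",
        "-dPDFA=1",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",
        "-dPDFACompatibilityPolicy=1",
    ]
    if permit_read:
        # Only understood by newer Ghostscript releases; omit for older ones.
        cmd.append(f"--permit-file-read={icc_profile}")
    # The definition file must come before the input PDF.
    cmd += [f"-sOutputFile={output_pdf}", pdfa_def_ps, input_pdf]
    return cmd
```

The resulting list can be passed to `subprocess.run`. Whether the output actually validates still depends on the input file, so a veraPDF check afterwards remains necessary.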

shij13 commented 8 months ago

Noting a variation of this issue that we've come across. In our case, a PDF has a missing glyph in the OCR font, and Ghostscript normalizes it to a PDF 1.7 file that is not PDF/A compliant.

Our concern is that the normalization report returns "Preservation normalization failed: No" when "transcoding to pdfa" has in fact failed.

Expected behaviour: The normalization report returns "Preservation normalization failed: Yes" when transcoding to PDF/A with the Ghostscript command does not generate a PDF/A-compliant copy.

Current behaviour: Using the default command, the normalization report returns "Preservation normalization failed" as "No" even though the task output contains the following stderr:

GPL Ghostscript 9.25: Missing glyph CID=0, glyph=007e in the font HiddenHorzOCR . The output PDF may fail with some viewers.
GPL Ghostscript 9.25: All used glyphs mst be present in fonts for PDF/A, reverting to normal PDF output.

A preservation derivative is generated but it is a PDF 1.7 that fails veraPDF validation.
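One way a wrapper could catch this case is to scan stderr for the revert message in addition to checking the exit code. The sketch below is hypothetical and not how Archivematica currently evaluates the command; it relies only on the message text quoted above, which may differ between Ghostscript versions.

```python
import re

# Matches the warning Ghostscript prints when it gives up on PDF/A output.
REVERT_PATTERN = re.compile(r"reverting to normal PDF output", re.IGNORECASE)


def pdfa_conversion_reverted(stderr: str) -> bool:
    """Return True if stderr shows Ghostscript fell back to plain PDF."""
    return bool(REVERT_PATTERN.search(stderr))


def normalization_failed(exit_code: int, stderr: str) -> bool:
    """Treat a non-zero exit code or a PDF/A revert as a failed normalization."""
    return exit_code != 0 or pdfa_conversion_reverted(stderr)
```

An alternative worth testing is Ghostscript's `-dPDFACompatibilityPolicy=2`, which is documented to abort with an error instead of producing a non-compliant file, so that the failure shows up in the exit code.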

It would be great if:

karinbredenberg commented 3 months ago

We are also testing Ghostscript and validating the resulting PDFs with veraPDF, with the same result as already described. What we can see, though, is that changes are being implemented in Ghostscript: https://ghostscript.readthedocs.io/en/latest/Readme.html. Will these lead to changes in Archivematica's use of Ghostscript?

sromkey commented 3 months ago

We're doing some FPR work in 1.17, so maybe this is a good moment to update Ghostscript. @replaceafill, @Dhwaniartefact, do you have any thoughts?

sromkey commented 3 months ago

From further conversations with the devs, I now understand that upgrading Ghostscript isn't really the issue: we don't package specific Ghostscript versions for releases; by default we use whichever version is in the operating system.

We would, however, like to experiment with veraPDF, with an eye to possibly using it within Archivematica for PDF/A validation, which seems like a useful outcome. @karinbredenberg, do you have a sample file that we could use? (You could email it to me.) We can keep it for internal testing only if it's not something that can be made public.

karinbredenberg commented 3 months ago

It was good to learn that it's the Ghostscript that is in the operating system. That might mean that the command call needs to be updated depending on the version in the operating system, and that the ICC profiles need to be added or updated and also handled through the command. Something to look at some more.
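A version-dependent command could start from something like the sketch below. It assumes `gs --version` prints a bare version string such as `9.25`, and it uses 9.50 as the threshold because that release made `-dSAFER` the default and introduced the `--permit-file-read` option; both the threshold and the helper names are assumptions to check against the Ghostscript release notes.

```python
import re
import subprocess
from typing import Tuple


def parse_gs_version(version_output: str) -> Tuple[int, ...]:
    """Turn output such as '9.25' or '10.02.1' into a comparable tuple."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError(f"unrecognised Ghostscript version: {version_output!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def needs_permit_file_read(version: Tuple[int, ...]) -> bool:
    """Assumed rule: from 9.50 on, the ICC profile must be explicitly readable."""
    return version >= (9, 50)


def installed_gs_version() -> Tuple[int, ...]:
    """Ask the operating system's Ghostscript for its version."""
    result = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gs_version(result.stdout)
```

The boolean from `needs_permit_file_read(installed_gs_version())` could then decide which flags are added when the normalization command is built.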

For testing the normalization and then the validation, @sromkey, we have been using the veraPDF corpus, https://github.com/veraPDF/veraPDF-corpus/tree/staging, and, to get publicly available documents, the PDFs of some of the specifications (SIP, AIP, DIP, CSIP and some guidelines like the one for CITS SIARD) from the DILCIS Board: https://github.com/DILCISBoard