Open aspence23 opened 4 years ago
Some context from the user forum: https://groups.google.com/d/msg/archivematica/AQv5Ltj0_2s/f19eZou9AwAJ
Another (larger) issue is PDF/A validation within Archivematica. Using veraPDF would be a positive step (possibly facilitated by MediaConch?).
Questions: Does this affect all PDFs, or only ones with color? So if a PDF does not designate an RGB or CMYK color space, the PDF/A does not conform?
Just came across this issue because we were noticing some problems with the PDF/As normalized by Archivematica and did some reading up on it. To summarize, if I have this correctly: Ghostscript out of the box doesn't have a valid ICC profile for creating PDF/As and requires a valid one to be added manually. I'm not sure how this could be integrated into Archivematica, but I found this script that takes care of that specific Ghostscript shortcoming (adding the ICC profile) and makes it easy to create PDF/As that are more likely to be valid. I've just used it to manually normalize a whole batch of PDFs.
Noting a variation of this issue that we've come across. In our case, a PDF has a missing glyph in the OCR font and Ghostscript normalizes to a PDF 1.7 that is not PDF/A compliant.
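For anyone wanting to try the ICC profile approach by hand, here is a minimal sketch of how such a Ghostscript call could be assembled. It is not the script mentioned above and not Archivematica's FPR command. It assumes `gs` is on the PATH and that a copy of the `PDFA_def.ps` file shipped with Ghostscript has been edited to point at a valid ICC profile. The function name and the file paths are hypothetical.

```python
def build_pdfa_command(source, target, pdfa_def="PDFA_def.ps", level=2, policy=1):
    """Return a Ghostscript argument list that requests PDF/A output.

    pdfa_def is assumed to be an edited copy of Ghostscript's PDFA_def.ps
    whose ICC profile entry points at a real profile file.
    """
    return [
        "gs",
        f"-dPDFA={level}",                 # requested PDF/A level
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",   # convert colours to a single space
        # Per the Ghostscript docs: 0 = fall back to plain PDF,
        # 1 = drop the offending feature, 2 = abort with an error.
        f"-dPDFACompatibilityPolicy={policy}",
        f"-sOutputFile={target}",
        str(pdfa_def),
        str(source),
    ]


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    print(" ".join(build_pdfa_command("input.pdf", "output-pdfa.pdf")))
```

The list can be passed to `subprocess.run`. Recent Ghostscript versions run in safer mode by default, so the ICC profile may also need to be made readable explicitly; that part is left out here.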
Our concern is that the normalization report returns "Preservation normalization failed: No" when "transcoding to pdfa" has in fact failed.
Expected behaviour: The normalization report returns "Preservation normalization failed: Yes" when the "Transcoding to pdfa with Ghostscript" command does not generate a PDF/A compliant copy.
Current behaviour: Using the default command, the normalization report returns "Preservation normalization failed" as "No" even though the task output contains the following stderr:
GPL Ghostscript 9.25: Missing glyph CID=0, glyph=007e in the font HiddenHorzOCR . The output PDF may fail with some viewers.
GPL Ghostscript 9.25: All used glyphs mst be present in fonts for PDF/A, reverting to normal PDF output.
A preservation derivative is generated but it is a PDF 1.7 that fails veraPDF validation.
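One way the expected behaviour could be approximated is to treat the fallback message in stderr as a failure even when Ghostscript exits with 0. The sketch below is an assumption about how a wrapper might do this, not how Archivematica currently decides success. The function name is hypothetical and the marker string is taken from the stderr quoted above.

```python
# Substring taken from the Ghostscript stderr quoted in this issue.
REVERT_MARKER = "reverting to normal PDF output"


def pdfa_conversion_failed(returncode, stderr):
    """Return True if the run should count as a failed PDF/A normalization.

    A non-zero exit code is a failure, and so is a zero exit code when
    Ghostscript reports that it fell back to plain PDF output.
    """
    return returncode != 0 or REVERT_MARKER in stderr


if __name__ == "__main__":
    sample = (
        "GPL Ghostscript 9.25: Missing glyph CID=0, glyph=007e in the font "
        "HiddenHorzOCR . The output PDF may fail with some viewers.\n"
        "GPL Ghostscript 9.25: All used glyphs mst be present in fonts for "
        "PDF/A, reverting to normal PDF output.\n"
    )
    print(pdfa_conversion_failed(0, sample))
```

Matching on message text is brittle across Ghostscript versions, so validating the output file with veraPDF would be a more reliable check.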
It would be great if:
We are also testing Ghostscript and validating the resulting PDFs with veraPDF, with the same result as already described. What we can see, though, is that changes are being implemented in Ghostscript: https://ghostscript.readthedocs.io/en/latest/Readme.html. Will this lead to changes in Archivematica's use of Ghostscript?
We're doing some FPR work in 1.17, so maybe this is a good moment to update Ghostscript. @replaceafill, @Dhwaniartefact, do you have any thoughts?
From further conversations with the devs I now understand that upgrading Ghostscript isn't really the issue because we don't package specific Ghostscript versions for releases, we use (by default) whichever version is in the operating system.
We would, however, like to do some experimentation with veraPDF, with an eye to possibly getting it used within Archivematica for PDF/A validation, which seems like a useful outcome. @karinbredenberg, do you have a sample file that we could use (you could email it to me)? We can keep it for internal testing only if it's not something that can be made public.
It was good to learn that it's the Ghostscript in the operating system that is used. That might mean that the command call needs to be updated depending on the version in the operating system, and that the ICC profiles need to be added or updated and also handled through the command. Something to look into further.
@sromkey For testing the normalization and then the validation, we have been using the veraPDF corpus, https://github.com/veraPDF/veraPDF-corpus/tree/staging, and, to have publicly available documents, the PDFs of some of the specifications from the DILCIS Board (SIP, AIP, DIP, CSIP and some guidelines like the one for CITS SIARD): https://github.com/DILCISBoard
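As a starting point for version-dependent commands, the installed Ghostscript version can be read from `gs --version`. This is a rough sketch under the assumption that `gs` is on the PATH. The helper names are hypothetical, and the sketch does not say which versions need which flags.

```python
import re
import subprocess


def parse_gs_version(output):
    """Turn text such as '9.25' or '10.02.1' into a tuple of integers."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def installed_gs_version():
    """Ask the operating system's Ghostscript for its version."""
    result = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gs_version(result.stdout)
```

A command wrapper could then compare the tuple, for example `installed_gs_version() >= (9, 50, 0)`, before choosing its arguments. The threshold shown is only a placeholder.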
Expected behaviour: PDF/A files generated by the Ghostscript command during the normalisation microservice pass validation checks in veraPDF.
Current behaviour: A number of PDF/A files generated by the Ghostscript command in Archivematica fail validation in veraPDF and are not considered compliant with the PDF/A specification. Colour space issues are the main problem.
Steps to reproduce: Ingest a PDF version 1.4, 1.5 or 1.6. Choose to normalise it for preservation. Verify the normalised version of the file generated by Archivematica using the veraPDF tool.
Your environment (version of Archivematica, operating system, other relevant details): Version 1.9.2
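The verification step in the reproduction steps could be scripted roughly as follows. This sketch assumes the veraPDF command line tool is installed as `verapdf`, that its text report format starts each result line with PASS or FAIL, and that PDF/A-1b is the flavour of interest. All three are assumptions to check against the installed veraPDF version, and the function names are hypothetical.

```python
import subprocess


def build_verapdf_command(pdf_path, flavour="1b"):
    """Return a veraPDF CLI call that validates one file against a flavour."""
    return ["verapdf", "--flavour", flavour, "--format", "text", str(pdf_path)]


def report_has_failure(report_text):
    """Return True if any line of a veraPDF text report starts with FAIL."""
    return any(
        line.strip().startswith("FAIL") for line in report_text.splitlines()
    )


def is_valid_pdfa(pdf_path, flavour="1b"):
    """Run veraPDF on a file and return True only if nothing failed."""
    result = subprocess.run(
        build_verapdf_command(pdf_path, flavour),
        capture_output=True,
        text=True,
    )
    return not report_has_failure(result.stdout)
```

Calling `is_valid_pdfa` on the preservation derivative from a transfer would give a quick yes or no before looking at the full veraPDF report.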