dasch-swiss / sipi

Simple Image Presentation Interface
https://sipi.io
GNU Affero General Public License v3.0

Validate PDF/A files #285

Open benjamingeer opened 5 years ago

benjamingeer commented 5 years ago

This is the only PDF/A validation library I found that looks serious and well-maintained:

https://verapdf.org/home/

benjamingeer commented 5 years ago

@mrivoal @loicjaouen @gfoo Lukas says we're not going to do this. You have to validate your own PDFs before putting them into Sipi. You could use https://verapdf.org/home/ for that.

mrivoal commented 5 years ago

I guess we could do that ourselves for now. But at some point (and I guess sooner rather than later), when a user can upload a PDF to Sipi using Salsah or KUIRL, their PDF should be validated somehow, shouldn't it?

So, the idea is that Sipi only hosts validated PDF/A?

benjamingeer commented 5 years ago

If I understood correctly, @lrosenth said each project is responsible for ensuring that its data is valid before upload/import. Perhaps a GUI could handle validation before submitting the file to Sipi.
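
A client-side check of that kind could be sketched roughly as follows. This is only an illustration, not anything Sipi provides: it assumes the veraPDF command-line tool is installed as `verapdf`, that it accepts `--flavour` to pick the PDF/A profile, and that it exits with status 0 only when the file conforms. The function names and the `upload` callback are hypothetical.

```python
import subprocess
from typing import Callable, List


def build_command(pdf_path: str, flavour: str = "1b",
                  executable: str = "verapdf") -> List[str]:
    """Build the validator command line (assumed veraPDF CLI syntax)."""
    return [executable, "--flavour", flavour, str(pdf_path)]


def is_valid_pdfa(pdf_path: str, flavour: str = "1b",
                  run: Callable = subprocess.run) -> bool:
    """Return True if the validator reports the file as conforming.

    Assumption: an exit status of 0 means "valid"; any other status
    is treated as "do not upload".
    """
    result = run(build_command(pdf_path, flavour),
                 capture_output=True, text=True)
    return result.returncode == 0


def upload_if_valid(pdf_path: str,
                    upload: Callable[[str], None],
                    validate: Callable[[str], bool] = is_valid_pdfa) -> bool:
    """Call the (hypothetical) upload callback only for validated files."""
    if not validate(pdf_path):
        return False
    upload(pdf_path)
    return True
```

The `run` and `validate` parameters are injectable so the gatekeeping logic can be exercised without veraPDF installed; a GUI would pass its own upload function as `upload`.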

lrosenth commented 5 years ago

Hi to all from Corsica...

Validating a PDF/A is a complex process. I could imagine using the Ghostscript library to do it, but it would be a lot of work. Acrobat can validate PDFs before upload, so I suggest that for the moment it's up to the user. Later, as I mentioned, we can add some validation...

Lukas


On 05.07.2019 at 09:58, Benjamin Geer wrote:

If I understood correctly, @lrosenth said each project is responsible for ensuring that its data is valid before upload/import. Perhaps a GUI could handle validation before submitting the file to Sipi.

subotic commented 5 years ago

I think that Sipi needs to do this. In the context of long-term preservation, this is a must-have. But as @lrosenth said, it will take a bit longer to actually implement.

mrivoal commented 5 years ago

Ok, fine. We will do our own PDF validation for now.

But at some point, Sipi should definitely check, during upload, every file format we are going to accept.

mrivoal commented 5 years ago

@lrosenth: a question regarding the PDF formats we are willing to store and preserve through Sipi.

(As we are going to convert some 1200 PDF files for Lumières.Lausanne, let's choose the right version/format!)


Here are the comments from CINES on PDF/A versions:

| Format | Comment |
| --- | --- |
| PDF/A-1a | Based on PDF 1.4 but more restrictive: no external dependencies, embedded fonts, no transparency, mandatory XMP metadata. This is the preferred archival format, although it is difficult to generate. |
| PDF/A-1b | Based on PDF 1.4; less demanding than 1a, logical document structure not mandatory. A good archival format if PDF/A-1a is too complicated to generate. |
| PDF/A-2a | Based on PDF 1.7; PDF/A files can be embedded, logical structure mandatory. |
| PDF/A-2u | PDF suited to accessibility. |
| PDF/A-2b | Based on PDF 1.7; identical to PDF/A-2a but without mandatory logical structure. |
| PDF/A-3a | Based on PDF 1.7; files of any format can be embedded, logical structure mandatory. Format focused on accessibility. |