Open benjamingeer opened 5 years ago
@mrivoal @loicjaouen @gfoo Lukas says we're not going to do this. You have to validate your own PDFs before putting them into Sipi. You could use https://verapdf.org/home/ for that.
I guess we could do that ourselves for now. But at some point (and I guess sooner rather than later), when a user can upload a PDF to Sipi using Salsah or KUIRL, that PDF should be validated somehow, shouldn't it?
So, the idea is that Sipi only hosts validated PDF/A?
If I understood correctly, @lrosenth said each project is responsible for ensuring that its data is valid before upload/import. Perhaps a GUI could handle validation before submitting the file to Sipi.
Hi to all from Corsica...
Validating a PDF/A is a complex process. I could imagine using the Ghostscript library to do it, but it would be a lot of work. Acrobat can validate PDFs before upload. So I suggest that for the moment it's up to the user. Later, as I mentioned, we can add some validation...
Lukas
I think that Sipi needs to do this. In the context of long-term preservation, this is a must-have. But as @lrosenth said, it will take a bit longer to actually implement.
Ok, fine. We will do our own PDF validation for now.
But at some point, Sipi should definitely validate, during upload, every file format we are going to accept.
@lrosenth: a question regarding the PDF formats we are willing to store and preserve through Sipi.
Are we only going to accept PDF/A, or will we also accept other well-formed and validated versions of PDF? I am asking because other archives (such as the CINES, in France) accept other versions of PDF, provided the files are validated against the declared format.
If we only accept PDF/A, what is the preferred version of PDF/A we should accept? The CINES seems to prefer PDF/A-1a for archiving, but what do you think?
(As we are going to convert some 1200 PDF files for Lumières.Lausanne, let's choose the right version/format!)
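Since the question hinges on validating against the *declared* format, here is a minimal sketch of how one might read that declaration before picking a validation profile. It assumes the XMP packet is stored uncompressed and uses the standard `pdfaid:part` / `pdfaid:conformance` properties; the function name `declared_pdf_format` is hypothetical. It only reports what the file claims to be, so it is not a validation in itself.

```python
import re
from typing import Optional

# XMP may store the pdfaid properties as elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); both forms are matched.
_PART = re.compile(rb"""pdfaid:part\s*(?:>|=\s*["'])\s*(\d)""")
_CONF = re.compile(rb"""pdfaid:conformance\s*(?:>|=\s*["'])\s*([A-Za-z])""")
_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def declared_pdf_format(data: bytes) -> Optional[str]:
    """Return the format a file claims to be, e.g. 'PDF/A-1a' or 'PDF 1.5'.

    Returns None if no PDF header is found near the start of the data.
    Assumption: the XMP metadata is not compressed, so a plain byte
    search can see it.
    """
    header = _HEADER.search(data[:1024])
    if header is None:
        return None
    part = _PART.search(data)
    if part is None:
        # No PDF/A claim: fall back to the plain PDF version from the header.
        return "PDF " + header.group(1).decode("ascii")
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return "PDF/A-" + part.group(1).decode("ascii") + level
```

A batch conversion could call this on each output file (`declared_pdf_format(open(path, "rb").read())`) to confirm that the converter stamped the intended flavour before running a real validator on it.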
Here are the comments from the CINES on PDF/A versions:
Format | Comment |
---|---|
PDFA 1a | Based on PDF 1.4 but more restrictive: no external dependencies, embedded fonts, no transparency, mandatory XMP metadata. This is the preferred archival format, although it is difficult to generate. |
PDFA 1b | Based on PDF 1.4 - less demanding than 1a, logical document structure not required. A good archival format if PDFA-1a is too complicated to generate. |
PDFA 2a | Based on PDF 1.7 - PDF/A files can be embedded, logical structure required. |
PDFA 2u | PDF suited to accessibility. |
PDFA 2b | Based on PDF 1.7, identical to PDF/A-2a but without the logical structure requirement. |
PDFA 3a | Based on PDF 1.7 - files of any format can be embedded, logical structure required. Format focused on accessibility. |
This is the only PDF/A validation library I found that looks serious and well-maintained:
https://verapdf.org/home/
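For anyone scripting the pre-upload check, here is a rough sketch of wrapping the veraPDF command-line tool from Python. It rests on several assumptions that should be checked against the installed version: that a `verapdf` executable is on the PATH, that it accepts `--flavour`, that it writes an XML report to standard output by default, and that the report contains `validationReport` elements with an `isCompliant` attribute. The function names are hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET
from typing import List


def verapdf_command(pdf_path: str, flavour: str = "1b") -> List[str]:
    """Build the command line (assumes `verapdf` is on PATH and accepts --flavour)."""
    return ["verapdf", "--flavour", flavour, pdf_path]


def is_compliant(report_xml: str) -> bool:
    """Return True only if the report has validation results and all are compliant.

    Assumption: each result is a `validationReport` element carrying an
    `isCompliant="true"` or `"false"` attribute.
    """
    root = ET.fromstring(report_xml)
    reports = [el for el in root.iter() if el.tag.endswith("validationReport")]
    return bool(reports) and all(el.get("isCompliant") == "true" for el in reports)


def validate(pdf_path: str, flavour: str = "1b") -> bool:
    """Run veraPDF on one file and interpret the report it prints to stdout."""
    result = subprocess.run(
        verapdf_command(pdf_path, flavour),
        capture_output=True,
        text=True,
        check=False,  # judge by the report, not the exit code
    )
    return is_compliant(result.stdout)
```

Keeping the report parsing separate from the subprocess call means the parsing can be tested without veraPDF installed, and a project could call `validate(path, "1a")` for each file before importing it into Sipi.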