criticalinfralab / criticalinfralab

print automation, website, backup scripts
GNU General Public License v3.0

Make PDF documents conform to PDF/UA-1 standard #1

Open u451f opened 3 weeks ago

u451f commented 3 weeks ago

Identified for now:

u451f commented 3 weeks ago

Testing only the contents of CIL006:

Specification: ISO 14289-1:2014, Clause: 7.18.5, Test number: 1
Links shall be tagged according to ISO 32000-1:2008, 14.8.4.4.2, Link Element
Result: Failed
22 occurrences
PDLinkAnnot
structParentStandardType == 'Link' || isOutsideCropBox == true || (F & 2) == 2
root/document[0]/pages[0](6 0 obj PDPage)/annots[0](7 0 obj PDLinkAnnot)

→ I suspect this is related to the links in the table of contents. We likely have little influence over these, so we would need to open a bug report with WeasyPrint once the hunch is confirmed.
→ This is confirmed: it is related to the generation of the ToC.
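For reference, the rule expression in the report above can be read as a plain predicate. The sketch below restates it in Python. The `LinkAnnot` class and its field names are hypothetical stand-ins for the validator's PDLinkAnnot object, not a real API. The value 2 in an annotation's /F entry is the Hidden flag.

```python
from dataclasses import dataclass
from typing import Optional

HIDDEN_FLAG = 2  # value 2 in the annotation's /F entry marks it as Hidden


@dataclass
class LinkAnnot:
    """Hypothetical stand-in for the validator's PDLinkAnnot object."""
    struct_parent_standard_type: Optional[str]  # e.g. "Link", or None if untagged
    is_outside_crop_box: bool
    flags: int  # the annotation's /F value


def passes_link_rule(annot: LinkAnnot) -> bool:
    """Mirror of: structParentStandardType == 'Link'
    || isOutsideCropBox == true || (F & 2) == 2"""
    return (
        annot.struct_parent_standard_type == "Link"
        or annot.is_outside_crop_box
        or (annot.flags & HIDDEN_FLAG) == HIDDEN_FLAG
    )
```

Read this way, a visible link annotation inside the crop box fails unless its parent in the structure tree is a Link element, which matches the ToC links being reported.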

u451f commented 3 weeks ago

Another report was that the language tag was missing. This is fixed by adding the lang element to the metadata, as done in 87763dc. This needs to be documented.

u451f commented 3 weeks ago

Helpful document: https://pdfa.org/wp-content/uploads/2019/06/TaggedPDFBestPracticeGuideSyntax.pdf

u451f commented 3 weeks ago

Another error:

Specification: ISO 14289-1:2014, Clause: 7.2, Test number: 19

I have not (yet) found out which elements these are; I probably lack the right tool to look into it in more detail.
L element may contain only L, LI and Caption elements
Result: Failed
9 occurrences

Totally unclear what this means. Check with pandoc -s -t native CIL006/3main.md
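One way to narrow this down might be to count the list constructs in pandoc's JSON output (`pandoc -t json`) and compare the numbers with the 9 reported occurrences. The sketch below is only an assumption about where to look: it walks the parsed AST and counts BulletList, OrderedList and DefinitionList nodes. The function name is made up for illustration, and it does not say which of these lists ends up as an invalid L element in the PDF.

```python
from collections import Counter

# Pandoc AST block types that represent lists.
LIST_TYPES = {"BulletList", "OrderedList", "DefinitionList"}


def count_list_blocks(node, counts=None):
    """Recursively count list blocks in a parsed pandoc JSON document."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        tag = node.get("t")
        # Guard against metadata keys that happen to be called "t".
        if isinstance(tag, str) and tag in LIST_TYPES:
            counts[tag] += 1
        for value in node.values():
            count_list_blocks(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_list_blocks(item, counts)
    return counts
```

To use it, load the output of `pandoc -t json CIL006/3main.md` with `json.load` and pass the result to `count_list_blocks`.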

u451f commented 3 weeks ago

Specification: ISO 14289-1:2014, Clause: 7.4.2, Test number: 1

For documents that are not strongly structured, as described in ISO 32000-1:2008, 14.8.4.3.5, heading tags shall be used as follows:
(*) If any heading tags are used, H1 shall be the first.
(*) A document may use more than one instance of any specific tag level. For example, a tag level may be repeated if document content requires it.
(*) If document semantics require a descending sequence of headers, such a sequence shall proceed in strict numerical order and shall not skip an intervening heading level.
(*) A document may increment its heading sequence without restarting at H1 if document semantics require it.
Result: Failed
3 occurrences

→ This could be because we use H6 for pulled quotes, but this needs to be confirmed.
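To test the H6 hunch, the heading rule quoted above can be checked against the sequence of heading levels in a document. This is a minimal sketch under one reading of the clause: the first heading must be H1, and a heading may go at most one level deeper than the one before it, while moving back up is always allowed. The function name is made up for illustration, and extracting the levels from the source files is left out.

```python
def heading_violations(levels):
    """Return (index, message) pairs for headings that break the rule.

    `levels` is the document's heading levels in reading order,
    e.g. [1, 2, 3, 2] for H1, H2, H3, H2.
    """
    problems = []
    if levels and levels[0] != 1:
        problems.append((0, f"first heading is H{levels[0]}, expected H1"))
    for i in range(1, len(levels)):
        prev, cur = levels[i - 1], levels[i]
        # Going deeper by more than one level skips an intervening heading.
        if cur > prev + 1:
            problems.append((i, f"H{prev} followed by H{cur} skips a level"))
    return problems
```

With this reading, an H6 pulled quote placed under an H2 would be flagged, for example by `heading_violations([1, 2, 6, 2])`.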