PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

[CTS 17] Extract ADA compliant PDFs by page range. #76

Open neqelr17 opened 6 years ago

neqelr17 commented 6 years ago

Need to extract a page range from an ADA compliant PDF and not damage or lose the ADA compliance.

PhilterPaper commented 6 years ago

Per our exchange of emails on the subject, this concerns PDF "accessibility", and more specifically, the use of tags to mark semantics in a document (there are other aspects of accessibility). James reports that all the tools he's tried, while successfully extracting page ranges into new documents, break or corrupt the tags and thus accessibility.

I have asked him if he could supply a sample multipage tagged PDF (publicly viewable) that I could try out. I'm looking to see if anyone has written a good page extractor for PDF::API2 that I could port to PDF::Builder, and then see what it does to the tags. That would be a good starting point. I'll also try one or two tools I have (e.g., PDFtk). If others could try splitting a tagged PDF and report on other tools, that would be welcome, especially if they are open source and we can look at the algorithm.
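
As a baseline for that experiment, a minimal sketch of page-range extraction with PDF::Builder might look like the following. It assumes the `open`, `pages`, `import_page`, and `saveas` methods that PDF::Builder inherits from PDF::API2; the helper names `check_range` and `extract_page_range` are hypothetical. Nothing here attempts to carry over the structure tree, so it is only a starting point for seeing what happens to the tags, not a fix.

```perl
use strict;
use warnings;

# Hypothetical helper: die unless $first..$last is a valid 1-based
# page range within a document of $count pages.
sub check_range {
    my ($first, $last, $count) = @_;
    for my $n ($first, $last) {
        die "page numbers must be positive integers\n"
            unless defined $n && $n =~ /^[1-9]\d*$/;
    }
    die "first page ($first) is after last page ($last)\n" if $first > $last;
    die "last page ($last) exceeds page count ($count)\n"  if $last > $count;
    return 1;
}

# Hypothetical helper: copy pages $first..$last of $in_file into a new
# document and write it to $out_file. Tags are NOT handled here.
sub extract_page_range {
    my ($in_file, $out_file, $first, $last) = @_;

    # Loaded here so check_range() can be used without the module installed.
    require PDF::Builder;

    my $src = PDF::Builder->open($in_file);
    check_range($first, $last, $src->pages());

    my $dst = PDF::Builder->new();
    # A target page number of 0 appends the imported page at the end.
    $dst->import_page($src, $_, 0) for $first .. $last;

    $dst->saveas($out_file);
    return;
}
```

A call such as `extract_page_range('tagged.pdf', 'pages_2_to_4.pdf', 2, 4)` (file names are placeholders) would produce a three-page document that can then be checked with an accessibility tool to see whether any tagging survived.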

Good information: Adobe's PDF Accessibility Overview. Sections 14.6-14.9 of the PDF-1.7 specification seem to cover this area, and it appears to date back to at least PDF-1.4 (according to a quick skim).

PhilterPaper commented 6 years ago

While doing some work on encodings and various font types (core fonts, PS/T1 fonts, TTF/OTF), it appeared that text set in TTF and OTF ("truetype") fonts showed up in the PDF file in binary (i.e., not human readable). This could pose a problem if there's no way to get this text in readable form, so that screen readers, Braille imprinters, etc. can do their thing. It's not definite that they can't, but consider this a "heads up". It's possible that producing "accessible" documents will restrict producers to core fonts and PS (Type1) fonts, which in turn limit them to 256 glyphs at a time.

Does anyone know if screen readers, etc. can handle text for TTF and OTF fonts in binary format, or alternately, if there is a way to use human-readable formatting in the PDF file?

PhilterPaper commented 6 years ago

This issue is partly an enhancement (new feature), for tagged structure, and partly a bug report for tags not being properly handled. See also RT 120375 (#52).