UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Read document structure and apply PDF accessibility tag? #873

Open gobsmack opened 3 months ago

gobsmack commented 3 months ago

I am working on a project to make existing PDF documents accessible. I am trying to switch from PdfSharp to PdfPig because it seems to handle metadata better. My goal is to read the document structure, and then add accessibility tags to the PDF.

I've got a pretty good start using PdfMerger: I can copy the input document to an output document with all the metadata. But I'm stuck on the tags, which are the whole point. I'm getting a tagged PDF, but there are no tags in it.

It looks like this is all written into PdfPig already. PdfPig can analyze the document structure. So, I wonder if there is a way to write the tags based on the document structure.

Am I missing something? Could somebody point me in the correct direction?

gobsmack commented 3 months ago

More specifically, I wonder if it's possible to create the tag structure based on the bookmark structure.

EliotJones commented 1 month ago

Sorry, it has been so long since I last worked with PDFs that I don't recall what is and isn't available. On that basis, I think the library as-is probably doesn't support this. Editing is not really full-featured yet. PDFSharp may have a better editing story here.

I think if tags were per-page it would be possible to insert them directly, but I assume the tags you're referring to are a document-level structure like AcroForms or Bookmarks. Unfortunately there's no API support for writing custom objects at this level yet. There are two paths to enable such a thing. The first is dedicated support for tags, à la PdfDocumentBuilder.CreateBookmarkTree. The second is support for writing arbitrary PDF objects in PdfDocumentBuilder, which would probably be a fairly well-isolated change: it would just require being able to call context.WriteToken(...) for some list of objects attached to the builder, and to add the required keys to the catalog or trailer dictionaries to plug in the user's desired functionality. In this case it sounds like you need to set both the MarkInfo and StructTreeRoot entries in the catalog dictionary, as well as write the actual structure tree. Alas, this is not something the library currently supports.
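To make the catalog entries mentioned above a bit more concrete, here is a rough sketch of what would have to be written for a structure tree derived from bookmarks. It does not use PdfPig at all: `Outline`, `StructTreeSketch` and `Build` are hypothetical names invented for illustration, and the output is just raw PDF object syntax as text. It assumes one `/Sect` structure element per bookmark, and it leaves out everything that ties tags to page content (marked-content IDs, the `/ParentTree`), so on its own this would not produce a properly tagged PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a bookmark node; this is NOT a PdfPig type.
public sealed record Outline(string Title, IReadOnlyList<Outline> Children);

public static class StructTreeSketch
{
    // Emits the two catalog entries plus a skeleton structure tree,
    // numbering the new indirect objects from firstObjectNumber upwards.
    public static string Build(IReadOnlyList<Outline> roots, int firstObjectNumber)
    {
        var elements = new StringBuilder();
        int next = firstObjectNumber;
        int rootNumber = next++;

        var kids = new List<string>();
        foreach (var outline in roots)
        {
            kids.Add(WriteElement(outline, rootNumber, ref next, elements) + " 0 R");
        }
        string kidRefs = string.Join(" ", kids);

        var result = new StringBuilder();
        result.AppendLine("% Entries to add to the catalog dictionary:");
        result.AppendLine("/MarkInfo << /Marked true >>");
        result.AppendLine($"/StructTreeRoot {rootNumber} 0 R");
        result.AppendLine("% New indirect objects:");
        result.AppendLine($"{rootNumber} 0 obj");
        result.AppendLine($"<< /Type /StructTreeRoot /K [{kidRefs}] >>");
        result.AppendLine("endobj");
        result.Append(elements);
        return result.ToString();
    }

    // Writes one structure element (children first) and returns its object number.
    private static int WriteElement(Outline outline, int parent, ref int next, StringBuilder elements)
    {
        int number = next++;
        var kids = new List<string>();
        foreach (var child in outline.Children)
        {
            kids.Add(WriteElement(child, number, ref next, elements) + " 0 R");
        }
        string kidRefs = string.Join(" ", kids);

        elements.AppendLine($"{number} 0 obj");
        elements.AppendLine($"<< /Type /StructElem /S /Sect /P {parent} 0 R /T ({Escape(outline.Title)}) /K [{kidRefs}] >>");
        elements.AppendLine("endobj");
        return number;
    }

    // Escapes backslashes and parentheses for a PDF literal string.
    // Assumption: titles are plain ASCII; other text would need proper encoding.
    private static string Escape(string s) =>
        s.Replace("\\", "\\\\").Replace("(", "\\(").Replace(")", "\\)");
}
```

Calling `StructTreeSketch.Build(roots, 10)` with a single "Intro" bookmark that has one "Scope" child would number the root 10, the "Intro" element 11 and the "Scope" element 12. Getting those objects and catalog keys into the output file is exactly the part that needs one of the two builder changes described above.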