Smart quotes and em dash

AndreRenaud / PDFGen

Simple C PDF Writer/Generation library

The Unlicense


Smart quotes and em dash #102

Closed: stephenberry closed this issue 2 years ago

stephenberry commented 2 years ago

Thanks for this nice PDF library. I'm trying to write some fairly standard text to a PDF, but I'm running into issues with the limited UTF-8 support. What is required to add support for a few more characters?

Specifically, I'd like smart quotes and em dash support.

AndreRenaud commented 2 years ago

Hey. Adding support for those shouldn't be too bad. Unfortunately, PDF text is not encoded as UTF-8 if you use the standard fonts; it has its own encoding. If you look in Appendix D of the spec (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf) you can see what characters there are, and what their encoding is. I see that emdash & left/right quotes are in there.

I've had a go at adding the support - can you check out https://github.com/AndreRenaud/PDFGen/tree/smart-quotes and tell me if it fixes things? Let me know if there are other characters, or feel free to add them yourself & send it through. You can see the change that I made there to the switch statement to add them.
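To illustrate the kind of switch-statement mapping described above, here is a minimal sketch rather than the actual PDFGen change. It assumes the fonts use WinAnsiEncoding (one of the encodings tabulated in Appendix D), and the function name codepoint_to_winansi is hypothetical. It takes an already-decoded Unicode code point and returns the single byte that would be written into the PDF text string.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PDFGen: map a Unicode code point to
 * the one-byte code used by WinAnsiEncoding (assumed encoding).
 * Returns -1 when this sketch has no mapping for the character. */
static int codepoint_to_winansi(uint32_t cp)
{
    /* Printable ASCII has the same code in WinAnsiEncoding. */
    if (cp >= 0x20 && cp <= 0x7E)
        return (int)cp;

    switch (cp) {
    case 0x2018: return 0x91; /* left single quotation mark  */
    case 0x2019: return 0x92; /* right single quotation mark */
    case 0x201C: return 0x93; /* left double quotation mark  */
    case 0x201D: return 0x94; /* right double quotation mark */
    case 0x2013: return 0x96; /* en dash */
    case 0x2014: return 0x97; /* em dash */
    default:     return -1;   /* not handled by this sketch  */
    }
}
```

Supporting another character would then mean adding one more case, provided the character exists in the target encoding.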

stephenberry commented 2 years ago

Thanks for the reference and fast reply. This works for pdf_add_text_spacing, but there is a problem with pdf_add_text_wrap in that it calls pdf_text_point_width, which triggers the following error:

    if (code >= 255)
        return pdf_set_err(
            pdf, code_len,
            "Unable to determine width of character code %d", code);

The problem is that the width tables for the various fonts are limited to single-byte values. I couldn't find the character widths described in the PDF reference. I think a separate width array that matches the small set of supported UTF-8 characters will be needed. If you could point me to a resource, I'm happy to implement the fix.

AndreRenaud commented 2 years ago

Ah, that's a good point. In fact, that check is really broken: it's checking the UTF-8 code against what should be a PDF-encoded table... but it is going to require a bit of work to fix up. In the meantime, for your own testing, you could just return some constant value there instead of an error (i.e. return the width of a 'W' or something); at least it would let you keep going.

I'll see if I can work out how to rationalise this so that it's consistent... but I'm not 100% sure when I'll be able to get to it.
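One way to make the width check consistent is to translate the character to its PDF code first and only then index the width table. The sketch below is an illustration under assumptions, not the fix that landed in PDFGen: the names pdf_code_for and glyph_width are hypothetical, WinAnsiEncoding is assumed, and the table is taken to hold 256 widths indexed by PDF code. Unmapped characters fall back to the width of 'W', the stopgap suggested above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping from a Unicode code point to a one-byte PDF code
 * (WinAnsiEncoding assumed). Returns -1 when there is no mapping. */
static int pdf_code_for(uint32_t cp)
{
    if (cp >= 0x20 && cp <= 0x7E)
        return (int)cp;
    switch (cp) {
    case 0x2019: return 0x92; /* right single quotation mark */
    case 0x2014: return 0x97; /* em dash */
    default:     return -1;
    }
}

/* Look up a glyph width in a 256-entry table indexed by the PDF code,
 * not by the Unicode code point. Characters without a mapping use the
 * width of 'W' so that layout can continue instead of failing. */
static uint16_t glyph_width(const uint16_t widths[256], uint32_t cp)
{
    assert(widths != NULL);
    int code = pdf_code_for(cp);
    if (code < 0)
        code = 'W';
    return widths[code];
}
```

Because the index is always a value in 0..255, the table can stay one byte wide even though the input characters are not.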

stephenberry commented 2 years ago

No rush, and thanks for the feedback. I'll probably hack in a small table for now, since an em dash and an apostrophe have significantly different widths. Thanks!

AndreRenaud commented 2 years ago

I had another poke around, and it wasn't that bad. It turns out the font widths as they stand prior to this branch are actually wrong in a few cases anyway. Can you try the latest code on the above branch and see if it behaves better?

stephenberry commented 2 years ago

That update is working great! I briefly tested an edge case where I had a long line of em dashes to wrap, and I got a negative code value, I think because the wrap is probably splitting the line in the middle of a multi-byte UTF-8 character. I'll test more later, but I'd recommend looking at the wrapping algorithm and making sure multi-byte characters won't get split. Thanks again!
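A sketch of how a wrap point can be kept off the middle of a multi-byte character, assuming the text is valid UTF-8. The helper names are hypothetical and this is not necessarily how PDFGen addresses it. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so a proposed break offset can be moved back until it lands on the first byte of a character. One plausible source of a negative code is that plain char is signed on many platforms, so a byte of 0x80 or above reads as negative unless it is cast to unsigned char.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: given a proposed break offset into a
 * NUL-terminated UTF-8 string, move it back until it points at the
 * first byte of a character, so no multi-byte sequence is split. */
static size_t utf8_safe_break(const char *text, size_t offset)
{
    assert(text != NULL);
    /* Cast to unsigned char: a signed char would make bytes >= 0x80 negative. */
    while (offset > 0 && is_utf8_continuation((unsigned char)text[offset]))
        offset--;
    return offset;
}
```

An em dash is encoded as the three bytes E2 80 94, so in a string of two em dashes a proposed break at byte 4 or 5 is moved back to byte 3, the start of the second dash.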

AndreRenaud commented 2 years ago

Thanks for noticing that. I've pushed a possible fix - can you recheck? I've also added that as part of the example program, and it appears to break properly with a long line of emdashes.

stephenberry commented 2 years ago

I checked your fix for wrapping UTF-8 and it worked great!