coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows

Using -split-bookmarks with PDF files with very long bookmark text #43

Open reasonableperson opened 4 years ago

reasonableperson commented 4 years ago

Some PDFs contain bookmarks with very long text, like this rather silly one where bookmarks are named after the first 255 characters of text in each paragraph. This means the @B parameter cannot be used with cpdf's -split-bookmarks command, at least if you want to add a .pdf extension to your output or add the bookmark number as a prefix, because most filesystems do not support filenames longer than 255 characters (or, on common Linux filesystems such as ext4, 255 bytes).

$ curl https://www.supremecourt.uk/cases/docs/uksc-2019-0192-judgment.pdf -o in.pdf
$ cpdf in.pdf -split-bookmarks 0 -o '@B.pdf'
For non-commercial use only
To purchase a license visit http://www.coherentpdf.com/

1. It is important to emphasise that the issue in these appeals is not when and on what terms the United Kingdom is to leave the European Union. The issue is whether the advice given by the Prime Minister to Her Majesty the Queen on 27th or 28th Augus....pdf: File name too long

As a workaround, I am going to try parsing the output of -list-bookmarks and using it to repeatedly call cpdf in.pdf <bookmark-i-page-number>-<bookmark-i+1-page-number> <truncated-filename>, but that means manually reimplementing much of what -split-bookmarks already does. If there were some way to truncate the result of the @B parameter without writing my own script, that'd be great.
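For anyone attempting the same workaround, here is a minimal Python sketch of it. It assumes that -list-bookmarks prints one bookmark per line as a level, a double-quoted title, and a page number (that line shape is an assumption here, and titles containing quote characters may not parse correctly). The helpers `parse_bookmarks`, `page_ranges`, `safe_name` and `split_commands` are hypothetical names, the total page count has to be supplied by the caller, and the sketch only builds the cpdf argument lists rather than running them.

```python
import re

# Assumed line shape: <level> "<title>" <page> [anything else]
BOOKMARK_LINE = re.compile(r'^(\d+) "(.*)" (\d+)')


def parse_bookmarks(listing):
    """Return (level, title, page) tuples from the assumed listing format."""
    found = []
    for line in listing.splitlines():
        m = BOOKMARK_LINE.match(line.strip())
        if m:
            found.append((int(m.group(1)), m.group(2), int(m.group(3))))
    return found


def page_ranges(bookmarks, total_pages, level=0):
    """Pair each bookmark at `level` with the pages up to the next one."""
    # Page 0 is treated as "does not point at a page" and skipped.
    starts = [(t, p) for lvl, t, p in bookmarks if lvl == level and p > 0]
    ranges = []
    for i, (title, first) in enumerate(starts):
        nxt = starts[i + 1][1] if i + 1 < len(starts) else total_pages + 1
        # Two bookmarks on the same page still get a one-page range.
        ranges.append((title, first, max(first, nxt - 1)))
    return ranges


def safe_name(index, title, limit=255, ext=".pdf"):
    """Build 'NNN title.pdf' whose UTF-8 encoding fits in `limit` bytes."""
    prefix = f"{index:03d} "
    budget = limit - len(prefix.encode("utf-8")) - len(ext.encode("utf-8"))
    clean = title.replace("/", "_")  # '/' cannot appear in a file name
    # Slicing bytes may cut a character in half; errors="ignore" drops
    # the dangling partial character at the end.
    short = clean.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return prefix + short + ext


def split_commands(listing, total_pages, src="in.pdf"):
    """Return one 'cpdf <src> <first>-<last> -o <name>' argv per bookmark."""
    ranges = page_ranges(parse_bookmarks(listing), total_pages)
    return [
        ["cpdf", src, f"{first}-{last}", "-o", safe_name(i, title)]
        for i, (title, first, last) in enumerate(ranges, start=1)
    ]
```

Each returned list could be passed to `subprocess.run` once the listing format has been checked against real output. The 255-byte default matches the limit discussed above but may need lowering on other filesystems.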

johnwhitington commented 4 years ago

We would be happy to add a -truncate option for a commercial customer.

The other option is to split bookmarks producing serial numbers 001.pdf, 002.pdf and then rename those according to the output of -list-bookmarks. But that may be no easier than your suggestion above.

johnwhitington commented 4 months ago

Had a quick look at this, and it seems that truncating UTF-8 is not as easy as it looks. Doing it properly, without breaking grapheme clusters, needs something like:

https://metacpan.org/pod/Unicode::Truncate

An easier version, which could break grapheme clusters but which at least produces a valid UTF-8 string, is given here:

https://stackoverflow.com/questions/35328529/stdstring-optimal-way-to-truncate-utf-8-at-safe-place
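As an illustration of that easier approach, here is a rough Python sketch (not cpdf's code, and not copied from the linked answer). It cuts the encoded bytes at the limit and then steps back past any UTF-8 continuation bytes so the result never ends mid-code-point. The name `truncate_utf8` is hypothetical.

```python
def truncate_utf8(text, max_bytes):
    """Shorten `text` so its UTF-8 encoding is at most `max_bytes` long.

    The cut never lands inside a code point, but it may separate a base
    character from a following combining mark (a grapheme cluster).
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max(max_bytes, 0)
    # Continuation bytes have the bit pattern 10xxxxxx. While the byte at
    # the cut position is one, the cut is inside a code point: step back.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


print(truncate_utf8("héllo", 2))   # 'é' does not fit in the 2nd byte
print(truncate_utf8("e\u0301x", 1))  # valid UTF-8, but the accent is lost
```

The second call shows the caveat mentioned above: the output is a valid string, but the combining accent has been separated from its base letter, which is what a grapheme-aware truncation would avoid.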