coherentgraphics / cpdf-binaries

PDF Command Line Tools binaries for Linux, Mac, Windows
GNU Affero General Public License v3.0

-split-bookmarks into UTF-8 "@B" titles #30

Status: Closed (jlcd closed this issue 5 years ago)

jlcd commented 6 years ago

I've read that the -split-bookmarks operation removes some characters, as per:

The bookmark text used for a name is converted from unicode to 7 bit ASCII, and the following
characters are removed, in addition to any character with ASCII code less than 32:
/ ? < > \ : * | " ^ + =

I'm not sure why it was designed this way, but are UTF-8 bookmark titles planned for the near future? If not, may I leave this open as a feature request?
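For readers who want to see what the quoted rule amounts to, here is a minimal Python sketch of the documented behaviour. It is not cpdf's actual OCaml code. It assumes that "converted to 7 bit ASCII" means non-ASCII characters are simply dropped, and that the caret in the list is the plain ASCII `^`. The function name `ascii_bookmark_name` is hypothetical.

```python
# Characters the manual lists as removed from @B names.
# Assumption: the caret in that list is the plain ASCII "^".
REMOVED = set('/?<>\\:*|"^+=')


def ascii_bookmark_name(title: str) -> str:
    """Return a 7-bit ASCII, filename-friendly version of a bookmark title.

    Assumption: characters outside 7-bit ASCII are dropped rather than
    transliterated; the manual does not say how the conversion is done.
    """
    kept = []
    for ch in title:
        code = ord(ch)
        if code < 32 or code > 127:
            # Control characters and non-ASCII characters are dropped.
            continue
        if ch in REMOVED:
            # Explicitly listed special characters are dropped.
            continue
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    print(ascii_bookmark_name('Chapter 1: "Café" <intro>'))
    # prints: Chapter 1 Caf intro
```

The example shows why the request matters: accented and non-Latin titles lose characters under this scheme, even though most modern filesystems would accept them.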

johnwhitington commented 6 years ago

Thanks for the report.

To be clear, this is just for the @B option which includes the name of a bookmark in a file. The bookmarks included in each split file are not affected.

It is a crude way to avoid writing filenames which contain special characters, which can be illegal or hard to wrangle on some systems.

So yes, we can fix this. It would work by adding the -utf8 flag, so that the change remains backward-compatible. We would still remove some special characters, such as newlines, but be UTF8-aware.

jlcd commented 6 years ago

Man, that was a quick response.

Now I realize why you did this. Some characters may not be filename-friendly, so I agree 100% with you. I only just found out about the -utf8 flag (when I read about listing bookmarks). From my [limited] knowledge of cpdf, I guess this would indeed be the flag to use for keeping bookmark filenames [mostly] unchanged. It would mean something like "force UTF-8 filenames, as I'm aware of the consequences".

Of course I don't expect this to be done lightning fast, but do you have an ETA for when this will reach a stable version? Asking just to know if I should work on some workarounds or wait.

And of course, many thanks for this excellent tool.

jlcd commented 6 years ago

Not sure if this is the proper way to generate UTF8 filenames (when the -utf8 flag is set), but I gave it a try: https://github.com/jlcd/cpdf-source/commit/a5e9f4dcc5e56cc1afd7fa493ed7680dad755f22#diff-a1ea83527a319a64f1e227a3add40e68

I've never developed anything with OCaml, so I'm pretty sure some bits are out of place there.

Seems to be working for my scenario.


Edit:

Why does my build take roughly 3.5 times longer to run than the binary from this repository? Even if I download the source and run make on the untouched files, my cpdf ... -split-bookmarks takes ~20s, while this repository's binary takes ~6s.

This repo's binary:

root@21b14f4a4c40:/tmp# time ./cpdf2 -split-bookmarks 0 ./x.pdf -utf8 -o "./my/%%%%% @B.pdf" 

real    0m5.904s
user    0m4.060s
sys     0m0.390s

My version from source (untouched):

root@21b14f4a4c40:/tmp# time ./cpdf -split-bookmarks 0 ./x.pdf -utf8 -o "./my/%%%%% @B.pdf"

real    0m20.454s
user    0m17.910s
sys     0m0.680s

johnwhitington commented 6 years ago

Thanks! I'll take a detailed look soon.

(Speed: did you somehow build the bytecode version rather than the native-code version?)
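One rough way to check this without reading the build log is to inspect the executable itself. The Python sketch below is a heuristic, not an official tool: it relies on OCaml bytecode executables ending in a 12-byte magic string that starts with `Caml1999X`, and on native Linux binaries starting with the ELF magic. The function names are hypothetical, and a binary reported as `elf` here is only "probably native".

```python
import sys

# Prefix of the 12-byte magic that ends an OCaml bytecode executable
# (for example "Caml1999X011"; the digits vary by compiler version).
BYTECODE_MAGIC = b"Caml1999X"
ELF_MAGIC = b"\x7fELF"


def classify(data: bytes) -> str:
    """Classify executable contents as 'bytecode', 'elf', or 'unknown'."""
    if data[-12:].startswith(BYTECODE_MAGIC):
        # A bytecode trailer is present, whatever the file starts with
        # (a "#!...ocamlrun" line or an embedded runtime).
        return "bytecode"
    if data[:4] == ELF_MAGIC:
        # ELF without a bytecode trailer: most likely native code.
        return "elf"
    return "unknown"


def classify_file(path: str) -> str:
    """Read a file from disk and classify it."""
    with open(path, "rb") as f:
        return classify(f.read())


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {classify_file(path)}")
```

Running it as `python3 check.py ./cpdf` (the script name is arbitrary) prints one line per file.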

jlcd commented 6 years ago

Not sure, I just ran make to compile it.

Should I have done it in any other way?

Edit3:

Ok, finally got it. The issue was that I was checking out v2.2.1 and not v2.2-patchlevel1. When I got camlpdf and cpdf both from v2.2-patchlevel1 I got the same quick result I was getting from the binaries within this repository. Steps to success:

root@21b14f4a4c40:/tmp/cpdf-source# opam remove cpdf

[...]

root@21b14f4a4c40:/tmp/cpdf-source# opam remove camlpdf

[...]

root@21b14f4a4c40:/tmp# git clone https://github.com/johnwhitington/camlpdf.git

[...]

root@21b14f4a4c40:/tmp# cd camlpdf/
root@21b14f4a4c40:/tmp/camlpdf# git checkout v2.2-patchlevel1

[...]

root@21b14f4a4c40:/tmp/camlpdf# make

[...]

root@21b14f4a4c40:/tmp/camlpdf# make install

[...]

root@21b14f4a4c40:/tmp# git clone https://github.com/johnwhitington/cpdf-source.git

[...]

root@21b14f4a4c40:/tmp# cd cpdf-source/
root@21b14f4a4c40:/tmp/cpdf-source# git checkout v2.2-patchlevel1

[...]

root@21b14f4a4c40:/tmp/cpdf-source# make

[...]

root@21b14f4a4c40:/tmp/cpdf-source# time ./cpdf -split-bookmarks 0 ../x.pdf -utf8 -o ../my/$RANDOM%%%%%@B.pdf

real    0m6.299s
user    0m4.160s
sys     0m0.610s

Below are some earlier steps I tried. I'm leaving them here in case they help someone who arrives from Google.


Edit1: Pretty sure it's native:

make[1]: Entering directory '/tmp/cpdf-source'
ocamlfind ocamldep -native cpdfcommand.mli > ._ncdi/cpdfcommand.di
ocamlfind ocamldep -native cpdf.mli > ._ncdi/cpdf.di
ocamlfind ocamldep -native cpdfstrftime.mli > ._ncdi/cpdfstrftime.di
ocamlfind ocamldep -native xmlm.mli > ._ncdi/xmlm.di
ocamlfind ocamldep cpdfcommandrun.ml > ._d/cpdfcommandrun.d
ocamlfind ocamldep cpdfcommand.ml > ._d/cpdfcommand.d
ocamlfind ocamldep cpdf.ml > ._d/cpdf.d
ocamlfind ocamldep cpdfstrftime.ml > ._d/cpdfstrftime.d
ocamlfind ocamldep xmlm.ml > ._d/xmlm.d
ocamlfind ocamlc -package camlpdf -c -annot xmlm.mli
ocamlfind ocamlopt -package camlpdf -c -annot -g -w -3 -annot xmlm.ml
ocamlfind ocamlc -package camlpdf -c -annot cpdfstrftime.mli
ocamlfind ocamlopt -package camlpdf -c -annot -g -w -3 -annot cpdfstrftime.ml
ocamlfind ocamlc -package camlpdf -c -annot cpdf.mli
ocamlfind ocamlopt -package camlpdf -c -annot -g -w -3 -annot cpdf.ml
ocamlfind ocamlc -package camlpdf -c -annot cpdfcommand.mli
ocamlfind ocamlopt -package camlpdf -c -annot -g -w -3 -annot cpdfcommand.ml
ocamlfind ocamlopt -package camlpdf -c -annot -g -w -3 -annot cpdfcommandrun.ml
ocamlfind ocamlopt \
            -package camlpdf -linkpkg \
                    -g        -o cpdf \
            xmlm.cmx cpdfstrftime.cmx cpdf.cmx cpdfcommand.cmx cpdfcommandrun.cmx
make[1]: Leaving directory '/tmp/cpdf-source'
make[1]: Entering directory '/tmp/cpdf-source'
ocamlfind ocamlopt -a         -g       -o cpdf.cmxa xmlm.cmx cpdfstrftime.cmx cpdf.cmx cpdfcommand.cmx cpdfcommandrun.cmx
make[1]: Leaving directory '/tmp/cpdf-source'
make[1]: Entering directory '/tmp/cpdf-source'
ocamlfind ocamldep cpdfcommand.mli > ._bcdi/cpdfcommand.di
ocamlfind ocamldep cpdf.mli > ._bcdi/cpdf.di
ocamlfind ocamldep cpdfstrftime.mli > ._bcdi/cpdfstrftime.di
ocamlfind ocamldep xmlm.mli > ._bcdi/xmlm.di
ocamlfind ocamlc -package camlpdf -c -annot -g -w -3 -annot xmlm.ml
ocamlfind ocamlc -package camlpdf -c -annot -g -w -3 -annot cpdfstrftime.ml
ocamlfind ocamlc -package camlpdf -c -annot -g -w -3 -annot cpdf.ml
ocamlfind ocamlc -package camlpdf -c -annot -g -w -3 -annot cpdfcommand.ml
ocamlfind ocamlc -package camlpdf -c -annot -g -w -3 -annot cpdfcommandrun.ml
ocamlfind ocamlmktop \
            -package camlpdf -linkpkg \
                   -g        -o cpdf.top \
            xmlm.cmo cpdfstrftime.cmo cpdf.cmo cpdfcommand.cmo cpdfcommandrun.cmo
make[1]: Leaving directory '/tmp/cpdf-source'
mkdir -p doc/cpdf/html
rm -rf doc/cpdf/html/*
ocamlfind ocamldoc -package camlpdf -html -d doc/cpdf/html xmlm.mli cpdfstrftime.mli cpdf.mli cpdfcommand.mli

Edit2:

I just tested and got the same slow result when installing cpdf with opam install cpdf:

root@21b14f4a4c40:/tmp/cpdf-source# opam info cpdf
             package: cpdf
             version: 2.2.1
          repository: default
        upstream-url: https://github.com/johnwhitington/cpdf-source/archive/v2.2.1.zip
       upstream-kind: http
   upstream-checksum: 5c0caa7bed9452cf7d1ed0492929824d
            homepage: http://github.com/johnwhitington/cpdf-source
         bug-reports: http://github.com/johnwhitington/cpdf-source/issues
            dev-repo: git://github.com/johnwhitington/cpdf-source
              author: John Whitington
             depends: ocamlfind & camlpdf >= 2.2.1
   installed-version: 2.2.1 [system]
  available-versions: 1.7, 2.1.1, 2.2.1
         description: High-level pdf tools based on CamlPDF

root@21b14f4a4c40:/tmp/cpdf-source# opam remove cpdf
The following actions will be performed:
 - remove    cpdf.2.2.1
=== 1 to remove ===

=-=- Removing Packages =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Removing cpdf.2.2.1.
  ocamlfind remove cpdf
root@21b14f4a4c40:/tmp/cpdf-source# opam info cpdf
             package: cpdf
             version: 2.2.1
          repository: default
        upstream-url: https://github.com/johnwhitington/cpdf-source/archive/v2.2.1.zip
       upstream-kind: http
   upstream-checksum: 5c0caa7bed9452cf7d1ed0492929824d
            homepage: http://github.com/johnwhitington/cpdf-source
         bug-reports: http://github.com/johnwhitington/cpdf-source/issues
            dev-repo: git://github.com/johnwhitington/cpdf-source
              author: John Whitington
             depends: ocamlfind & camlpdf >= 2.2.1
   installed-version: 
  available-versions: 1.7, 2.1.1, 2.2.1
         description: High-level pdf tools based on CamlPDF

root@21b14f4a4c40:/tmp/cpdf-source# opam install cpdf
The following actions will be performed:
 - install   cpdf.2.2.1
=== 1 to install ===

=-=- Synchronizing package archives -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

=-=- Installing packages =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Building cpdf.2.2.1:
  make
  make install
Installing cpdf.2.2.1.
root@21b14f4a4c40:/tmp/cpdf-source# time cpdf -split-bookmarks 0 ../x.pdf -utf8 -o ../my/$RANDOM%%%%%@B.pdf

real    0m17.665s
user    0m15.810s
sys     0m0.350s
root@21b14f4a4c40:/tmp/cpdf-source# 

johnwhitington commented 6 years ago

Can you give me the output of file cpdf in the slow OPAM case?

jlcd commented 6 years ago

Ok, just confirming, it really is way slower:

root@21b14f4a4c40:/tmp/cpdf-source# time ./cpdf -split-bookmarks 0 ../x.pdf -utf8 -o ../my/$RANDOM%%%%%@B.pdf

real    0m21.702s
user    0m17.220s
sys     0m1.890s
root@21b14f4a4c40:/tmp/cpdf-source# time ../cpdf -split-bookmarks 0 ../x.pdf -utf8 -o ../my/$RANDOM%%%%%@B.pdf

real    0m6.590s
user    0m4.600s
sys     0m0.310s

And here is the output of file <slower_cpdf> that you asked for:

root@21b14f4a4c40:/tmp/cpdf-source# file ./cpdf
./cpdf: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=490fc74a859c8c6931b0ab1aa0d207abaa092e2e, not stripped

And, in case it helps, the output of file <faster_cpdf>:

root@21b14f4a4c40:/tmp/cpdf-source# file ../cpdf
../cpdf: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=cec462743cfda5b86c8bc7b2e2f9f9ffacc88b89, not stripped

johnwhitington commented 5 years ago

Fixed in the forthcoming v2.3. Bookmark names for @B are stripped as before unless -utf8 is supplied, in which case only problematic characters and characters with codes below 32 are stripped. If -raw is supplied, the text is not processed at all (not recommended).
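The three behaviours described above can be sketched in Python as follows. This is an illustration only, not cpdf's implementation. It assumes the "problematic characters" are the same set listed in the manual excerpt at the top of the thread (with the caret read as ASCII `^`), and that the default mode drops non-ASCII characters. The function name and its `utf8` and `raw` parameters are hypothetical stand-ins for the -utf8 and -raw flags.

```python
# Assumed set of "problematic" filename characters, taken from the
# manual excerpt quoted earlier in the thread.
PROBLEMATIC = set('/?<>\\:*|"^+=')


def bookmark_filename(title: str, utf8: bool = False, raw: bool = False) -> str:
    """Return the text substituted for @B under each mode.

    raw=True   : title is returned untouched.
    utf8=True  : only control characters (< 32) and PROBLEMATIC are removed.
    default    : additionally, non-ASCII characters are removed (assumption).
    """
    if raw:
        return title

    def keep(ch: str) -> bool:
        code = ord(ch)
        if code < 32 or ch in PROBLEMATIC:
            return False
        # In the default mode only 7-bit ASCII survives.
        return utf8 or code < 128

    return "".join(ch for ch in title if keep(ch))


if __name__ == "__main__":
    title = "Résumé: 日本語?"
    print(repr(bookmark_filename(title)))             # 'Rsum '
    print(repr(bookmark_filename(title, utf8=True)))  # 'Résumé 日本語'
    print(repr(bookmark_filename(title, raw=True)))   # 'Résumé: 日本語?'
```

The raw result still contains `:` and `?`, which are invalid in Windows filenames, which is one reason unprocessed names are not recommended.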