coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Avoid duplicate images #569

Open Manouchehri opened 9 years ago

Manouchehri commented 9 years ago

After running pdf2htmlEX --embed cfijo --split-pages 1 [filename], I noticed there were quite a few duplicate images. Running fdupes . shows:

./bg17.png
./bg16.png

./bg2.png
./bg4.png
./bg5.png
./bg6.png
./bg7.png
./bg12.png
./bg13.png
./bg14.png
./bg15.png
./bg1c.png
./bg1f.png
./bg21.png
./bg20.png
./bg22.png
./bg24.png
./bg23.png
./bg25.png
./bg28.png
./bg2a.png

At the moment I'm using a bash script to create soft links for duplicate files, but it would be nice if pdf2htmlEX did that automatically. It saves a bit of space and bandwidth, and I don't think it's too difficult to implement. (If you're wondering why I didn't use hard links, it's because Git does not handle them.)

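#!/bin/bash
# Replace duplicate files with symlinks to the first file in each fdupes set.
# fdupes -1 prints each set on one line; sed turns every space that follows
# a word character into '|' and drops the trailing separator.
fdupes -r -1 -n "$@" | sed -e 's/\(\w\) /\1|/g' -e 's/|$//' > files.dup.list.txt
while read line; do
        IFS='|' read -a arr <<< "$line"
        orig="${arr[0]}"
        for ((i = 1; i < ${#arr[@]}; i++)); do
                file="${arr[$i]}"
                # -r (GNU ln) makes the target relative to the link's own directory,
                # so links still resolve when duplicates sit in different directories.
                ln -sfr "$orig" "$file"
        done
done < files.dup.list.txt
rm files.dup.list.txt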

duanyao commented 8 years ago

Duplicated background images should not be very common, though it is a waste when they do occur. I suggest you try --bg-format svg --svg-embed-bitmap 0 and see if there are still duplicated images.

It should not be hard to de-duplicate background images in pdf2htmlEX itself. It could compute a checksum of each generated background image and check whether an identical one already exists. Would you like to implement this feature?
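
For illustration only, the checksum idea can be sketched as a post-processing step in bash rather than inside pdf2htmlEX. This is an assumption-laden sketch, not the project's implementation: dedupe_by_checksum is a hypothetical helper name, it assumes bash 4+ (associative arrays) and GNU sha256sum, and it only looks at *.png files directly inside one directory.

```shell
#!/bin/bash
# Sketch: replace byte-identical PNG files in one directory with symlinks
# to the first copy seen. dedupe_by_checksum is a hypothetical helper,
# not part of pdf2htmlEX.
dedupe_by_checksum() {
    local dir="$1" file sum
    declare -A seen    # checksum -> basename of the first file with that content
    for file in "$dir"/*.png; do
        [ -f "$file" ] || continue    # also skips the literal pattern when nothing matches
        [ -L "$file" ] && continue    # leave existing symlinks alone
        sum=$(sha256sum < "$file" | cut -d' ' -f1)
        if [ -n "${seen[$sum]:-}" ]; then
            # Both files live in the same directory, so the bare basename
            # is a valid relative link target.
            ln -sf "${seen[$sum]}" "$file"
        else
            seen[$sum]=$(basename "$file")
        fi
    done
}
```

After sourcing the function, running dedupe_by_checksum . in the output directory would keep the first file of each duplicate group and turn the rest into symlinks. Doing the same inside pdf2htmlEX would presumably mean pointing several pages at one image file instead of creating links, which this sketch does not attempt.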