Detect duplicated pages by visually comparing them

fsinf / pdf-page-stripper

Strips useless pages from TU Wien PDFs

https://fsinf.github.io/pdf-page-stripper/

The Unlicense


Open stefnotch opened 1 year ago

stefnotch commented 1 year ago

Someone finally sent me some PDFs that have duplicated pages where the page metadata got lost.

Here, the best way of identifying duplicates would probably be:

Comparing text (the easy one, methinks)
Comparing the rendered output, preferably checking whether a lot of pixels have become either darker (usual slides: white background, dark foreground) or lighter (dark theme slides). This is rather slow. (Use a library like https://github.com/mapbox/pixelmatch)
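
A rough sketch of both checks, to make the idea concrete. Everything here is an assumption for illustration: the function names (`sameText`, `countPixelChanges`, `changesAreOneDirectional`), the luminance tolerance of 16 and the 5% cutoff are made up, and the pages are assumed to be already extracted as text and rendered to same-sized RGBA buffers (for example canvas `ImageData.data`). A plain loop stands in for pixelmatch here because the check needs the direction of each change, not only a mismatch count.

```typescript
// Hypothetical helpers, not part of pdf-page-stripper.

/** Text check: two pages match if their text is equal after collapsing whitespace. */
function sameText(a: string, b: string): boolean {
  const normalize = (s: string) => s.replace(/\s+/g, " ").trim();
  return normalize(a) === normalize(b);
}

interface ChangeCounts {
  darker: number;
  lighter: number;
  unchanged: number;
}

/** Approximate perceived brightness (0..255) of the RGBA pixel starting at index i. */
function luminance(data: Uint8ClampedArray, i: number): number {
  return 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2];
}

/**
 * Visual check: counts pixels that became darker or lighter from `prev` to `next`.
 * Both buffers must be RGBA and rendered at the same size.
 * `tolerance` (assumed value) ignores small anti-aliasing differences.
 */
function countPixelChanges(
  prev: Uint8ClampedArray,
  next: Uint8ClampedArray,
  tolerance = 16
): ChangeCounts {
  if (prev.length !== next.length || prev.length % 4 !== 0) {
    throw new Error("Buffers must be RGBA and have the same size");
  }
  const counts: ChangeCounts = { darker: 0, lighter: 0, unchanged: 0 };
  for (let i = 0; i < prev.length; i += 4) {
    const delta = luminance(next, i) - luminance(prev, i);
    if (delta < -tolerance) counts.darker++;
    else if (delta > tolerance) counts.lighter++;
    else counts.unchanged++;
  }
  return counts;
}

/**
 * True if nearly all changed pixels moved in the same direction
 * (only darker, or only lighter). `maxMinorityShare` is an assumed cutoff.
 */
function changesAreOneDirectional(
  counts: ChangeCounts,
  maxMinorityShare = 0.05
): boolean {
  const changed = counts.darker + counts.lighter;
  if (changed === 0) return true; // visually identical pages
  return Math.min(counts.darker, counts.lighter) / changed <= maxMinorityShare;
}
```

If the changes between two consecutive pages go in one direction only, the earlier page is a candidate duplicate. The text check is cheap, so it could run first and the pixel comparison only when the text is inconclusive.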
