Someone finally sent me some PDFs that have duplicated pages where the page metadata got lost.
Here, the best way of identifying duplicates would probably be:
Comparing the text (the easy one, I think)
Comparing the visual output, preferably checking whether a lot of pixels have become either darker (usual slides: white background, dark foreground) or lighter (dark theme slides). This is rather slow. (Use a library like https://github.com/mapbox/pixelmatch )
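
To make the two checks a bit more concrete, here is a rough sketch. It assumes the page text has already been extracted and that both pages have been rendered to RGBA buffers of the same size. The function names (`sameText`, `brightnessDiff`, `looksLikeBuildStep`) and the tolerance value are made up for illustration. The pixel loop is hand-rolled rather than using pixelmatch, because what matters here is the direction of the change (darker vs. lighter). It also assumes that a "duplicate" is a later build step of the same slide, so pixels should only change in one direction.

```typescript
/** Collapse whitespace so layout-only differences in extracted text do not matter. */
function normalizeText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

/** True if two pages carry the same text after normalization. */
function sameText(a: string, b: string): boolean {
  return normalizeText(a) === normalizeText(b);
}

interface BrightnessDiff {
  darker: number;  // pixels that are clearly darker on the second page
  lighter: number; // pixels that are clearly lighter on the second page
  total: number;   // number of pixels compared
}

/** Approximate brightness (Rec. 601 luma) of the RGBA pixel starting at offset i. */
function luma(px: Uint8ClampedArray, i: number): number {
  return 0.299 * px[i] + 0.587 * px[i + 1] + 0.114 * px[i + 2];
}

/**
 * Count pixels that became darker or lighter going from page a to page b.
 * `tolerance` (an arbitrary default) ignores small anti-aliasing noise.
 */
function brightnessDiff(
  a: Uint8ClampedArray,
  b: Uint8ClampedArray,
  tolerance = 16
): BrightnessDiff {
  if (a.length !== b.length) {
    throw new Error("pages must be rendered at the same size");
  }
  const diff: BrightnessDiff = { darker: 0, lighter: 0, total: a.length / 4 };
  for (let i = 0; i < a.length; i += 4) {
    const delta = luma(b, i) - luma(a, i);
    if (delta < -tolerance) diff.darker++;
    else if (delta > tolerance) diff.lighter++;
  }
  return diff;
}

/**
 * Assumption: a build step only adds content, so pixels change in one
 * direction only (darker on light slides, lighter on dark slides).
 */
function looksLikeBuildStep(diff: BrightnessDiff): boolean {
  const changed = diff.darker + diff.lighter;
  return changed > 0 && (diff.darker === 0 || diff.lighter === 0);
}
```

The text check is cheap, so it could run first, with the pixel comparison reserved for pages where the text is inconclusive. The one-direction rule is only a heuristic and would need tuning against real PDFs.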