Diff PDFs by converting them to HTML and diffing that

edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.

https://web-monitoring-diff.readthedocs.io/

GNU General Public License v3.0


Diff PDFs by converting them to HTML and diffing that #9

Open Mr0grog opened 6 years ago

Mr0grog commented 6 years ago

This shouldn’t necessarily replace a differ for actual PDF content, but it could be a lot more useful when it works well: instead of trying to diff two PDF files directly, convert each PDF to HTML (there are at least a few open-source libraries for this) and feed the resulting HTML through the HTML differ.
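A rough sketch of the pipeline, just to make the idea concrete. Everything here is an assumption rather than anything in this repo: `pdf_to_html` is a hypothetical callable standing in for whichever PDF-to-HTML library gets picked, `fake_pdf_to_html` is a toy that treats the bytes as plain text so the example runs without one, and `difflib` from the standard library is a stand-in for the real HTML differ.

```python
import difflib
import html
from html.parser import HTMLParser
from typing import Callable, List


class _TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document, in order."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_lines(markup: str) -> List[str]:
    """Reduce HTML to a list of text chunks that a line differ can compare."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.chunks


def diff_pdfs_via_html(
    old_pdf: bytes,
    new_pdf: bytes,
    pdf_to_html: Callable[[bytes], str],
) -> List[str]:
    """Convert both PDFs to HTML, then diff the HTML instead of the PDFs.

    `pdf_to_html` is hypothetical: plug in a real converter here.
    The unified diff is a placeholder for the project's HTML differ.
    """
    old_lines = html_to_lines(pdf_to_html(old_pdf))
    new_lines = html_to_lines(pdf_to_html(new_pdf))
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="old.pdf", tofile="new.pdf", lineterm=""
        )
    )


def fake_pdf_to_html(pdf_bytes: bytes) -> str:
    """Toy converter for the demo only: treats the bytes as UTF-8 text
    and wraps each line in a <p>. A real PDF needs a real converter."""
    lines = pdf_bytes.decode("utf-8").splitlines()
    return "".join(f"<p>{html.escape(line)}</p>" for line in lines)


if __name__ == "__main__":
    for line in diff_pdfs_via_html(b"Hello\nWorld", b"Hello\nThere", fake_pdf_to_html):
        print(line)
```

Passing the converter in as a parameter keeps the diffing step independent of the conversion library, so different converters could be tried without touching the rest.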

Not sure what the right name for this is.

Lizz brought this up in Slack and, though I remember having a short discussion about the idea before, I can’t find anywhere we’ve written it down, hence this issue.

Mr0grog commented 6 years ago

Potentially useful article on PDF text extraction in Python I ran across today: https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog commented 5 years ago

Definitely still a relevant idea.