edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0

Add differ for word docs? #7

Open Mr0grog opened 6 years ago

Mr0grog commented 6 years ago

We don’t have a lot of Word docs in our DB, but there are a few and Analysts have noted that they are a pain. That said, we aren’t any worse than the existing tool (Versionista), plus we can do edgi-govdata-archiving/web-monitoring-ui#186, so this isn’t a high priority.

I don’t know if there are any great Linux tools out there for rendering a .doc file, but there are certainly a few libraries that can handle .docx, like Mammoth: https://github.com/mwilliamson/python-mammoth, which can convert to HTML, Markdown, or plain text, any of which we could then diff with existing algorithms.
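
As a rough sketch of the convert-then-diff idea, assuming Mammoth is installed: `mammoth.extract_raw_text` is Mammoth's plain-text extraction call, while `docx_to_text` and `diff_text` are hypothetical helper names, and `difflib` only stands in for whichever of our existing diff algorithms we would actually plug in.

```python
import difflib


def docx_to_text(path):
    """Extract plain text from a .docx file using Mammoth."""
    # Imported lazily so diff_text() below still works without Mammoth installed.
    import mammoth

    with open(path, "rb") as docx_file:
        result = mammoth.extract_raw_text(docx_file)
    # result.messages holds any conversion warnings; this sketch ignores them.
    return result.value


def diff_text(old_text, new_text):
    """Return unified-diff lines between two plain-text versions."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    ))
```

Usage would be something like `diff_text(docx_to_text("a.docx"), docx_to_text("b.docx"))`, where the file names are placeholders. Going through plain text drops formatting, so a version that converts to HTML first would probably give analysts a more readable result.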

We could also use a service like Zamzar to convert, then diff.

danielballan commented 6 years ago

I would guess that handling .doc sufficiently to get a readable diff is a chore beyond our current capacity, but .docx -> HTML seems easy enough to add.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog commented 5 years ago

This is more of a long-term idea. Would be great to have someone jump in and take a cut at it.

Mr0grog commented 5 years ago

Keeping this open as a call for contributions. We probably don’t have the capacity for this right now, but if you’re interested in diffing and would like to take a shot at writing a function that can diff .docx files, we’d love to integrate it!
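
For anyone who wants a starting point, here is a minimal sketch of such a function using only the Python standard library. It relies on the fact that a .docx file is a zip archive whose main body lives in `word/document.xml`. The names `docx_paragraphs` and `diff_docx` are hypothetical, the output is a plain unified diff rather than anything matching this project's existing differ interface, and it ignores formatting, headers, footers, and tracked changes.

```python
import difflib
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs, so join every w:t element.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def diff_docx(old_source, new_source):
    """Return unified-diff lines comparing two .docx files by paragraph."""
    old = docx_paragraphs(old_source)
    new = docx_paragraphs(new_source)
    return list(difflib.unified_diff(
        old, new, fromfile="old", tofile="new", lineterm=""
    ))
```

Diffing paragraph by paragraph keeps the output readable when a single sentence changes, at the cost of missing changes that only affect styling.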