Open Mr0grog opened 7 years ago
Here’s an example of a small, hard to see change: Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/2d2ccc52-f467-4775-a034-bea5271c0b9f Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11512540.pdf Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11522529.pdf
Here’s an interesting graphic page with changes: Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/c0307603-0bae-4a6c-bf12-52cc6482b0bc Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-9608983.pdf Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-11239564.pdf
Here’s one that’s just hard to scan by eye because it’s mostly reams of data: Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/3edef8ea-de3f-4771-89f2-92840dad026b Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-9920428.pdf Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-10713675.pdf
And another: Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/563b013c-883f-4099-8c98-ce6059a0b823 Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11023958.pdf Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11255938.pdf
I'm looking into this; see how I get on!
@neiljp Awesome, thanks so much!
I have the second library working for all 4 examples you listed. I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out. Would it be helpful to show these images somewhere?
Sure! Go ahead and post them here. If you have this work in a repo, go ahead and link it, too.
Are you on the Archivers Slack group? There’s more “live” conversation there and workflow, process, etc.
I'm generally not on Slack; is there an IRC mirror somewhere?
These are the results I have for the 4 tests, with the caveats as above:
These are wonderful. :thumbsup: 🎉
Unfortunately, I don’t think there is any mirror of the Slack :\
> I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out.
No worries. I should have been clearer that this doesn’t have to be perfect. Even if there are false positives, being able to identify space people can definitely ignore is a big deal. This is super, super helpful.
Hey @neiljp this looks great, thx. Great to have new people stepping in!
We have been talking about an IRC bridge for a while but haven't set one up - doh!
@neiljp I’m headed out for the night, but will be back on tomorrow at 9-ish Pacific Time if you are planning to do more work on it. I will also try and sign into the global sprint Gitter.im if you are using that (I did not do a good job of paying attention to it today, sorry).
Looking forward to getting this integrated as a running service!
@Mr0grog I'm back and on the gitter chat now. Re chat: I'm currently on IRC (freenode, oftc), matrix.org and also experimenting with zulip (after some pycon sprints). While I'm moving on with looking into this, were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?
> were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?
No—diffing PDFs is something that we simply haven't had time to get to at all yet.
In general, we haven’t found any great diffing services that either we can deploy feasibly or third party ones that we can integrate with and easily display the diff results in our own UI alongside forms and other visualizations for analysts.
Progress today has my flask implementation (locally) working with the library and generating a png in the browser; how would you deploy that? I could try and deploy to a server I have access to, in theory.
We don’t have a great deploy process for anything that’s not Heroku yet—it’s very ad-hoc on Amazon EC2. If you can deploy to a server you manage and document the process, that’d be great.
Apparently flask works on heroku; the trick might be installing the other module(s), including one that I built as binary, though might not strictly need to be.
Ah, yeah, binaries can be complicated on Heroku. You have to create a “buildpack”: https://devcenter.heroku.com/articles/buildpacks
@neiljp Did you get anywhere on this? If not, do you mind posting what code you’ve got somewhere so others can help on this? Thanks!
@Mr0grog I didn't get any further than getting it to work locally in the end, but have submitted some PRs against the lib I used, and hope to document the process ASAP.
@neiljp Any updates on this?
@Mr0grog Apologies, I got swept up in contributing to Zulip after PyCon. I'm now getting back to this, though I note there is other progress?
@neiljp Yeah, we sorta have a more defined way to do this now. You can add your work as a module in the https://github.com/edgi-govdata-archiving/web-monitoring-processing repo, in the `web_monitoring` folder. There’s not much documentation on how the built-in diff server there works yet, but you can look at PR #59 in that repo. @danielballan can probably also help you out.
Hey, @neiljp, just checking in. Any updates or anything I can help with here?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
Well, this is still pretty critical. It would be lovely to get some help from someone on this, but it does need to get done.
Hey, if this issue is still alive, I would like to contribute.
Hey @cYph3r1337, that would be great. These days, all the diff-related code lives in the web-monitoring-processing repo in the `web_monitoring/diff` directory.
You can then make your differ accessible via HTTP by adding it to the server here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L30-L53 Basically, this just maps a part of the URL path to a function. The server will examine your argument names to figure out what to send it. More info on that here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L455-L465
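For anyone picking this up, here is a minimal sketch of the idea described above: a table maps a URL path segment to a function, and the function's argument names decide what gets passed in. It is not the actual server code. The names `DIFF_ROUTES`, `call_differ`, `pdf_diff`, `a_body`, and `b_body` are all hypothetical and chosen only for illustration, so check the linked `server.py` for the real names.

```python
import inspect


def pdf_diff(a_body, b_body):
    """Hypothetical differ: only reports whether the two byte strings differ."""
    return {"changed": a_body != b_body}


# Hypothetical routing table: URL path segment -> differ function.
DIFF_ROUTES = {"pdf_diff": pdf_diff}


def call_differ(path_segment, available):
    """Call the differ registered for `path_segment`.

    `available` holds everything the server could offer (an assumption for
    this sketch). Only the values whose keys match the differ's parameter
    names are passed along.
    """
    func = DIFF_ROUTES[path_segment]
    wanted = inspect.signature(func).parameters
    kwargs = {name: available[name] for name in wanted if name in available}
    return func(**kwargs)
```

Calling `call_differ("pdf_diff", {...})` with a dictionary that also contains unrelated keys still works, because anything the differ does not name in its signature is simply not passed.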
We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.
This should be a simple web service that takes two query arguments:
- `a`: A URL for the “before” version of the PDF
- `b`: A URL for the “after” version of the PDF

It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.
If you need it to function in a different way to be feasible, let’s talk about it! We can make other interfaces work so long as they can be accessible as a web service.
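As a rough illustration of the interface described above, the following sketch parses a request URL, requires the `a` and `b` query arguments, and collects anything else as optional extras. It only covers argument handling, not fetching or diffing the PDFs. The function name `parse_diff_request`, the `/pdf_diff` path, and the `format` argument in the usage note are assumptions made for this example, not part of any existing service.

```python
from urllib.parse import parse_qs, urlparse


def parse_diff_request(request_url):
    """Extract the `a` and `b` PDF URLs plus any extra query arguments.

    Returns a tuple (a_url, b_url, extras). Raises ValueError if either
    required argument is missing or empty.
    """
    query = parse_qs(urlparse(request_url).query)
    missing = [name for name in ("a", "b") if name not in query]
    if missing:
        raise ValueError("missing required query argument(s): " + ", ".join(missing))
    # Keep the first value of every additional argument.
    extras = {key: values[0] for key, values in query.items() if key not in ("a", "b")}
    return query["a"][0], query["b"][0], extras
```

For example, a request to `/pdf_diff?a=<url-encoded before URL>&b=<url-encoded after URL>&format=png` would yield the two decoded URLs and `{"format": "png"}` as extras, which a real service could use to choose its output type.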
Some open source libraries for diffing PDFs that might be useful: