edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Create a service to diff PDF files #36

Open Mr0grog opened 7 years ago

Mr0grog commented 7 years ago

We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.

This should be a simple web service that takes two query arguments:

It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.

If you need it to function in a different way to be feasible, let’s talk about it! We can make other interfaces work so long as they can be accessible as a web service.
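For anyone picking this up, here is a minimal sketch of the request handling described above, using only the Python standard library. The parameter names `a` and `b`, the helpers `handle_diff_request` and `make_app`, and the injected `fetch` and `differ` callables are all assumptions made for illustration. None of them come from this issue, and the actual PDF comparison is left to whatever library ends up being used.

```python
from urllib.parse import parse_qs


def handle_diff_request(query_string, fetch, differ):
    """Return (status, content_type, body) for one diff request.

    `fetch` maps a URL to bytes and `differ` maps two byte strings to a
    response body; both are injected so the sketch stays library-neutral.
    """
    params = parse_qs(query_string)
    # "a" and "b" are hypothetical parameter names, not taken from the issue.
    missing = [name for name in ("a", "b") if name not in params]
    if missing:
        message = "missing query argument(s): " + ", ".join(missing)
        return "400 Bad Request", "text/plain", message.encode("utf-8")
    old_pdf = fetch(params["a"][0])
    new_pdf = fetch(params["b"][0])
    # The content type is an assumption; the response could equally be
    # HTML or a PDF, as long as a browser can render it.
    return "200 OK", "image/png", differ(old_pdf, new_pdf)


def make_app(fetch, differ):
    """Wrap handle_diff_request as a WSGI application."""
    def app(environ, start_response):
        status, content_type, body = handle_diff_request(
            environ.get("QUERY_STRING", ""), fetch, differ)
        start_response(status, [("Content-Type", content_type),
                                ("Content-Length", str(len(body)))])
        return [body]
    return app
```

Keeping the fetching and diffing behind plain callables means the same handler could sit behind Flask, plain WSGI, or another framework, and it can be exercised without network access.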

Some open source libraries for diffing PDFs that might be useful:

Mr0grog commented 7 years ago

Here’s an example of a small, hard to see change:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/2d2ccc52-f467-4775-a034-bea5271c0b9f
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11512540.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11522529.pdf

Here’s an interesting graphic page with changes:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/c0307603-0bae-4a6c-bf12-52cc6482b0bc
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-9608983.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-11239564.pdf

Here’s one that’s just hard to scan by eye because it’s mostly reams of data:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/3edef8ea-de3f-4771-89f2-92840dad026b
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-9920428.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-10713675.pdf

And another:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/563b013c-883f-4099-8c98-ce6059a0b823
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11023958.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11255938.pdf

neiljp commented 7 years ago

I'm looking into this; see how I get on!

Mr0grog commented 7 years ago

@neiljp Awesome, thanks so much!

neiljp commented 7 years ago

I have the second library working for all 4 examples you listed. I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out. Would it be helpful to show these images somewhere?

Mr0grog commented 7 years ago

Sure! Go ahead and post them here. If you have this work in a repo, go ahead and link it, too.

Mr0grog commented 7 years ago

Are you on the Archivers Slack group? There’s more “live” conversation there about workflow, process, etc.

neiljp commented 7 years ago

I'm generally not on Slack; is there an IRC mirror somewhere?

neiljp commented 7 years ago

These are the results I have for the 4 tests, with the caveats as above: 1 2 3 4

Mr0grog commented 7 years ago

These are wonderful. :thumbsup: 🎉

Unfortunately, I don’t think there is any mirror of the Slack :\

Mr0grog commented 7 years ago

> I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out.

No worries. I should have been clearer that this doesn’t have to be perfect. Even if there are false positives, being able to identify space people can definitely ignore is a big deal. This is super, super helpful.

titaniumbones commented 7 years ago

Hey @neiljp this looks great, thx. Great to have new people stepping in!

We have been talking about an IRC bridge for a while but haven't set one up - doh!

Mr0grog commented 7 years ago

@neiljp I’m headed out for the night, but will be back on tomorrow at 9-ish Pacific Time if you are planning to do more work on it. I will also try and sign into the global sprint Gitter.im if you are using that (I did not do a good job of paying attention to it today, sorry).

Looking forward to getting this integrated as a running service!

neiljp commented 7 years ago

@Mr0grog I'm back and on the gitter chat now. Re chat: I'm currently on IRC (freenode, oftc), matrix.org and also experimenting with zulip (after some pycon sprints). While I'm moving on with looking into this, were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?

Mr0grog commented 7 years ago

> were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?

No—diffing PDFs is something that we simply haven't had time to get to at all yet.

In general, we haven’t found any great diffing services, either ones we can feasibly deploy ourselves or third-party ones we can integrate with, whose diff results we can easily display in our own UI alongside forms and other visualizations for analysts.

neiljp commented 7 years ago

Progress today has my flask implementation (locally) working with the library and generating a png in the browser; how would you deploy that? I could try and deploy to a server I have access to, in theory.

Mr0grog commented 7 years ago

We don’t have a great deploy process for anything that’s not Heroku yet—it’s very ad-hoc on Amazon EC2. If you can deploy to a server you manage and document the process, that’d be great.

neiljp commented 7 years ago

Apparently Flask works on Heroku; the trick might be installing the other module(s), including one that I built as a binary, though it might not strictly need to be.

Mr0grog commented 7 years ago

Ah, yeah, binaries can be complicated on Heroku. You have to create a “buildpack”: https://devcenter.heroku.com/articles/buildpacks

Mr0grog commented 7 years ago

@neiljp Did you get anywhere on this? If not, do you mind posting what code you’ve got somewhere so others can help on this? Thanks!

neiljp commented 7 years ago

@Mr0grog I didn't get any further than getting it to work locally in the end, but have submitted some PRs against the lib I used, and hope to document the process ASAP.

Mr0grog commented 7 years ago

@neiljp Any updates on this?

neiljp commented 7 years ago

@Mr0grog Apologies, I got swept up in contributing to Zulip after PyCon. I'm now getting back to this, though I note there is other progress?

Mr0grog commented 7 years ago

@neiljp Yeah, we sorta have a more defined way to do this now. You can add your work as a module in the https://github.com/edgi-govdata-archiving/web-monitoring-processing repo, in the web_monitoring folder. There’s not much documentation on how the built-in diff server there works yet, but you can look at PR #59 in that repo. @danielballan can probably also help you out.

Mr0grog commented 7 years ago

Hey, @neiljp, just checking in. Any updates or anything I can help with here?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog commented 5 years ago

Well, this is still pretty critical. It would be lovely to get some help from someone on this, but it does need to get done.

0xrishabh commented 4 years ago

Hey, if this issue is still alive, I would like to contribute.

Mr0grog commented 4 years ago

Hey @cYph3r1337, that would be great. These days, all the diff-related code lives in the web-monitoring-processing repo in the web_monitoring/diff directory.

You can then make your differ accessible via HTTP by adding it to the server here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L30-L53

Basically, this just maps a part of the URL path to a function. The server will examine your argument names to figure out what to send it. More info on that here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L455-L465
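To make the "examine your argument names" idea concrete, here is a rough sketch of how dispatch by parameter name can work in Python. It is not the code in `server.py`: the `ROUTES` table, the two toy differs, the `call_differ` helper, and the argument names (`a_body`, `b_body`, `a_hash`, `b_hash`) are all hypothetical stand-ins chosen for illustration. Check the linked lines for what the real server actually provides.

```python
import inspect


def length_change(a_body, b_body):
    """Toy differ: report how many bytes were added or removed."""
    return {"diff": len(b_body) - len(a_body)}


def identical(a_hash, b_hash):
    """Toy differ: needs only hashes, so it never sees the bodies."""
    return {"diff": a_hash != b_hash}


# Hypothetical routing table: URL path segment -> differ function.
ROUTES = {
    "length": length_change,
    "identical": identical,
}


def call_differ(path_segment, available):
    """Look up the differ for a path and pass only the arguments it names."""
    func = ROUTES[path_segment]
    # The parameter names in the function signature decide what gets sent.
    wanted = inspect.signature(func).parameters
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError(f"no value available for: {', '.join(missing)}")
    return func(**{name: available[name] for name in wanted})
```

With this pattern, a new differ only has to declare the inputs it wants as parameter names and register itself under a path; it never deals with HTTP directly.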