Integrate web monitoring diff database efforts

dcwalk commented 7 years ago

From @ambergman on February 24, 2017 9:46

Wanted to summarize my thoughts here after a great conversation with @danielballan last night and hearing about the great work he and @Mr0grog are doing to coordinate their efforts. Apologies to @Mr0grog, @danielballan, and others if this issue frustrates work at all - happy to take it down and let you all lead:

The conversation about building EDGI's web monitoring software, and a diff database in particular, has been framed by @titaniumbones, myself, and others as a migration from using Versionista to using PageFreezer's snapshots. I wanted to suggest that that may have been a mistake and that, instead of a migration, we'd actually like to integrate our two sources and build a diff database that can store data coming from Versionista, PageFreezer, and any other credible source. This will be important in the short term, as we have snapshot history going further back and at a higher frequency with many of the pages we're watching with Versionista - so we don't want to lose that information after we start using PageFreezer's snapshots. As conversations with the Internet Archive progress, we'd definitely also like to make sure all of the material from the Wayback Machine can be read into our DB as well.

Because Versionista and PageFreezer output different data in different formats, reading in data from the two sources will require different interfaces. And so it's great that the two interfaces are being developed separately in different apps for now - see @Mr0grog's repo for the Versionista app here - and it's great that Dan and Rob are working together to determine how to combine their efforts. In both cases the HTML data taken in can be converted to a series of diffs, and those diffs can be stored in one big database - with one additional column to denote where the material used to produce the diff came from. Down the road, we can even decide to store diffs made from two HTML snapshots from two different sources - but I think we can save that for later, perhaps if we've loaded everything into one snapshot database at IA at some point.
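To make the "one additional column" idea concrete, here is a minimal sketch, not a proposed schema. It assumes an in-memory SQLite database as a stand-in for whatever database ends up being used, and the table name `diffs`, its columns, and the helpers `make_diff`, `create_store`, and `store_diff` are all hypothetical names chosen for illustration.

```python
import difflib
import sqlite3


def make_diff(old_html: str, new_html: str) -> str:
    """Return a unified diff between two HTML snapshots."""
    return "\n".join(
        difflib.unified_diff(
            old_html.splitlines(),
            new_html.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


def create_store(conn: sqlite3.Connection) -> None:
    """Create a single diffs table (hypothetical layout)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS diffs (
            id          INTEGER PRIMARY KEY,
            page_url    TEXT NOT NULL,
            source      TEXT NOT NULL,  -- e.g. 'versionista' or 'pagefreezer'
            captured_at TEXT NOT NULL,
            diff_text   TEXT NOT NULL
        )
        """
    )


def store_diff(conn, page_url, source, captured_at, old_html, new_html):
    """Compute a diff and record which source the snapshots came from."""
    conn.execute(
        "INSERT INTO diffs (page_url, source, captured_at, diff_text) "
        "VALUES (?, ?, ?, ?)",
        (page_url, source, captured_at, make_diff(old_html, new_html)),
    )


conn = sqlite3.connect(":memory:")
create_store(conn)
store_diff(conn, "https://example.gov/a", "versionista",
           "2017-02-01T00:00:00Z", "<p>old</p>", "<p>new</p>")
store_diff(conn, "https://example.gov/a", "pagefreezer",
           "2017-02-02T00:00:00Z", "<p>new</p>", "<p>newer</p>")
```

With a layout like this, each source-specific interface only needs to agree on the row format, and anything reading the table can filter or group by `source`.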

So, in short, I think it would be great to think about how to integrate the Versionista and PageFreezer diff databases, not just migrate between them. I know I haven't been specific about interfaces at all here, so I'm sure this wasn't all that helpful in terms of considering what schema to actually use to integrate the two sources - but that's probably the topic of a series of other issues. Let me know what you all think.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#29

dcwalk commented 7 years ago

From @danielballan on February 24, 2017 15:52

Thinking on this since our conversation last night, I've become convinced that @Mr0grog and I should integrate our efforts starting now, and I have a proposal for how that could work. We need one working prototype that speaks Versionista and PageFreezer and serves as a target for both fly-by contributions at future events and more sustained maintenance from the community. Here's my pitch:

1. Implement the web server in Rails, starting from Rob's work here.
2. Refactor the UI work in this branch on top of the Rails server, carrying over ideas from @allanpichardo's Express-based server where we can, but ultimately abandoning that server implementation. (Thanks for your work, @allanpichardo! It helped us understand the problem better.)
3. Run a separate service that communicates with the web server via SQL databases. This component, implemented in Python, would be responsible for issuing PageFreezer queries, doing data processing tasks on them, and building up a queue of diffs to be served by the Rails web server. Starting from my working rough sketch, we can incorporate @stuartlynn's Python module, @lh00000000's experiment with the newspaper module, and planned filtering/prioritization work.

In my opinion, this structure would employ the best tool for each aspect of our task and engage as many interested contributors as possible. And by using databases for communication between the web server and the data processing service, we can easily plug in additional services (from the Java world, for example) or rethink one piece without rewriting the other.
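As a rough illustration of using a database as the only point of contact between the two components, the sketch below has a processing service add prioritized diffs to a table that a web server could read from. It is an assumption-laden sketch, not the actual design: SQLite stands in for the shared SQL database, and the table `diff_queue`, its columns, and the functions `enqueue_diff`, `next_unreviewed`, and `mark_reviewed` are hypothetical.

```python
import sqlite3

# Hypothetical queue table shared by the processing service and the web server.
SCHEMA = """
CREATE TABLE IF NOT EXISTS diff_queue (
    id        INTEGER PRIMARY KEY,
    page_url  TEXT NOT NULL,
    diff_text TEXT NOT NULL,
    priority  REAL NOT NULL DEFAULT 0,
    reviewed  INTEGER NOT NULL DEFAULT 0
)
"""


def enqueue_diff(conn, page_url, diff_text, priority):
    """Processing side: add a diff to the queue with a priority score."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO diff_queue (page_url, diff_text, priority) "
            "VALUES (?, ?, ?)",
            (page_url, diff_text, priority),
        )


def next_unreviewed(conn):
    """Web server side: fetch the highest-priority diff not yet reviewed."""
    return conn.execute(
        "SELECT id, page_url, diff_text FROM diff_queue "
        "WHERE reviewed = 0 ORDER BY priority DESC, id ASC LIMIT 1"
    ).fetchone()


def mark_reviewed(conn, diff_id):
    """Web server side: record that an analyst has looked at this diff."""
    with conn:
        conn.execute(
            "UPDATE diff_queue SET reviewed = 1 WHERE id = ?", (diff_id,)
        )


queue = sqlite3.connect(":memory:")
queue.execute(SCHEMA)
enqueue_diff(queue, "https://example.gov/minor", "-a\n+b", priority=0.2)
enqueue_diff(queue, "https://example.gov/major", "-x\n+y", priority=0.9)
first = next_unreviewed(queue)
```

Because neither side calls the other directly, the queries in `next_unreviewed` and `mark_reviewed` could just as well be issued from Rails or any other language that can reach the same database.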

dcwalk commented 7 years ago

From @Mr0grog on February 24, 2017 18:28

> I've become convinced that @Mr0grog and I should integrate our efforts starting now

Ha, and here I was becoming more certain of the opposite after our discussion, @danielballan!

Some thoughts: I totally agree that these things should ultimately be integrated, but because handling the two sources carries very different technical needs (and potentially also differing characteristics affecting their ideal end use), I think it’s good that they develop separately for now. We should at least get a better handle on each source before trying to reconcile them. I don’t feel 99+% confident that I’ve totally got the right stuff from Versionista yet, and managing that issue while also making sure we are comfortably storing and processing PageFreezer data in the same project makes identifying and solving those issues harder (e.g. as a result of my process gathering changes at a much more granular time interval than the current, manual script, I discovered some date processing issues yesterday).

That’s not to say the continuing development of each shouldn’t be informed by what’s being done for the other. Staying aware of what the other is doing will make that ultimate integration much easier.

> we have snapshot history going further back and at a higher frequency with many of the pages we're watching with Versionista - so we don't want to lose that information after we start using PageFreezer's snapshots.

I’m generally down with this idea (for sure with keeping the history of what we currently have from Versionista), but I’m not supremely confident that continuing to use Versionista to get new content for the foreseeable future is great. That’s mainly because of how we currently scrape it instead of having an API, though. Like all scrapers, ours is guaranteed to break—repeatedly—over time. My experience with that situation in volunteer projects is that the scraper also tends to break with increasing frequency over time: often the original author isn’t there to fix the break, so someone else does in a slightly kludgy and less reliable way because they aren’t familiar with the codebase. Later, it’s a third (or fourth or fifth) person doing the fixing and the code becomes a total, utter mess. (To be clear, this happens even when the fixers are amazing programmers; it’s a problem that stems from lack of long-term experience with a codebase and a defined architectural style, not from lack of expertise.)

If we could get Versionista to develop an API that at least suits our needs, of course, that would go a long way towards alleviating my concern here.

dcwalk commented 7 years ago

From @danielballan on February 24, 2017 19:40

OK, I think we're on about the same page. Your point about getting a handle on the respective APIs first is well taken.

By "integrate our efforts," I mean "define mostly non-overlapping scopes for the two components so combining them doesn't mean throwing away very much code." Each project can continue to develop semi-independently. But instead of planning a transition where we deprecate the Versionista Rails app in favor of a PageFreezer Python app someday (as you mentioned in passing on Slack), I'm now thinking that the Rails app should be the permanent front end / web server. To start, the Python service could even present a Versionista-like API to PageFreezer data. (I doubt this is actually the way to go, because that conversion would be lossy, but it's an example of how we could orchestrate a smooth transition.)

So, going forward, I'm proposing to keep developing my PageFreezer request and processing code but not to build a web server or UI layer on top of that. Does that sound right to you, or do you think more extensive independent development is necessary?

@ambergman tells me there are many devs waiting in the wings for something to work on, so hashing out a more specific plausible roadmap, maybe over another call, would be useful.

dcwalk commented 7 years ago

From @Mr0grog on February 27, 2017 17:48

> By "integrate our efforts," I mean "define mostly non-overlapping scopes for the two components so combining them doesn't mean throwing away very much code."

Ah! Sorry, I misunderstood.

> the Python service could even present a Versionista-like API to PageFreezer data

Hmmmm, I like the idea of them operating similarly, but what we have for Versionista right now is pretty dodgy:

- It returns results in a format designed to be directly output as a human-readable CSV, which is not as organized or friendly as it could be to other code.
- The general idea it follows right now, which is a synchronous call to go get all the changes for a time period, is problematic, not least because this can and should be better accomplished as a series of parallel, asynchronous operations (see also https://github.com/edgi-govdata-archiving/webpage-versions-db/issues/3)
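For illustration only, here is one way the "series of parallel, asynchronous operations" could look, sketched in Python with `asyncio` even though the Versionista app itself is Rails. Everything here is assumed: `fetch_changes` is a hypothetical stand-in that sleeps instead of doing network I/O, and `fetch_all_changes`, the page IDs, and the concurrency limit are made up. It also returns structured records rather than CSV-shaped rows.

```python
import asyncio


async def fetch_changes(page_id: str, since: str) -> dict:
    """Hypothetical stand-in for fetching one page's changes.

    A real implementation would do network I/O here; this just sleeps.
    """
    await asyncio.sleep(0.01)
    return {"page_id": page_id, "since": since, "versions": []}


async def fetch_all_changes(page_ids, since, max_concurrent=5):
    """Fetch changes for many pages concurrently, with a cap on parallelism."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page_id):
        # At most max_concurrent fetches run at any one time.
        async with semaphore:
            return await fetch_changes(page_id, since)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(p) for p in page_ids))


results = asyncio.run(
    fetch_all_changes(["page-1", "page-2", "page-3"], since="2017-02-01")
)
```

The semaphore is there because a scraped site is unlikely to appreciate unbounded parallel requests; the right limit is something to find out in practice.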

TL;DR: totally! But I’m not even sure exactly how our “API” to Versionista should best be structured. There’s probably lots we can more easily settle on in terms of DB structure, though.

> So, going forward, I'm proposing to keep developing my PageFreezer request and processing code but not to build a web server or UI layer on top of that.

Sounds good to me for the moment, at least. Minor question: what kind of trigger do we have for receiving PageFreezer updates/notification that PageFreezer updates are available?

> hashing out a more specific plausible roadmap, maybe over another call, would be useful.

👍

dcwalk commented 7 years ago

From @danielballan on February 27, 2017 18:25

Great. Yeah, sorry for not expressing myself very clearly in the first pass. :- )

I think we're in agreement that aiming for compatibility at the database level is the way to go. Imitating a Versionista-like "API", even if it could be done, wouldn't be very helpful or worth the effort, so I think we can discard that idea.

> What kind of trigger do we have for receiving PageFreezer updates/notification that PageFreezer updates are available?

Currently, we receive periodic data dumps from PageFreezer on a file-sharing website. The data is organized in zip files, with XML labeling and timestamping their contents. (As you know, this is the raw HTML. The diffs still need to be computed through a separate request.) To start, I think an admin can manually tell the PageFreezer data processing backend, "There is a directory of new zip files at [some path]. Process it." As the backend requests, filters, and prioritizes diffs, it adds them to a Postgres DB that is also accessible to the Rails app.
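A minimal sketch of the manual "process this directory" step might look like the following. The real layout of the PageFreezer dumps is not described here, so the details are assumptions: each zip is assumed to contain a `manifest.xml` whose `<snapshot>` elements carry `url`, `timestamp`, and `file` attributes, and `read_snapshots` and `process_directory` are hypothetical names.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path


def read_snapshots(zip_path):
    """Yield (url, timestamp, html) for each snapshot listed in one archive.

    Assumes a hypothetical manifest.xml of the form:
      <snapshots>
        <snapshot url="..." timestamp="..." file="page.html"/>
      </snapshots>
    """
    with zipfile.ZipFile(zip_path) as archive:
        manifest = ET.fromstring(archive.read("manifest.xml"))
        for entry in manifest.iter("snapshot"):
            raw = archive.read(entry.get("file"))
            html = raw.decode("utf-8", errors="replace")
            yield entry.get("url"), entry.get("timestamp"), html


def process_directory(directory):
    """Collect snapshots from every zip file in a directory, in name order."""
    snapshots = []
    for zip_path in sorted(Path(directory).glob("*.zip")):
        snapshots.extend(read_snapshots(zip_path))
    return snapshots
```

An admin-facing command could call `process_directory("[some path]")` and hand the resulting snapshots to whatever requests, filters, and prioritizes the diffs.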

dcwalk commented 7 years ago

From @Mr0grog on February 27, 2017 19:33

> To start, I think an admin can manually tell the PageFreezer data processing backend, "There is a directory of new zip files at [some path]. Process it."

Totally. Just wondering if there’s a way (in the medium-term future) for the app to receive an e-mail or webhook notification so nobody has to be a manual button-pusher.

dcwalk commented 7 years ago

From @danielballan on February 27, 2017 22:11

I think that the way the data arrives will change once PF gets the Google storage set up. I expect we can work something out with them as soon as that's done.

dcwalk commented 7 years ago

Hey all -- I think this is mostly historical, and our new structure (ui | processing | db) counts as an "integration." I've added @ambergman's thoughts to the wiki of web monitoring projects past (https://github.com/edgi-govdata-archiving/web-monitoring/wiki) for, like, history, and am gonna go ahead and close this.

Please reopen if you feel like this conversation needs to continue here.

Mr0grog commented 7 years ago

Postscript: @danielballan and I talked a little Thursday morning about aligning DB schemas so we can handle PageFreezer and continue to support all the things we have/want to have that have evolved in my work so far on web-monitoring-db.

His changes for that can be seen in this PR: https://github.com/edgi-govdata-archiving/web-monitoring-processing/pull/22

And mine are ongoing in a branch named db-alignment in this PR: https://github.com/edgi-govdata-archiving/web-monitoring-db/pull/15

edgi-govdata-archiving / web-monitoring

Integrate web monitoring diff database efforts #13