edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Build PageFreezer-Outputter that fits into current Versionista workflow #9

Closed dcwalk closed 7 years ago

dcwalk commented 7 years ago

From @ambergman on February 10, 2017 8:05

To replicate the Versionista workflow with the new PageFreezer archives, we need a little module that takes as input a diff already returned (either by PageFreezer's server or another diff service we build), and simply outputs a row in a CSV, as our versionista-outputter already does. If a particular URL has not been altered, then the diff being returned as input to this module should be null, and no row should be output. Please see @danielballan's issue summarizing the Versionista workflow for reference.
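
Something like the following (a minimal sketch in TypeScript; the DiffResult shape, column set, and file path are illustrative assumptions, not the actual versionista-outputter interface):

```ts
import { appendFileSync } from "fs";

// Hypothetical shape of a diff result; the real fields would come from
// PageFreezer (or whatever diff service we end up using).
interface DiffResult {
  url: string;
  capturedAt: string; // timestamp of the newer snapshot
  diffText: string;   // raw diff output
}

// Append one CSV row per changed page; a null diff means "no change", so no row.
export function outputRow(diff: DiffResult | null, csvPath = "changes.csv"): void {
  if (diff === null) return;

  const quote = (s: string) => `"${s.replace(/"/g, '""')}"`;
  const row = [diff.url, diff.capturedAt, diff.diffText.slice(0, 200)]
    .map(quote)
    .join(",");
  appendFileSync(csvPath, row + "\n");
}
```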

I'll follow up soon with a list of the current columns being output to the CSV by the versionista-outputter, a short description of how the analysts use the CSV they're working with, and some screenshots of what everything looks like, for clarity.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#17

dcwalk commented 7 years ago

From @allanpichardo on February 10, 2017 15:07

From what I've observed with the PageFreezer API, taking a diff of 2 pages takes an average of 5 seconds. If there are ~30,000 pages to monitor, then PageFreezer is probably not the most appropriate diff service for this task. I think the main bottleneck in PageFreezer is that they transcode the diff information into HTML for every request. I have run similar diffs on my machine using git diff and it usually takes one second or less.

Here's what I suggest:

  1. We make a command line tool that creates git diffs of 2 pages and saves them to file (or a local database). This command line tool can take optional filters so we can remove unimportant parts of the page over time. The CLI tool could be set up as a cron task on a server and run daily to diff the 30,000 pages in the background (a rough sketch of the git diff call follows this list).

  2. If a significant difference is found, then the CLI creates the CSV entry for the analysts as per your description.

  3. Since the last diff has been stored as a file (or database row) the git visualizer can pull that and do the parsing at that particular time. Thus we incur the cost of transcoding to HTML only when it's necessary.
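
A rough sketch of step 1 (TypeScript on Node; the file paths are placeholders). `git diff --no-index` compares two files outside a repository and exits with status 1 when they differ, so the exit code is checked rather than treated as an error:

```ts
import { spawnSync } from "child_process";

// Diff two saved snapshots of a page without needing a git repository.
function diffSnapshots(oldFile: string, newFile: string): string | null {
  const result = spawnSync("git", ["diff", "--no-index", "--", oldFile, newFile], {
    encoding: "utf8",
  });
  if (result.status === 0) return null;           // identical: no row for the analysts
  if (result.status === 1) return result.stdout;  // textual diff
  throw new Error(`git diff failed: ${result.stderr}`);
}

// Example: a nightly cron job would call this for each of the ~30,000 pages.
const diff = diffSnapshots("archive/epa.gov/index.html", "live/epa.gov/index.html");
```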

dcwalk commented 7 years ago

From @titaniumbones on February 10, 2017 15:14

I love the idea. Question: what if the HTML (DOM) context is what tells us whether a diff is significant?


dcwalk commented 7 years ago

From @allanpichardo on February 10, 2017 15:29

@titaniumbones Yeah, I suspect that it will. If we do this with Node.js, then we have the option of using jQuery to parse the HTML archives. We can determine that certain HTML nodes are insignificant, such as <meta> tags in <head> or <link rel="stylesheet">, etc. Then, at the time an archive is loaded from disk, the CLI uses jQuery to delete such nodes from the text blob and then executes the diff.
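
That filtering step might look roughly like this, assuming a Node CLI and the cheerio package as the server-side stand-in for jQuery (the selector list is only a placeholder):

```ts
import * as cheerio from "cheerio";

// Selectors for nodes we consider insignificant; this list would grow over time
// as analysts flag more elements (placeholder values only).
const IGNORED_SELECTORS = ["script", "style", 'link[rel="stylesheet"]', "head meta"];

// Load an archived HTML blob, drop the ignored nodes, and return the cleaned
// markup that actually gets fed into the diff.
export function stripInsignificantNodes(html: string): string {
  const $ = cheerio.load(html);
  for (const selector of IGNORED_SELECTORS) {
    $(selector).remove();
  }
  return $.html();
}
```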

The diff would look like a regular git diff, but the visualizer would have some logic to convert the +/- syntax into <ins>/<del> markup, and we'd output that on screen.
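
For the visualizer side, something along these lines could do that conversion (a naive sketch: it works line by line on a unified diff and skips hunk headers; real page content would need more careful handling):

```ts
// Escape page content so raw markup from the diffed HTML isn't rendered as-is.
function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Convert git-diff "+" / "-" line syntax into <ins>/<del> markup for display.
export function diffToHtml(unifiedDiff: string): string {
  return unifiedDiff
    .split("\n")
    .map((line) => {
      if (line.startsWith("+++") || line.startsWith("---") || line.startsWith("@@")) return "";
      if (line.startsWith("+")) return `<ins>${escapeHtml(line.slice(1))}</ins>`;
      if (line.startsWith("-")) return `<del>${escapeHtml(line.slice(1))}</del>`;
      return `<span>${escapeHtml(line)}</span>`;
    })
    .filter(Boolean)
    .join("\n");
}
```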

When viewing the page in the visualizer, an analyst would have the option of selecting a DOM element and saying that it's insignificant, thus adding it to an ongoing list that would be fed back to the CLI on the next cycle.

dcwalk commented 7 years ago

From @ambergman on February 10, 2017 16:28

@allanpichardo Really disappointing to hear the PageFreezer API moves so slowly but, as you've described, it seems like we have plenty of other options (and, of course, we always knew we had git as a backup). Your 1-3 above sound really great, and I think it makes perfect sense, as you said in point 3, to only parse the diff in the visualizer when it's called back up by an analyst.

Regarding your last comment - I think it sounds great to have the option in the visualizer, maybe in some simple dropdown form, of marking a diff as insignificant. That'll mean everything lives just in that simple visualizer, and everything else will run in the CLI.

The only other thing to add, then, would be to perhaps have a couple of different visualization options: a "side-by-side" or "in-line" page view for changes, but then also a "changes only" view (very useful for huge pages with only a few changes). I'll write something about that in issue #19, the visualization issue, as well.

dcwalk commented 7 years ago

From @titaniumbones on February 10, 2017 17:00

I think this is great. In your opinion, are there pieces of this I should ask folks to work on in SF tomorrow?


dcwalk commented 7 years ago

From @allanpichardo on February 10, 2017 17:15

@ambergman Yes, here's what I see overall at a high level:

I understand that the 30,000 URLs are kept in the spreadsheet. Those are live URLs. Where are the previous versions archived? Are they held remotely on another server?

I ask because if it's possible to have a structure such that:

  1. There is a directory which holds the last known HTML of each page; each filename, for simplicity, could be the URL itself (plain text).

  2. Start a Node script that can open the spreadsheet and go line by line, comparing the stored HTML in the directory with the live version of the page, and update the spreadsheet accordingly for each page. (I say Node because JavaScript is so good for DOM traversal, but if there's a better idea for this, I'm all ears)

  3. (this part needs to be worked out) By some heuristic, determine if a diff is significant enough to keep, and store that diff as a text file in another directory.

  4. If the diff text file is readable from the web, that would be OK (preferable), otherwise, we would have to insert the path and some kind of ID into a database table. The visualization link can be something with the ID of the diff text.

  5. Insert that URL in the spreadsheet

  6. Have the visualizer recognize the IDs and pull the corresponding diff file and display it.

@titaniumbones If this architecture is something we can work with, then maybe the SF group can set up the directory structure, put some test files in it, and start a CLI utility that would read the files, use git to diff them against the live URLs, and save the diffs into another directory.
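
One very rough sketch of that loop, in case it helps the SF group get started (assumes Node 18+ for the built-in fetch, reuses the diffSnapshots and stripInsignificantNodes helpers sketched above; the path-to-URL mapping and the "significant" threshold are placeholder assumptions):

```ts
import { readdirSync, readFileSync, writeFileSync } from "fs";
import { join } from "path";

// Walk the archive directory, yielding every stored HTML file.
function* walkHtmlFiles(dir: string): Generator<string> {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) yield* walkHtmlFiles(full);
    else if (entry.name.endsWith(".html")) yield full;
  }
}

// Placeholder heuristic: keep a diff only if enough lines changed.
function isSignificant(diffText: string): boolean {
  const changed = diffText.split("\n").filter((l) => /^[+-][^+-]/.test(l)).length;
  return changed >= 3; // arbitrary threshold, to be worked out with analysts
}

async function run(archiveDir: string, diffDir: string): Promise<void> {
  for (const file of walkHtmlFiles(archiveDir)) {
    // Assumes a layout like archive/<domain>/<path>/index.html.
    const url = "https://" + file.slice(archiveDir.length + 1).replace(/\/index\.html$/, "/");
    const live = await (await fetch(url)).text();

    writeFileSync("/tmp/archived.html", stripInsignificantNodes(readFileSync(file, "utf8")));
    writeFileSync("/tmp/live.html", stripInsignificantNodes(live));

    const diff = diffSnapshots("/tmp/archived.html", "/tmp/live.html");
    if (diff && isSignificant(diff)) {
      writeFileSync(join(diffDir, encodeURIComponent(url) + ".diff"), diff);
      // ...and append a row to the analysts' CSV, as in the outputter sketch above.
    }
  }
}
```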

dcwalk commented 7 years ago

From @titaniumbones on February 10, 2017 17:40

Allan Pichardo notifications@github.com writes:

@ambergman Yes, here's what I see overall at a high level,

I understand that the 30,000 URLs are kept in the spreadsheet. Those are live URLs. Where are the previous versions archived? Are they held remotely on another server?

We don't know where the long-term storage will be. We have talked to a bunch of different sponsors about this and have not been able to get a firm commitment from anyone yet. For now they are stored on a cluster that's not entirely easy to access from the web. I've exposed the zipfile for the 30,000 (actually, for some reason there are 2 zipfiles, one large and one small, from the same day) & a couple of the domains here, over HTTP:

http://edgistorage.hackinghistory.ca/

You can download it yourself there, but the large zipfile is about 7 GB.

I've also unzipped the zipfile in http://edgistorage.hackinghistory.ca/storage, and you can see the funky directory structure there.

I think there's something about this in the docs in pagefreezer-cli, but I'm on a low-bandwidth connection in the airport and browsing is a little hard.

I ask because if it's possible to have a structure such that:

  1. There is a directory which holds the last known HTML of each page; each filename, for simplicity, could be the URL itself (plain text).

take a look at the zipfile structure. Probably we could do that but it might be a little frustrating.

  2. Start a Node script that can open the spreadsheet and go line by line, comparing the stored HTML in the directory with the live version of the page, and update the spreadsheet accordingly for each page. (I say Node because JavaScript is so good for DOM traversal, but if there's a better idea for this, I'm all ears)

yeah sounds great. I think Ruby & Python also have DOM-aware diff programs, again see the docs.

  3. (this part needs to be worked out) By some heuristic, determine if a diff is significant enough to keep, and store that diff as a text file in another directory.

^^ yup!!

  4. If the diff text file is readable from the web, that would be OK (preferable), otherwise, we would have to insert the path and some kind of ID into a database table. The visualization link can be something with the ID of the diff text.

Whatever solution we come up with, we will need to make all this stuff accessible to the web. So, let's take that as a given when we're building a test case.

  5. Insert that URL in the spreadsheet

  6. Have the visualizer recognize the IDs and pull the corresponding diff file and display it.

yup, sounds great.


dcwalk commented 7 years ago

From @allanpichardo on February 10, 2017 20:30

@titaniumbones the directory structure from the zip files will work because the URL is preserved in the file structure. It seems that the directories per domain mirror the exact directory structure of the remote site, so this is good for knowing what compares to what. The only issue is that the archives come with a lot of other files that we don't need, but that's OK; for this purpose, we can just traverse the tree, take the HTML files, and ignore the others.

So I suppose, wherever this service runs, it could download an archive zip, extract it, and run through it creating the diffs and updating the spreadsheet. When the process is done, it can delete the downloaded archive. Then rinse and repeat.
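
That outer cycle might look something like this (assumes the archive stays reachable over HTTP as above and that unzip is available on the box; the URL and paths are placeholders, and a 7 GB file would really want streaming rather than buffering in memory):

```ts
import { spawnSync } from "child_process";
import { writeFileSync, rmSync } from "fs";

// One full cycle: fetch the latest archive zip, unpack it, run the diff pass
// over the extracted tree, then delete everything so downloads don't pile up.
// "Rinse and repeat" = schedule this from cron.
async function processLatestArchive(archiveUrl: string): Promise<void> {
  const zipPath = "/tmp/archive.zip";
  const extractDir = "/tmp/archive";

  // Placeholder URL; this could just as easily be rsync/scp from the cluster.
  const body = Buffer.from(await (await fetch(archiveUrl)).arrayBuffer());
  writeFileSync(zipPath, body);

  spawnSync("unzip", ["-q", zipPath, "-d", extractDir]);

  await run(extractDir, "/var/data/diffs"); // the diff pass sketched earlier

  rmSync(zipPath);
  rmSync(extractDir, { recursive: true, force: true });
}
```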

dcwalk commented 7 years ago

From @lh00000000 on February 10, 2017 23:41

concerns

  1. how will this address sites with dynamically loaded (e.g. react js) content?
  2. i suspect pagefreezer isn't really "slow". my guess is that the main causes for latency are the network (nontrivial payloads sent to them), and that they might be running the pages in a headless browser and allowing a few seconds for clientside stuff to happen (in my experience, a lot of sites have low-priority / below-the-fold stuff that won't finish loading until 3 or more seconds later), and that they're equipped to handle a bit of (polite) multithreading. the issue of the 10K daily limit remains though.
  3. i've had bad experiences using jquery selectors to clean up html at scale. things like throwing out …