edgi-govdata-archiving / version-tracking-ui

ARCHIVED--Bookmarklet to modify UI for Versionista website monitoring
MIT License

show size of diff #1

Open titaniumbones opened 7 years ago

titaniumbones commented 7 years ago

add column to table with size of diff.

jpmckinney commented 7 years ago

How do you recommend calculating the size of the diff?

geppy commented 7 years ago

Are we storing full WARCs? Is the concern WRT impact on available storage capacity? I've not yet seen a system for deduplicated WARC storage+access, which I think might be required to make diff size useful.

titaniumbones commented 7 years ago

@geppy no, we're not storing anything ourselves. This is a temporary monitoring UI which will probably be used for another ~4 weeks (I hope not longer!). Data is stored by a private company in their db. It's not a great arrangement (we are crawling ourselves elsewhere, but no access yet).

@jpmckinney, if you go to the page view site -- the one we're skipping over w/ the bookmarklet -- the absolute size of each version is stated in a column. Was hoping we could just cheat & use their numbers.

titaniumbones commented 7 years ago

Otherwise I guess we'd have to grab those revisions ourselves & calculate manually, right? Seems slower + harder.

jpmckinney commented 7 years ago

@titaniumbones We can't get the actual size of the diff from those numbers. Those numbers are the total sizes of each version. If we subtract one from another, we just get the difference in size between the versions - we don't get the diff size.

Compare diffing two strings that are identical except that one has an extra letter (e.g. aaa and aaab) with diffing two strings of the same length in which every letter differs (e.g. abcde and 12345). Doing subtraction, you'd get a diff size of 1 for the first case and 0 for the second case - but clearly the diff in the second case should be bigger.

Since many/most changes are substitutions rather than additions/deletions, subtraction is not a good idea.
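
A minimal sketch of that comparison, in Python rather than in the bookmarklet itself. The function names `size_difference` and `diff_size` are hypothetical, and counting changed characters with the standard library's `difflib.SequenceMatcher` is only one possible definition of "diff size", not something Versionista reports:

```python
from difflib import SequenceMatcher


def size_difference(old: str, new: str) -> int:
    """What subtraction gives: the change in total length between versions."""
    return abs(len(new) - len(old))


def diff_size(old: str, new: str) -> int:
    """Assumed definition: number of characters inserted, deleted, or replaced."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For a replacement, count the longer of the two sides.
            changed += max(i2 - i1, j2 - j1)
    return changed


# One letter added: both measures agree.
print(size_difference("aaa", "aaab"), diff_size("aaa", "aaab"))        # 1 1

# Every letter substituted, same length: subtraction reports no change.
print(size_difference("abcde", "12345"), diff_size("abcde", "12345"))  # 0 5
```

Note that `diff_size` needs the full text of both versions, which the per-version size column does not provide.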

jpmckinney commented 7 years ago

As for calculating the diff ourselves, or extracting that metadata from GitHub, the bookmarklet can't do that, for the reasons in #2.

titaniumbones commented 7 years ago

Right, duh

titaniumbones commented 7 years ago

Yeah, that suggests that once the button is implemented we've gone about as far as we can without switching to a Chrome extension.

jpmckinney commented 7 years ago

Button isn't implemented yet (#4), but can still be done with a bookmarklet.