Open titaniumbones opened 7 years ago
How do you recommend calculating the size of the diff?
Are we storing full WARCs? Is the concern WRT impact on available storage capacity? I've not yet seen a system for deduplicated WARC storage+access, which I think might be required to make diff size useful.
@geppy no, we're not storing anything ourselves, this is a temporary monitoring UI which will probably be used for another ~4 weeks (I hope not longer!). Data is stored by a private company in their db; it's not a great arrangement (we are crawling ourselves elsewhere, but have no access yet).
@jpmckinney, if you go to the page view site -- the one we're skipping over w/ the bookmarklet -- the absolute size of each version is stated in a column. Was hoping we could just cheat & use their numbers.
Otherwise I guess we'd have to grab those revisions ourselves & calculate manually, right? Seems slower + harder.
@titaniumbones We can't get the actual size of the diff from those numbers. Those numbers are the total sizes of each version. If we subtract one from another, we just get the difference in size between the versions - we don't get the diff size.
Compare diffing two identical strings except that one string has an extra letter (e.g. `aaa` and `aaab`), versus where all letters are different but the two strings are the same length (e.g. `abcde` and `12345`). Doing subtraction, you'd get a diff size of 1 for the first case and 0 for the second case - but clearly the diff in the second case should be bigger.
Since many/most changes are substitutions rather than additions/deletions, subtraction is not a good idea.
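To make the two cases above concrete, here is a minimal Python sketch. It assumes one possible definition of "diff size" (characters deleted plus characters inserted, as reported by the standard library's `difflib.SequenceMatcher`); nobody in this thread has settled on a definition, and the function names `size_difference` and `diff_size` are made up for illustration.

```python
from difflib import SequenceMatcher


def size_difference(old: str, new: str) -> int:
    """What subtracting the per-version size column gives us."""
    return abs(len(new) - len(old))


def diff_size(old: str, new: str) -> int:
    """Assumed definition: characters removed from old plus characters added in new."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # i1:i2 is the span removed from old, j1:j2 the span added in new
            changed += (i2 - i1) + (j2 - j1)
    return changed


for old, new in [("aaa", "aaab"), ("abcde", "12345")]:
    print(old, new, size_difference(old, new), diff_size(old, new))
# aaa aaab 1 1
# abcde 12345 0 10
```

Subtraction reports 0 for the second pair even though every character changed, which is the problem described above.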
As for calculating the diff ourselves, or extracting that metadata from GitHub, the bookmarklet can't do that, for the reasons in #2.
Right, duh
On January 26, 2017 11:53:36 PM EST, James McKinney notifications@github.com wrote:
@titaniumbones We can't get the actual size of the diff from those numbers. Those numbers are the total sizes of each version. If we subtract one from another, we just get the difference in size between the versions - we don't get the diff size. Compare diffing two strings where one letter is added to one, versus where all letters are changed, but the strings are the same length. Doing subtraction, you'd get a diff size of 1 for the first case and 0 for the second case - but clearly the diff in the second case should be bigger. Since many diffs are substitutions rather than additions/deletions, subtraction is not a good idea.
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/edgi-govdata-archiving/version-tracking-ui/issues/1#issuecomment-275588799
Yeah, that suggests that if the button is implemented we've gone about as far as we can without switching to a Chrome extension.
Button isn't implemented yet (#4), but can still be done with a bookmarklet.
Add column to table with size of diff.