edgi-govdata-archiving / web-monitoring-ui

UI to enable analysts to quickly assess changes to monitored government websites
GNU General Public License v3.0

efficient diff-based storage for archives #4

Closed titaniumbones closed 7 years ago

titaniumbones commented 7 years ago

It's not immediately obvious how to turn our giant stores of files into diffs so that we can minimize file storage/transfer issues. At present we have lots of data duplication. Looking for a data engineer here, I think.
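One way to picture the idea of storing diffs instead of full files: keep the first snapshot whole and encode each later snapshot as copy/insert operations against it. The sketch below uses Python's standard `difflib`; the function names `make_delta` and `apply_delta` and the sample page lines are hypothetical, and nothing here is taken from the project's actual code.

```python
from difflib import SequenceMatcher


def make_delta(base_lines, new_lines):
    """Encode new_lines as copy/insert operations against base_lines."""
    delta = []
    matcher = SequenceMatcher(a=base_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: store only the index range into the base.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New or changed run: store the new lines themselves.
            delta.append(("insert", new_lines[j1:j2]))
        # "delete" needs no operation: that base range is simply skipped.
    return delta


def apply_delta(base_lines, delta):
    """Rebuild the newer snapshot from the base snapshot plus its delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


# Hypothetical snapshots of one monitored page, split into lines.
v1 = ["<h1>Climate</h1>", "<p>Data</p>", "<p>Footer</p>"]
v2 = ["<h1>Climate</h1>", "<p>Updated</p>", "<p>Footer</p>"]

delta = make_delta(v1, v2)
restored = apply_delta(v1, delta)
```

Only the changed lines are stored in the delta, so near-identical snapshots cost little extra space. The trade-off is that reading a late snapshot means replaying deltas, which is the access-cost question raised further down the thread.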

atesgoral commented 7 years ago

I'm no data storage expert, but a layered/union file system could potentially help.

https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/
https://en.wikipedia.org/wiki/Aufs
https://en.wikipedia.org/wiki/OverlayFS

titaniumbones commented 7 years ago

agreed. do you know what the performance penalty for using such a file system is?

lh00000000 commented 7 years ago

is the problem limited storage capacity? since it seems like diffs between a snapshot a long time ago and a snapshot a long time + a day ago won't be as relevant as diffs between the latest snapshot and the second-to-latest snapshot, maybe some kind of strategy that archives snapshots to cheaper places like aws glacier (or s3 "Infrequent Access") in an automated way might be an alternative? keeping full snapshots is expensive storage-wise, but it's easier to grok / try new data-transformation ideas on later.
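A rough sketch of what such an automated tiering rule could look like, assuming the two newest snapshots always stay in fast storage and older ones move to cheaper tiers by age. The function name `assign_tiers`, the tier labels, and the 30/180 day thresholds are all made up for illustration; no real storage API is called.

```python
from datetime import date


def assign_tiers(snapshot_dates, today, keep_hot=2,
                 ia_after_days=30, archive_after_days=180):
    """Map each snapshot date to a storage tier label (hypothetical policy)."""
    tiers = {}
    # Newest first, so rank 0 is the latest snapshot.
    for rank, snap in enumerate(sorted(snapshot_dates, reverse=True)):
        age_days = (today - snap).days
        if rank < keep_hot:
            # Latest and second-to-latest are the ones most often diffed.
            tiers[snap] = "standard"
        elif age_days >= archive_after_days:
            tiers[snap] = "archive"
        elif age_days >= ia_after_days:
            tiers[snap] = "infrequent_access"
        else:
            tiers[snap] = "standard"
    return tiers


# Hypothetical snapshot dates for one monitored page.
snapshots = [date(2016, 6, 1), date(2017, 1, 1),
             date(2017, 2, 8), date(2017, 2, 9)]
tiers = assign_tiers(snapshots, today=date(2017, 2, 10))
```

Full snapshots are kept in every tier, so later data-transformation experiments still have the raw pages to work from; only retrieval time and cost change.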

titaniumbones commented 7 years ago

also written on the plane and a little out of date, but, well, still posting:


@lh00000000 Sorry I missed this in a flood of notifications. Interesting idea. hmm. I'm wondering how common the situation is, in which an analyst will want to consult a long time series of diffs, and how expensive that would be to access.

I think time series data will be of interest to social scientists eventually, but for now maybe not so much. Maybe @trinberg or @ambergman have thoughts?

dcwalk commented 7 years ago

This issue was moved to edgi-govdata-archiving/web-monitoring#8