titaniumbones closed this issue 7 years ago
I'm no data storage expert, but a layered/union file system could potentially help.
https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/ https://en.wikipedia.org/wiki/Aufs https://en.wikipedia.org/wiki/OverlayFS
Agreed. Do you know what the performance penalty for using such a file system is?
On 02/02/2017 11:39 AM, Ates Goral wrote:
> I'm no data storage expert, but a layered/union file system could potentially help.
> https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/ https://en.wikipedia.org/wiki/Aufs https://en.wikipedia.org/wiki/OverlayFS
Is the problem limited storage capacity? Diffs between two adjacent snapshots from a long time ago probably won't be as relevant as diffs between the latest snapshot and the second-to-latest one, so maybe a strategy that automatically archives older snapshots to cheaper places like AWS Glacier (or S3 "Infrequent Access") could be an alternative? Keeping full snapshots is expensive storage-wise, but it makes the data easier to grok and makes it easier to try new data-transformation ideas later.
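To make the archiving idea concrete, here is a rough sketch of an age-based tiering plan. Everything in it is an assumption for illustration: the function name `plan_tiers`, the tier labels, and the thresholds are hypothetical, and no cloud API is called. "warm" could map to something like S3 "Infrequent Access" and "cold" to something like Glacier.

```python
from datetime import date, timedelta


def plan_tiers(snapshot_dates, today, keep_hot=2, warm_days=30):
    """Assign each snapshot date to a 'hot', 'warm', or 'cold' tier.

    Hypothetical policy: the newest `keep_hot` snapshots stay hot so that
    latest-vs-previous diffs are cheap; anything up to `warm_days` old is
    warm; everything older is cold.
    """
    ordered = sorted(snapshot_dates, reverse=True)  # newest first
    plan = {}
    for i, d in enumerate(ordered):
        if i < keep_hot:
            plan[d] = "hot"
        elif (today - d).days <= warm_days:
            plan[d] = "warm"
        else:
            plan[d] = "cold"
    return plan


# Example with made-up snapshot dates.
today = date(2017, 2, 10)
dates = [today - timedelta(days=n) for n in (0, 1, 7, 90)]
plan = plan_tiers(dates, today)
```

The actual move between storage classes would be a separate step (for example a storage lifecycle rule), which this sketch deliberately leaves out.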
Also written on the plane and a little out of date, but, well, still posting:
@lh00000000 Sorry I missed this in a flood of notifications. Interesting idea. Hmm, I'm wondering how often an analyst will want to consult a long time series of diffs, and how expensive that would be to access.
I think time series data will be of interest to social scientists eventually, but for now maybe not so much. Maybe @trinberg or @ambergman have thoughts?
This issue was moved to edgi-govdata-archiving/web-monitoring#8
It's not immediately obvious how to turn our giant stores of files into diffs so that we can minimize file storage/transfer issues. At present we have lots of data duplication. Looking for a data engineer here, I think.
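One possible starting point, sketched under assumptions rather than as a description of how the project stores anything: store each unique file body once, keyed by its hash, keep a per-snapshot manifest, and compute text diffs only for paths whose hash changed. The helper names (`content_key`, `store_snapshot`, `text_diff`) and the in-memory dicts are hypothetical stand-ins for a real file store.

```python
import difflib
import hashlib


def content_key(data: bytes) -> str:
    """Return a SHA-256 hex digest used to detect byte-identical files."""
    return hashlib.sha256(data).hexdigest()


def store_snapshot(snapshot: dict, blobs: dict) -> dict:
    """Store each unique file body once; return a manifest of path -> digest."""
    manifest = {}
    for path, data in snapshot.items():
        key = content_key(data)
        blobs.setdefault(key, data)  # identical content is stored only once
        manifest[path] = key
    return manifest


def text_diff(old: str, new: str, name: str = "page") -> str:
    """Return a unified diff between two versions of a text file."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{name}@old",
        tofile=f"{name}@new",
    )
    return "".join(lines)


# Two made-up daily snapshots in which only one page changed.
day1 = {"index.html": b"<h1>Climate</h1>\n", "about.html": b"<p>About</p>\n"}
day2 = {"index.html": b"<h1>Energy</h1>\n", "about.html": b"<p>About</p>\n"}

blobs = {}
m1 = store_snapshot(day1, blobs)
m2 = store_snapshot(day2, blobs)

# Only paths whose digest changed need a diff at all.
changed = sorted(p for p in m2 if m1.get(p) != m2[p])
diff = text_diff(day1["index.html"].decode(), day2["index.html"].decode(), "index.html")
```

With this layout the unchanged `about.html` is stored once for both days, and only `index.html` produces a diff. Binary files would need a different diffing approach than `difflib`.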