Kledsky / s3fuse

Automatically exported from code.google.com/p/s3fuse

Feature request: boundless hierarchical storage #10

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi there, this is tremendous software - thank you. This issue is a feature 
suggestion.

Would it be possible to support a fixed-size local cache that serves 
recently accessed files?

The objective would be to achieve boundless hierarchical storage, whereby files 
start on local disk, are backed up to S3, then migrate to Glacier, and are 
called back on demand if they have been dropped from local storage.

It may make sense to wait until a file remains unchanged for a user-selected 
period before it is uploaded.

I realise that multi-user scenarios would be problematic but this would still 
be valuable for single-user volumes. 

Thanks,
Alister.

Original issue reported on code.google.com by goo...@shortepic.com on 23 Jul 2013 at 1:52

GoogleCodeExporter commented 8 years ago
This is on the roadmap. However, I'm not sure I'm comfortable with the idea of 
delaying file upload until some time after the file's closed. Handling uploads 
asynchronously means that we can't notify applications/users of transfer 
failures. People using s3fuse for backup will find this less than ideal.

Original comment by tar...@bedeir.com on 23 Jul 2013 at 8:33

GoogleCodeExporter commented 8 years ago
The transfer failure scenario would also come up in hierarchical storage when 
the filesystem is mounted while the host is disconnected. Could the migration 
status be an xattr? 

The delay is meant to avoid situations where multiple small changes to a large 
file trigger a series of large uploads. Perhaps the delay could be a 
configuration item, defaulting to immediate upload if you prefer.
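
To make the delay idea concrete, here is a rough sketch (not s3fuse code) of 
how a sweep might pick files that have been quiet long enough. The function 
name, the `dirty` mapping, and the `quiet_seconds` setting are all 
hypothetical; a value of 0 would behave like immediate upload.

```python
import time


def files_due_for_upload(dirty, quiet_seconds, now=None):
    """Return the paths that have been unchanged for at least quiet_seconds.

    dirty maps path -> time of last modification (seconds since the epoch).
    quiet_seconds is the hypothetical configuration item; 0 means every
    dirty file is due immediately.
    """
    if now is None:
        now = time.time()
    # A file is due once the time since its last change reaches the delay.
    return sorted(
        path for path, mtime in dirty.items() if now - mtime >= quiet_seconds
    )
```

Each further write to a file would update its entry in `dirty`, so a file 
that keeps changing is not uploaded until the edits stop.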

Can I offer you a bounty for the hierarchical storage feature?

Thanks!

Original comment by goo...@shortepic.com on 25 Jul 2013 at 10:03

GoogleCodeExporter commented 8 years ago
I'd be comfortable doing something like this so long as it isn't default 
behavior. Just to be sure I understand correctly, what you're asking for is:

1. A local, fixed-size cache of recently-accessed files.

2. Asynchronous file uploads, i.e., flush() and close() return immediately 
rather than block waiting for the upload to complete.

3. A configurable delay before files are uploaded.

4. An xattr indicating file upload status.

5. Some mechanism to deal with failed transfers and synchronization conflicts.

My questions for you are:

1. Do we need to be able to mount a bucket while the host is disconnected? This 
implies caching directories, metadata, etc.

2. Further to #5 above: what happens if a file is externally modified after 
being downloaded, such that upon upload we're now overwriting someone else's 
changes? This is easy enough to detect, but how should s3fuse behave upon 
determining that this has happened?

3. Also further to #5: if a transfer fails, do we then make that file entirely 
inaccessible? Or do we let the user continue to open/read/write/close the file? 
Do we eventually give up on ever uploading the file? If so, when?
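
For question 2, one possible way to detect the external change is sketched 
below. It assumes the object's ETag is remembered at download time and 
compared with the current remote ETag just before upload; that is an 
assumption for illustration, not necessarily how s3fuse would do it, and the 
function name is made up. What to do once a change is detected is left open.

```python
def remote_changed_since_download(etag_at_download, etag_now):
    """Report whether the remote object differs from the copy we started with.

    etag_at_download: ETag recorded when the file was fetched, or None if
        the file was created locally and never existed remotely.
    etag_now: ETag currently reported for the object, or None if the object
        does not exist remotely.
    """
    # Any difference counts: modified remotely, deleted remotely, or created
    # remotely while we were working on a new local file of the same name.
    return etag_at_download != etag_now
```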

I think this could be an interesting feature to have; I'd just like to better 
understand how we'd expect it to behave.

Original comment by tar...@bedeir.com on 4 Aug 2013 at 11:23

GoogleCodeExporter commented 8 years ago
1. Yes - ideally you would be able to work on the volume during an airplane 
trip and have it recover when you go online (obviously, you couldn't read any 
old files that had been dropped from local storage). The entire directory 
structure should be mirrored - no need to expire directories.

2. My scenario is single-user (one bucket per host), so this is not a big issue 
for me - whatever is easiest. Treating the host as master and silently 
overwriting the cloud copy would meet my requirement. Even in a multi-user 
scenario, this would be understandable.

3. Must continue to allow access (e.g. to allow further editing before the next 
internet connection). If there is a problem with a file (e.g. the filename is 
illegal in the cloud), could that be a specific value of the upload status?
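
As a rough illustration of what I mean by status values, something like the 
following. All of the names and outcome strings are invented for the example; 
the point is only that a permanent problem (such as an illegal filename) gets 
its own value, distinct from a transfer that simply has not succeeded yet.

```python
from enum import Enum


class UploadStatus(Enum):
    PENDING = "pending"    # changed locally, upload not yet attempted
    UPLOADED = "uploaded"  # cloud copy matches the local copy
    RETRYING = "retrying"  # transient failure, e.g. host is offline
    REJECTED = "rejected"  # permanent failure, e.g. name illegal in cloud


def status_after_attempt(outcome):
    """Map the outcome of one upload attempt to the status to record."""
    return {
        "ok": UploadStatus.UPLOADED,
        "offline": UploadStatus.RETRYING,
        "invalid_name": UploadStatus.REJECTED,
    }[outcome]


def xattr_value(status):
    """Encode a status as bytes, the form in which xattr values are stored."""
    return status.value.encode("ascii")
```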

In terms of a minimum viable product, could the order of delivery be:
v1. retains all files locally; errors on disk full or set limit; online only.
v2. offline ok. periodically sweeps for changed files and pushes to cloud when 
connected; xattr status;
v2a. custom sweep schedule based on frequency, time of day, user intervention 
(prevent/force start), maybe network connection.  
v3. periodically truncates oldest atime files to keep target amount or % local 
space in reserve.
v4. downloads to restore previously truncated files on open.
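
The v3 sweep could be sketched roughly as below. This is only an 
illustration: the tuple layout and function name are made up, and the rule 
that only files already uploaded may be truncated is my assumption (otherwise 
the only copy would be lost).

```python
def choose_files_to_truncate(files, bytes_to_free):
    """Pick files to truncate, oldest atime first, until enough is freed.

    files: iterable of (path, atime, size_in_bytes, uploaded) tuples.
    bytes_to_free: how much local space must be reclaimed to reach the
        target reserve.
    Only files whose contents are already in the cloud are eligible.
    """
    chosen = []
    freed = 0
    for path, atime, size, uploaded in sorted(files, key=lambda f: f[1]):
        if freed >= bytes_to_free:
            break
        if not uploaded:
            continue  # never drop the only copy of a file
        chosen.append(path)
        freed += size
    return chosen
```

The caller would work out `bytes_to_free` from the target amount or 
percentage of local space to keep in reserve.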

This would be superior to all opaque backup systems like dollydrive etc. 

I have an architectural question for you though: is this the right way to use 
FUSE? A completely different strategy would be some kind of daemon that scanned 
the native filesystem, uploaded and truncated in the background, and perhaps 
only restored on user request. Or, taking another tack: if the FUSE filesystem 
weren't mounted over the top, could the local files still be accessible (other 
than the truncated ones)?

Many thanks,
Alister.

Original comment by goo...@shortepic.com on 5 Aug 2013 at 11:01