Doloops / mcachefs

mcachefs : Simple filesystem-based file cache based on fuse

Issue #14 - basic size consistency check for backed files #20

Closed hradec closed 4 years ago

hradec commented 5 years ago

Introducing a basic size and modified-time consistency check of backed files, so mcachefs can re-download broken files.

It gets the metadata size/mtime in mcachefs_open(), where the metadata is already locked, so we don't introduce a new metadata lock/unlock block and avoid the risk of deadlock.

The size/mtime are stored in an st struct and passed down to mcachefs_open_mfile(), and then to mcachefs_check_fileincache(), where we do the size/mtime checking.
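To make the size part of the check concrete, here is a minimal sketch. It is not the code from this PR: cached_size_matches() is a hypothetical helper, and it assumes the expected size has already been read from the metadata by the caller (as mcachefs_open() does while holding the lock).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper, NOT the real mcachefs_check_fileincache().
 * Returns true when the cached copy exists and its size equals the
 * size recorded in the metadata. */
static bool cached_size_matches(const char *cache_path, off_t expected_size)
{
    struct stat cached;

    /* A missing or unreadable cache file is treated as "not cached". */
    if (stat(cache_path, &cached) != 0)
        return false;

    return cached.st_size == expected_size;
}
```

A caller would re-download the file from the backend whenever this returns false. Note that a size-only check cannot catch a backend change that leaves the size unchanged.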


Currently we can't do the mtime check, since the backed file doesn't reflect the original mtime of the backend file when it was first backed up.

We need to set the backend's atime/mtime/ctime on the cached file when it finishes downloading, so we can use mtime to detect if the cached file needs to be refreshed from the backend.
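A possible way to stamp the cached file once the download completes is sketched below, assuming a POSIX.1-2008 system (Linux field names st_atim/st_mtim). copy_backend_times() is a hypothetical name, not an existing mcachefs function. One caveat: atime and mtime can be set this way, but ctime cannot be set from user space; the kernel updates it on its own.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: copy atime/mtime from the backend file's stat
 * result onto the freshly downloaded cache file.
 * Returns 0 on success, -1 on error (errno set by utimensat). */
static int copy_backend_times(const char *cache_path, const struct stat *backend)
{
    struct timespec times[2];

    times[0] = backend->st_atim;   /* access time */
    times[1] = backend->st_mtim;   /* modification time */

    /* AT_FDCWD: a relative cache_path is resolved against the cwd. */
    return utimensat(AT_FDCWD, cache_path, times, 0);
}
```

With this in place, the cached file's mtime equals the backend mtime at download time, so a later difference between the two indicates a change on one side.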

hradec commented 5 years ago

this is the correct fix for issue #14

Doloops commented 4 years ago

Hi @hradec

I gladly merged your two other PRs (after some time, sorry about that), but this one is a bit more tricky.

I am still confused on how to properly solve this "who modifies what, mcachefs needs to know but is not allowed to ask" puzzle.

Consider a file f, existing in the remote filesystem, with an mtime of t0. When copied locally (backup), it keeps an mtime of t0.

Local modifications: t0 becomes t1 (locally). That, we know well. Remote modifications: t0 becomes t2 (remotely). That, we can't know except by checking the remote mtime from time to time.

But there is no certainty that t1 > t2 or t1 < t2, nor that we just want to keep the freshest of the two! Especially if there are some relationships between multiple files...

At least we should keep track of t0 somewhere (in the metadata?). If the remote has not changed at all in between (and thus keeps mtime=t0), then we have the freshest version no matter what. If local still has mtime=t0 and remote has mtime > t0, we need to upgrade for sure. But what to do if local mtime != t0 and remote mtime != t0?
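The cases above can be written down as a small decision function. This is only a sketch of the reasoning, not mcachefs code: the enum and classify_sync() are hypothetical names, and it assumes that any mtime different from t0 (not only a later one) counts as a change.

```c
#include <time.h>

/* Hypothetical classification of a cached file, given:
 *   t0 = mtime recorded when the file was cached,
 *   t1 = current local mtime,
 *   t2 = current remote mtime. */
enum sync_state {
    SYNC_UP_TO_DATE,    /* neither side changed since t0 */
    SYNC_LOCAL_NEWER,   /* only the local copy changed */
    SYNC_REMOTE_NEWER,  /* only the remote changed: refresh the cache */
    SYNC_CONFLICT       /* both changed: no automatic answer */
};

static enum sync_state classify_sync(time_t t0, time_t t1, time_t t2)
{
    int local_changed  = (t1 != t0);
    int remote_changed = (t2 != t0);

    if (!local_changed && !remote_changed)
        return SYNC_UP_TO_DATE;
    if (local_changed && !remote_changed)
        return SYNC_LOCAL_NEWER;
    if (!local_changed)
        return SYNC_REMOTE_NEWER;
    return SYNC_CONFLICT;
}
```

The first three outcomes are decidable from timestamps alone; SYNC_CONFLICT is exactly the open question, since comparing t1 and t2 does not say which content should win.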

And then, there is issue #14 ! Fundamentally, detecting that a file transfer has gone wrong may not be that different than detecting that a file transfer must be refreshed (= has become wrong). I was thinking of a more flexible mechanism where we may not need to keep the whole file in local cache (keeping only beginning of files for preview of videos for example).

By the way, I did re-indent the whole code (it was getting very messy), so they won't merge properly... (Sorry for this as well).

But I plan on renaming all the backup/writeback source/target mountpoints to a more consistent naming pattern... Just don't know which yet.

hradec commented 4 years ago

Before anything, I'm going to define this just for my own sanity (I have dyslexia, so this helps me! LOL):

t0 = original mtime already stored in metadata
t1 = local mtime
t2 = remote mtime

> If local still has mtime=t0 and remote has mtime > t0, we need to upgrade for sure.

Indeed. Actually, that's what this pull request is all about. I need to be able to update local files in some way, so I made this change to allow a re-mount to detect differences and update the local cache, since after a re-mount mcachefs seems to retrieve metadata from the backend again (I didn't confirm this in the code, but that seems to be the behavior I observed).

Without this change, even deleting the metafile will not trigger local files to be updated.

But I already ran into the problem of a backend-modified file not being detected because the backend change didn't resize the file.

So I think I need to keep comparing t2 and t1. I removed this comparison in commit 45e54db2c12169f89c7da2d1dbcfd688699d7269 because I noticed the cached files don't have the original t0 from the backend. In fact, they have their own mtime, which is the time the file was created when cached. So all cached files have t1 >= t0 after they are first cached.

Which is a problem if we want to detect whether a cached file has been modified or not, in case the metafile/journal was deleted or emptied. In that case, the only way we can detect that a file needs to be updated is by file size, since t1 is always >= t0. (We never know whether a local cache file was locally modified or not.)

In my opinion, the correct thing to do (correct me if I'm wrong) is to make sure t1 is equal to t0 after a file is cached! That way we have t0 "stored" in the local cache, even if we lose the metadata! And if we modify it locally, even if we lose the metadata/journal, mcachefs knows the local cache is newer than the remote because its t1 is newer than t0.

I'm going to revise this pull request, removing 45e54db2c12169f89c7da2d1dbcfd688699d7269 and making sure t1 = t0 when a file is cached, so it can correctly check t1 against t2. Make sense?

> But what to do if local mtime != t0 and remote mtime != t0 ?

That's a "no-solution" problem, isn't it? For example, I think the most famous file sync tool "for the masses", Dropbox, also has no solution for it. In fact, when you end up with the same file modified in 2 different places without being synced for a while, once they are synced, Dropbox creates a copy of the file with an added suffix informing you where each copy comes from. There's no way for it to know which one is the right one, so it lets you decide!

We could create a similar mechanism... pulling the backend file with an added suffix informing the user that the files now differ on both locations. Maybe have a .mcachefs/outofsync_files so we have a centralized location with a list of "clashes"?

This way the user can choose to overwrite the local copy by just moving the suffixed file over it. After that, the local copy would be back in sync and out of the .mcachefs/outofsync_files list.

If the user chooses to keep the local copy, mcachefs will always detect it's out of sync, in which case it should update the suffixed version whenever its t1_suffixed (just the suffixed file's mtime; I don't think we need to keep this file in the metadata) is older than t2.
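A rough sketch of the two building blocks this would need, just to make the idea concrete. Everything here is hypothetical: the ".remote-conflict" suffix, the helper names, and the one-path-per-line format of the clash list are assumptions, not anything mcachefs implements today.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical suffix for the copy pulled from the backend. */
#define CONFLICT_SUFFIX ".remote-conflict"

/* Build the name of the suffixed copy for `path` into `out`.
 * Returns 0 on success, -1 if `out` is too small. */
static int conflict_path(const char *path, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "%s%s", path, CONFLICT_SUFFIX);

    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Append `path` to the clash list (one path per line).
 * Returns 0 on success, -1 on error. */
static int record_conflict(const char *list_path, const char *path)
{
    FILE *list = fopen(list_path, "a");
    int ok;

    if (list == NULL)
        return -1;

    ok = fprintf(list, "%s\n", path) > 0;
    if (fclose(list) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Resolving a clash would then be a rename of the suffixed file over the local one, followed by removing the entry from the list; that part is left out here.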

> At least we should keep a track of t0 somewhere (in the metadata ?).

Well... t0 is already in the metadata when you mount mcachefs, right? (If we already have a metafile.) As soon as mcachefs retrieves t2, it still has t0 before replacing it. At this point, mcachefs already has the data to detect that the backend is newer than the local copy, no matter if the local copy is modified or not. So I think you're right... I would keep t0 at this point instead of just replacing it. So we still have the t0/t1 relation.

If we choose to create a suffixed copy of the file (in the t1 > t0 and t2 > t0 scenario), we indeed need t0 to detect it. (We may find it useful to have t0 for other situations we haven't predicted yet.)

After the conflict is resolved, we could turn t0=(current local mtime), which would flag that the file has no conflict. (and we can remove it from .mcachefs/outofsync_files)

> And then, there is issue #14 !

you're right... sorting out the t0/t1/t2 conundrum will cover issue #14.

> I was thinking of a more flexible mechanism where we may not need to keep the whole file in local cache (keeping only beginning of files for preview of videos for example).

I see... you can transfer only what is requested of a file... no need to transfer everything if the filesystem got a request for just 100 bytes. But in that case, I would allocate the whole size of the file, even if just full of zeros, and use it as a pool to write the requested bits. This way, the local file still matches the remote version in size. It also has the advantage of not suddenly running out of disk space later on, when we get a request to read the entire file.
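A sketch of that allocation idea, assuming a POSIX.1-2008 system. reserve_cache_file() and store_chunk() are hypothetical names. posix_fallocate() is used because it actually reserves the blocks; ftruncate() alone would give a file of the right size but a sparse one, so the disk could still fill up later.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve disk space for the whole file up front. full_size must be > 0.
 * Returns 0 on success, or an error number (posix_fallocate does not
 * set errno). */
static int reserve_cache_file(int fd, off_t full_size)
{
    return posix_fallocate(fd, 0, full_size);
}

/* Store a chunk fetched from the backend at its original offset.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t store_chunk(int fd, const void *buf, size_t len, off_t offset)
{
    return pwrite(fd, buf, len, offset);
}
```

With this layout, zero-filled regions look the same as regions not fetched yet, so a separate record of which ranges are valid would presumably be needed as well.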

> By the way, I did re-indent the whole code (it was getting very messy), so they won't merge properly... (Sorry for this as well).

No worries...

> But I plan on renaming all the backup/writeback source/target mountpoints to a more consistent naming pattern... Just don't know which yet.

Cool.. I have to confess I do get a bit confused with the names sometimes!! :P But it's not a big deal!

hradec commented 4 years ago

I'm closing this for now and will create a new one based on the latest master branch, after working on mtime comparing.