Doloops / mcachefs

mcachefs : Simple filesystem-based file cache based on fuse

parallel transfer of files, full directory caching at first access and thread/daemon for backend sync! #10

Open hradec opened 5 years ago

hradec commented 5 years ago

I have written a cache filesystem as well (hradecFS here on github), and although mine works well, I'm having a lot of trouble with thread locking (my experience debugging multi-threaded code in a fuse filesystem has been a nightmare), apart from a major design flaw that I recently figured out and that will require a lot of re-coding.

Your mcachefs code seems to use a similar idea as mine (assuming the backend never changes, transferring full-size files once they're opened), and I really like how well it behaves, especially with search paths like PYTHONPATH, for example.

I just loved the journal idea... so nice and well implemented! And your mcachefs has a transfer queue, which I never got around to writing for hradecFS.

I'm really considering putting my hradecFS aside for now and implementing a few concepts from it in mcachefs, for example:


1. parallel transfer of files in the queue - This one was huge for hradecFS, especially when you have large files on the backend and everything else in the queue has to wait for them to finish. Having parallel transfers not only makes the filesystem more responsive, but also improves bandwidth usage, especially over a WAN. After I got parallel transfers working, responsiveness improved tremendously.

2. full directory caching at first directory access - Every time a file was accessed in a certain directory, hradecFS would cache the whole directory listing, since it was querying the backend directory anyway. With the full listing cached, subsequent lookups of files in the same directory don't require a backend query to check whether the file exists - we already know if the file is there or not! This is a must for search paths like PYTHONPATH, LD_LIBRARY_PATH and PATH (a rough sketch of what I mean follows this list). I'm not sure if mcachefs already does this since I haven't examined all the code yet, but if it does, great!! (Having used it for a bit, it does feel like it does, considering how fast it traverses PYTHONPATH in my tests...)

3. secondary maintenance thread (or a service daemon) to deal with backend updates on cached data - This one was at the end of my list for hradecFS, and I never got there. But the idea was to have something in the background checking the backend for changes and syncing the cache accordingly. This way hradecFS would still "see" changes in the backend, with a little delay after they happen, without any impact whatsoever on responsiveness.
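To make item 2 concrete, here is a rough sketch of the kind of caching I mean. It's a simplified illustration, not hradecFS's or mcachefs's actual code, and all the names are made up:

```c
/* Minimal sketch of idea 2 (invented names, not real mcachefs/hradecFS code).
 * On the first lookup inside a directory, read the entire backend listing once
 * and remember every name. Subsequent lookups in the same directory, including
 * negative ones ("no such file"), never touch the backend again. */
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

struct dir_cache {
    char **names;   /* every entry name seen in the backend directory */
    size_t count;
    int populated;  /* nonzero once the whole listing has been cached */
};

/* Populate the cache with one readdir() pass over the backend directory. */
static int dir_cache_fill(struct dir_cache *dc, const char *backend_path)
{
    DIR *d = opendir(backend_path);
    if (!d)
        return -1;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        char **grown = realloc(dc->names, (dc->count + 1) * sizeof(*grown));
        if (!grown)
            break;
        dc->names = grown;
        dc->names[dc->count++] = strdup(e->d_name);
    }
    closedir(d);
    dc->populated = 1;
    return 0;
}

/* Answer "does this name exist?" from the cache: 1 = yes, 0 = definitely not. */
static int dir_cache_lookup(const struct dir_cache *dc, const char *name)
{
    for (size_t i = 0; i < dc->count; i++)
        if (strcmp(dc->names[i], name) == 0)
            return 1;
    return 0;
}
```

The important part is the negative answer: once the listing is cached, "file not found" can be answered locally, which is exactly what a PYTHONPATH scan hammers.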


I'm forking your code right now to start working on parallel file transfers. I would love to know your thoughts about it, and also about the other 2 ideas.

And I would also love to hear about your future plans for mcachefs!

Last but not least, please let me know about any quirks regarding threading in your code, especially problems you faced and bugs you ran into during development... again, threading has been my nightmare with fuse!! Anything you feel like sharing would be appreciated!

anyhow, great work and thanks for sharing it!!

amazing! cheers... -H

hradec commented 5 years ago

1. parallel transfer of file in the queue :

Actually, it's already possible to use more than one thread for transferring. In src/mcachefs-config.c, line 191, one can set the value to more than 1 to set the number of parallel threads to use for each file type.

I set it to 2, and mcachefs was able to start a second transfer from the backend in parallel. Only the third transfer would wait in the queue.

Is there any particular reason it's hardcoded to 1?
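Just so we're talking about the same pattern: what I mean by parallel transfer is the classic pool of N workers pulling from a single shared queue, so with 2 workers a third job simply waits for a free worker. A generic pthread sketch (not your actual transfer code; names invented, and the queue is LIFO just for brevity):

```c
/* Generic worker-pool pattern: N transfer threads pull jobs from one shared
 * queue. With nb_threads == 1 every transfer is serialized; with 2, a third
 * job waits until a worker frees up. */
#include <pthread.h>
#include <stdlib.h>

struct transfer_job {
    char *path;                  /* backend file to copy into the cache */
    struct transfer_job *next;
};

static struct transfer_job *queue_head;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_nonempty = PTHREAD_COND_INITIALIZER;

/* Hypothetical stand-in for the real backend -> cache copy. */
static void copy_from_backend(const char *path) { (void) path; }

static void enqueue_transfer(struct transfer_job *job)
{
    pthread_mutex_lock(&queue_lock);
    job->next = queue_head;      /* push at head (LIFO) to keep the sketch short */
    queue_head = job;
    pthread_cond_signal(&queue_nonempty);
    pthread_mutex_unlock(&queue_lock);
}

static void *transfer_worker(void *arg)
{
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_nonempty, &queue_lock);
        struct transfer_job *job = queue_head;
        queue_head = job->next;
        pthread_mutex_unlock(&queue_lock);

        copy_from_backend(job->path);   /* the actual transfer runs unlocked */
        free(job->path);
        free(job);
    }
    return NULL;
}

/* Starting nb_threads workers is what makes transfers run in parallel. */
static void start_transfer_threads(int nb_threads)
{
    for (int i = 0; i < nb_threads; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, transfer_worker, NULL);
        pthread_detach(tid);
    }
}
```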

hradec commented 5 years ago

FYI, I've been running it with 8 transfer threads for a day or so, and it's working perfectly... it's actually much faster when loading data with multi-threaded software like Autodesk Maya.

I've actually mounted a filesystem from Brazil here in Canada using sshfs, and by putting mcachefs on top of it I'm able to RUN Autodesk Maya from this filesystem in Brazil, load scenes, make changes and even do 3D renders!

It takes a while the first time you load something, but after that it's like working locally!

pretty darn cool!!

Doloops commented 5 years ago

Hi Roberto,

I'm glad this piece of code has pleased you ! To be truly honest, it has been a while since I worked on it, too many other (professional) projects to handle. But I still use mcachefs on a daily basis, mainly as a local cache on top of a remote NAS.

The default value of 1 for mcachefs_config_get_transfer_threads_nb comes from the fact that this value should be made configurable via some command-line option.

Note : in previous versions of mcachefs, the whole configuration was provided via an external configuration file. This has been dumped because argument-based passing is much more flexible (but I agree this design choice is debatable). At that time it was possible to configure the number of threads in the configuration file.
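For reference, the natural way to expose that on the command line would be a fuse_opt mount option, roughly like this. The option name and the config struct are invented for the example; this is not something mcachefs provides today:

```c
/* Rough sketch of a "-o transfer_threads=N" mount option parsed with fuse_opt.
 * Build with: gcc $(pkg-config --cflags --libs fuse) sketch.c */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <fuse_opt.h>
#include <stddef.h>

struct mcachefs_cmdline {
    int transfer_threads;
};

static const struct fuse_opt mcachefs_opts[] = {
    { "transfer_threads=%d", offsetof(struct mcachefs_cmdline, transfer_threads), 0 },
    FUSE_OPT_END
};

int main(int argc, char *argv[])
{
    struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
    struct mcachefs_cmdline conf = { .transfer_threads = 1 };  /* current default */

    if (fuse_opt_parse(&args, &conf, mcachefs_opts, NULL) == -1)
        return 1;

    /* conf.transfer_threads would then replace the value currently hardcoded
     * in mcachefs_config_get_transfer_threads_nb(). */
    fuse_opt_free_args(&args);
    return 0;
}
```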

Another note : the background threads are specialized for 3 different tasks:

Yes, mcachefs stores directory content and file stats locally (that's the purpose of the mcachefs-metadata.c file).

There was some work on partial cleanup of the local cache, but mcachefs was built with the idea in mind that the backend filesystem was not changing, and the only changes were provided by the local side via mcachefs (thus, applying the journal was enough).

At least, having a post-apply_journal refresh phase where we check if there are newer files, or deleted ones, would make sense. Still, some headaches on how to handle conflicts between local and backend files...
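Roughly, such a refresh pass could look like the sketch below (all names invented, this is not existing mcachefs code). The CONFLICT branch is exactly the open question:

```c
/* Sketch of a post-apply_journal refresh pass. For every path we have cached,
 * re-stat the backend: a newer backend mtime invalidates the cached copy,
 * ENOENT means the file was deleted remotely, and "changed on both sides"
 * is the conflict case that still needs a policy. */
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

struct cached_entry {
    const char *backend_path;   /* path of the file on the backend */
    time_t cached_mtime;        /* backend mtime at the time we cached it */
    int locally_modified;       /* touched through mcachefs since then? */
};

enum refresh_action { KEEP, REFETCH, DROP, CONFLICT };

static enum refresh_action refresh_one(const struct cached_entry *e)
{
    struct stat st;
    if (stat(e->backend_path, &st) != 0)
        return errno == ENOENT ? DROP : KEEP;   /* deleted remotely, or backend unreachable */

    if (st.st_mtime <= e->cached_mtime)
        return KEEP;                            /* backend unchanged: cache still valid */

    /* Backend is newer than what we cached. */
    return e->locally_modified ? CONFLICT       /* changed on both sides: needs a policy */
                               : REFETCH;       /* safe to invalidate and re-download */
}
```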

Then again, if you have some time to spend on this code, and are willing to help, you are more than welcome !

If you need some help with technicalities, subtleties, questions, or remarks, please do not hesitate !

Best regards,

hradec commented 5 years ago

Thanks for the detailed reply... really appreciated!! I wasn't sure what the other 2 thread types were!! Thanks a lot!

I must say, I'm loving the whole journal design... it's awesome!

Another thing that surprised me was the read_state parameter! At some point, I had the need to cache the directories only, but not cache big files... I started looking at the code to change it and saw that it WAS ALREADY IMPLEMENTED, by setting read_state to nocache!! Amazing!!

> Note : in previous versions of mcachefs, the whole configuration was provided via an external configuration file. This has been dumped because argument-based passing is much more flexible (but I agree this design choice is debatable).

I'm with you. Arguments are way better and much more self-contained and flexible.

> At that time it was possible to configure the number of threads in the configuration file.

Gotcha! That's perfect then... I have been running with 8 threads for a while, and it's all good so far!

> At least, having a post-apply_journal refresh phase where we check if there are newer files, or deleted ones, would make sense.

In my case, I use mcachefs to process files from the backend to create new files on slave machines, over the internet. Immediately after those files have been created, they are uploaded back to a central server using rsync, so they are not important on the local machine anymore. (I plan to use apply_journal or write_state in the future, but since it has been crashing a lot for me, I'm keeping a separate rsync running from cron for now.)

That's why I need mcachefs to see new files in the backend independently! For now, I'm basically resetting mcachefs with flush_metadata from time to time.

But in the future I'm planning to write something that will, at least, lazily and continuously re-synchronize directory metadata independently of access, so that if new files/folders are created in the backend, they will show up for processing.
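Something as simple as a detached thread waking up every few minutes would probably be enough for my case. A sketch, with invented helper names, just to show the idea:

```c
/* Sketch of a lazy, periodic metadata re-sync (all names invented). A detached
 * background thread wakes up at a fixed interval and re-reads the backend
 * listings of the directories already present in the metadata cache, so files
 * created directly on the backend eventually appear without flushing everything. */
#include <pthread.h>
#include <unistd.h>

/* Stand-in for "re-read the backend listing of every cached directory". */
static void metadata_resync_all(void) { /* would call the real refresh code */ }

static void *lazy_resync_thread(void *arg)
{
    unsigned interval_seconds = *(unsigned *) arg;
    for (;;) {
        sleep(interval_seconds);   /* a delay is fine: changes may show up lazily */
        metadata_resync_all();
    }
    return NULL;
}

static void start_lazy_resync(unsigned *interval_seconds)
{
    pthread_t tid;
    pthread_create(&tid, NULL, lazy_resync_thread, interval_seconds);
    pthread_detach(tid);
}
```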

> Then again, if you have some time to spend on this code, and are willing to help, you are more than welcome !

Thanks... I definitely will!

> If you need some help with technicalities, subtleties, questions, or remarks, please do not hesitate !

That's really nice!! Thanks again... I tend to try to figure things out by myself most of the time, but I'll ask you if something is taking too long for me. Really appreciated!

btw, I actually have 2 questions:

  1. could you explain what the other modes for read_state are (full, handsup and quitting)?
  2. what's the difference between the flush and force write states?

cheers mate! -H

Doloops commented 4 years ago

Hi @hradec,

Seems I just didn't see your questions from about 3 months ago... Sorry about that.

I assume you have the answers by now, but just for completeness (and for future documentation ;)

read_state is as follows:

- normal : use local files if they exist, try to cache files locally otherwise
- full : the local filesystem is full, nothing will be added (see nocache)
- handsup : do not use the remote filesystem in any case
- quitting : mcachefs has been asked to unmount, so stop the transfer threads
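For that future documentation, the semantics map onto something like the following. The enum and its names are just an illustration of the behaviour described above, not necessarily the internal names mcachefs uses:

```c
/* Illustration of the described read_state semantics (invented names). */
enum read_state {
    READ_STATE_NORMAL,    /* use local files if they exist, cache from backend otherwise */
    READ_STATE_FULL,      /* local cache is full: serve what is cached, add nothing (see nocache) */
    READ_STATE_HANDSUP,   /* never touch the remote filesystem */
    READ_STATE_QUITTING   /* unmount requested: stop the transfer threads */
};
```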

The write states are... Not implemented yet ;) Current status is : write to local, and apply journal.

That would be where a periodic journal apply would occur I guess.

hradec commented 4 years ago

> Seems I just didn't see your questions from about 3 months ago... Sorry about that.

No worries at all... I know how hectic life can be!!

Thanks for the clarification... Actually, I didn't know about quitting yet, since I was mostly interested in the handsup one!!

cheers...