cberner / fuser

Filesystem in Userspace (FUSE) for Rust
MIT License
835 stars 114 forks

Multithreading Support? #272

Open devmaxde opened 10 months ago

devmaxde commented 10 months ago

Hey, I was wondering if there is multithreading support in the current version. I was experimenting with examples/simple.rs and noticed that copying a large number of files was really slow: my computer had one CPU core pinned at 100% the whole time. Is there a way to improve the performance?

cberner commented 10 months ago

I think it might be possible: https://john-millikin.com/the-fuse-protocol#multi-threading

I haven't tried to implement it though

devmaxde commented 10 months ago

Will you implement it in the near future? I don't think I'll be able to contribute that; my knowledge is too limited.

cberner commented 10 months ago

No. It'll be a long time, if ever, before I get to it.

hopkings2008 commented 3 months ago

Hi, will we consider a multithreaded version of this fuser lib? Currently there is a performance problem caused by the single message loop on the /dev/fuse 'socket': that loop can only run on a single core. In our tests we used fio to measure the bandwidth of the fuser framework. We used the simple example and, in each write call, returned immediately without actually writing the data to the underlying file system. The bandwidth was still limited to a single core of the test machine, and it did not scale with the number of fio jobs once that core reached 100%. If we implement a multithreaded version of the message loop, we can spread the load across multiple cores of the machine and increase the throughput.

volfco commented 3 months ago

I've started work on adding Multi-threading support in my fork here: https://github.com/volfco/fuser

hopkings2008 commented 2 months ago

I've started work on adding Multi-threading support in my fork here: https://github.com/volfco/fuser

thanks a lot, checking it now.

volfco commented 2 months ago

Started an MR to close this.

https://github.com/cberner/fuser/pull/293

mash-graz commented 2 months ago

Started an MR to close this.

does this suggested solution have any significant benefit compared to the fuser-based fuse-mt crate?

I really would like this kind of feature, but I'm also afraid of even more fragmentation caused by all these [partially working] FUSE implementations. It's really hard to decide which one to choose for performance-critical applications. fuser is very likely the best-maintained Rust solution right now, but the async-oriented fuse3 and fuse-mt are perhaps more suitable for this kind of requirement...

hopkings2008 commented 2 months ago

Started an MR to close this.

does this suggested solution have any significant benefit compared to the fuser-based fuse-mt crate?

I really would like this kind of feature, but I'm also afraid of even more fragmentation caused by all these [partially working] FUSE implementations. It's really hard to decide which one to choose for performance-critical applications. fuser is very likely the best-maintained Rust solution right now, but the async-oriented fuse3 and fuse-mt are perhaps more suitable for this kind of requirement...

Hi Mash, I think the fuse-mt crate doesn't improve the throughput over the fuser crate, because the session used by fuse-mt is the same as fuser's. The critical problem is that there is only a single file handle to /dev/fuse per session, so even with multi-threading built on top of that single session handle, we cannot improve the throughput. What this PR does is use handles cloned from /dev/fuse, a mechanism provided by the kernel, to improve the throughput.
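Editor's note: the kernel mechanism described above can be sketched as follows. This is a minimal, hypothetical sketch using raw C bindings (no external crate); the `FUSE_DEV_IOC_CLONE` request code comes from `<linux/fuse.h>`, and fuser's actual internals differ. The `clone_session_fd` helper is an illustrative name, not part of fuser's API.

```rust
use std::fs::OpenOptions;
use std::os::unix::io::{AsRawFd, RawFd};

// _IOR(FUSE_DEV_IOC_MAGIC = 229, 0, u32) from <linux/fuse.h>:
// direction=read (2<<30), size=4 bytes (4<<16), type=229 (229<<8), nr=0.
const FUSE_DEV_IOC_CLONE: u64 = (2 << 30) | (4 << 16) | (229 << 8);

extern "C" {
    // Raw libc ioctl; Linux-only.
    fn ioctl(fd: i32, request: u64, ...) -> i32;
}

// Open a second /dev/fuse fd and bind it to an existing session, so a
// worker thread can run its own independent read/dispatch loop on it.
#[allow(dead_code)]
fn clone_session_fd(session_fd: RawFd) -> std::io::Result<std::fs::File> {
    let clone = OpenOptions::new().read(true).write(true).open("/dev/fuse")?;
    let src: u32 = session_fd as u32;
    let rc = unsafe { ioctl(clone.as_raw_fd(), FUSE_DEV_IOC_CLONE, &src as *const u32) };
    if rc == 0 {
        Ok(clone)
    } else {
        Err(std::io::Error::last_os_error())
    }
}

fn main() {
    // Sanity-check the computed ioctl request code.
    println!("FUSE_DEV_IOC_CLONE = {FUSE_DEV_IOC_CLONE:#x}");
    // With a mounted session, each worker thread would call
    // clone_session_fd(...) once and then loop on its own descriptor.
}
```

Each cloned descriptor gets its own request queue position in the kernel, which is what allows the read loops to run truly in parallel.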

volfco commented 2 months ago

does this suggested solution have any significant benefit compared to the fuser-based fuse-mt crate?

fuse-mt is single-threaded when communicating with the kernel, same as fuse3. They both send kernel requests to a threadpool where multi-threaded processing can happen, but I would not call either of them truly multi-threaded.

afraid of even more fragmentation

From what I saw, there is zero standard way libraries handle multi-threading (if they try at all). The best async implementation I've seen is https://github.com/datenlord/datenlord/blob/master/src/async_fuse, but they're not shipping it as a standalone crate.

fuse3 would require a bit of rework to support multi-threaded kernel communication. For my uses, async is less than ideal, which is why I avoided looking further at fuse3.

mash-graz commented 2 months ago

Thanks for your explanations.

In the meanwhile I also took a look at the relevant source files.

The approach suggested here and used in the datenlord implementation looks very similar to the strategy described in https://john-millikin.com/the-fuse-protocol#multi-threading. This may have the benefit that it simply reproduces conventions already used in other traditional C implementations.

fuse-mt takes a slightly different path. It does not clone the whole Filesystem interface across worker threads; it only executes the four most likely blocking functions (read, write, flush, fsync) inside a thread pool. The other routines and the whole session management still run in a single-threaded manner.

I'm not sure whether this second variant is really slower in practice.

Very likely it will not cover as many corner cases and application-specific demands as the other one, because only those four functions are wrapped in concurrency improvements; on the other hand, it may reduce the synchronization overhead and code complexity of all other calls.

In my specific use case, multithreading is anyway only a rather simple kind of improvement, not a perfect solution. It makes more sense in cases where you really run lots of computation within the application itself. But if you are mostly waiting on external processes and similar kinds of blocking, lightweight async solutions resp. their generated state machines would be a more desirable model for handling concurrency. But we all know that in real life this variant doesn't come without other significant drawbacks...

hopkings2008 commented 2 months ago

Hi Mash, does fuse-mt use the same session on /dev/fuse for the filesystem interface, including those four blocking functions (read, write, flush, fsync)? If so, it will not improve the throughput of the whole file system, because this single-session mechanism can only use a single core of the system. The throughput is bounded by that single core and cannot scale across multiple cores, so the overall throughput is limited. A single session with thread pooling in the upper layer is still bound to a single core.

mash-graz commented 2 months ago

Does fuse-mt use the same session on /dev/fuse for the filesystem interface, including those four blocking functions (read, write, flush, fsync)?

Yes -- it's using the old mount() of fuser, which establishes the Session context and dispatches incoming requests from the FUSE kernel module.

If so, it will not improve the throughput of the whole file system, because this single-session mechanism can only use a single core of the system.

No -- I don't agree. fuse-mt does use multiple threads resp. cores of your machine for the four compute-intensive request handlers. They are executed on the threadpool within the FileSystem trait (e.g. in this write routine). This is possible because you are allowed to split the connection of the low-level Request / Response pair in the Filesystem handlers and finalize the task later in another thread. It's astonishingly simple!
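Editor's note: the split-and-finalize pattern described above can be sketched with only the standard library. `ReplyWrite` here is a stand-in type, not fuser's; in fuser the reply objects are `Send` and are consumed when answered, which is what makes moving them onto a pool thread possible.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for fuser's ReplyWrite: consuming `written` delivers the result
// back to whoever is waiting (in fuser, that would be the kernel).
struct ReplyWrite {
    tx: mpsc::Sender<u32>,
}

impl ReplyWrite {
    fn written(self, size: u32) {
        let _ = self.tx.send(size);
    }
}

// The session loop hands the payload and the reply handle to a worker
// thread and can immediately go back to reading the next request.
fn dispatch_write(data: Vec<u8>) -> u32 {
    let (tx, rx) = mpsc::channel();
    let reply = ReplyWrite { tx };
    thread::spawn(move || {
        // ...slow I/O would happen here, off the session thread...
        reply.written(data.len() as u32); // finalized from another thread
    });
    rx.recv().expect("worker dropped the reply")
}

fn main() {
    let acked = dispatch_write(vec![0u8; 4096]);
    println!("acked {acked} bytes off the session thread");
}
```

Note that even with this pattern, all requests still enter through the one session loop, which is the bottleneck the thread discusses.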

But is this solution really the key to more throughput?

The relevant papers don't give an answer, because in their comparisons multithreading is just one improvement besides splice, larger transfer blocks, etc. (e.g. https://www.fsl.cs.stonybrook.edu/docs/fuse/bharath-msthesis.pdf). Utilizing multiple threads in parallel may improve some kinds of access but will not help others. Context switches are really expensive and may quickly eliminate the expected benefits.

It's interesting that the fuse-mt developer even experimented with similarly simple async solutions (see: https://github.com/wfraser/fuse-mt/issues/3) to optimize this aspect.

The required adaptations look rather small. That's somewhat encouraging, because I would really like to base my work on a mature, widely used, and well-maintained crate like fuser, but also be able to adapt it to my needs in a similar manner. It's perhaps more a question of missing documentation and examples...

hopkings2008 commented 2 months ago

Hi Mash, one test of a framework's throughput (fuser or fuse-mt) can be made by just returning the number of written bytes from the write interface. We can then check that the framework's write throughput is limited by the single session on /dev/fuse, regardless of how many threads handle the requests.
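Editor's note: the benchmark handler proposed above would look roughly like this. The trait and types below are std-only stand-ins mirroring the shape (not the exact signature) of fuser's `Filesystem::write`; `NullFs` is a hypothetical name. Mounting such a filesystem and running fio against it measures only the framework and session-loop overhead, since no data ever reaches storage.

```rust
// Stand-in reply type recording what the handler acknowledged.
struct ReplyWrite {
    written: Option<u32>,
}

impl ReplyWrite {
    fn written(&mut self, size: u32) {
        self.written = Some(size);
    }
}

// Mirrors the shape of fuser's Filesystem::write, simplified for the sketch.
trait Filesystem {
    fn write(&mut self, ino: u64, offset: i64, data: &[u8], reply: &mut ReplyWrite);
}

// Null filesystem: ack every write immediately, touching no storage.
struct NullFs;

impl Filesystem for NullFs {
    fn write(&mut self, _ino: u64, _offset: i64, data: &[u8], reply: &mut ReplyWrite) {
        reply.written(data.len() as u32); // report success without any I/O
    }
}

fn main() {
    let mut fs = NullFs;
    let mut reply = ReplyWrite { written: None };
    fs.write(1, 0, &[0u8; 4096], &mut reply);
    println!("null write acked {:?} bytes", reply.written);
}
```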

volfco commented 2 months ago

fuse-mt bills itself as a high-level library while calling [fuser] the low-level library, so I don't think they can be compared directly.

But is this solution really the key to more throughput?

The primary motivation is to have multiple independent I/O streams between userspace and the kernel over different file descriptors.

fuse-mt is still limited by a lock on the file descriptor, as multiple threads will be trying to use the same descriptor.

Utilizing multiple threads in parallel may improve some kinds of access, but will not help in others. Context switches are really expensive and may quickly eliminate the expected benefits.

My eventual goal is to have my filesystem pin itself and the associated kernel thread to a specific core (which it would have exclusive access to; I'm assuming this will enable higher performance), but even if it's only a 20% gain, that's still worth having.
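Editor's note: the pinning step mentioned above can be sketched with `sched_setaffinity(2)` and a raw libc binding, so no external crate is needed (Linux-only; `pin_to_core` is an illustrative helper name, and core 0 is assumed to be in the process's allowed CPU set).

```rust
use std::io;

#[repr(C)]
#[derive(Clone, Copy)]
struct CpuSet {
    bits: [u64; 16], // 1024 CPUs, matching glibc's cpu_set_t layout
}

extern "C" {
    // int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
    fn sched_setaffinity(pid: i32, cpusetsize: usize, mask: *const CpuSet) -> i32;
}

// Restrict the calling thread to a single CPU core.
fn pin_to_core(core: usize) -> io::Result<()> {
    let mut set = CpuSet { bits: [0; 16] };
    set.bits[core / 64] |= 1 << (core % 64);
    // pid 0 means "the calling thread".
    let rc = unsafe { sched_setaffinity(0, std::mem::size_of::<CpuSet>(), &set) };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}

fn main() {
    // A session loop would call this once before entering its read loop.
    pin_to_core(0).expect("failed to set affinity");
    println!("pinned to core 0");
}
```

Each cloned-session worker thread would pin itself to a different core before starting its /dev/fuse read loop.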