Open kishorenc opened 2 years ago
I don't think there is any other way to speed up snapshot transfer: the files in a snapshot are transferred sequentially.
Thanks for the clarification. Would you accept a patch that possibly parallelized this operation?
Of course!
@PFZheng
Here's a proposed approach that I want to run by you before making changes:
Have LocalSnapshotCopier::copy_file take a vector of filenames: https://github.com/baidu/braft/blob/2c9f611ad916b34833aeb2a011538dad9e957ade/src/braft/snapshot.cpp#L946
Move _copier and _cur_session into a struct and use a vector of structs so that each copy thread can have a separate copy state.
Please let me know if this sounds like a good approach. Also, is it okay to use std::thread? If not, please point me to some examples in the code where threads are managed via bthread to use as reference.
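To make the proposal concrete, here is a minimal sketch of the "one state struct per file, one vector of structs" idea. It is not braft code: CopyState, copy_one_file and copy_files_parallel are hypothetical names, the per-file transfer is a stand-in that does no I/O, and std::thread is used only because that is what the question above asks about (bthread may well be the better fit inside braft).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-file copy state. In the proposal this struct would hold
// what _copier and _cur_session hold today; here it only records the outcome.
struct CopyState {
    std::string filename;
    bool done = false;
    int error_code = 0;
};

// Stand-in for the real per-file transfer (hypothetical, performs no I/O).
// An empty filename is reported as error 22 (EINVAL) purely for illustration.
inline void copy_one_file(CopyState& state) {
    state.error_code = state.filename.empty() ? 22 : 0;
    state.done = true;
}

// Copies all files using `num_threads` workers. Each worker touches only the
// CopyState entries it owns, so the states themselves need no locking.
inline std::vector<CopyState> copy_files_parallel(
        const std::vector<std::string>& filenames, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<CopyState> states(filenames.size());
    for (std::size_t i = 0; i < filenames.size(); ++i) {
        states[i].filename = filenames[i];
    }

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&states, t, num_threads] {
            // Worker t handles indices t, t + n, t + 2n, ...
            for (std::size_t i = t; i < states.size(); i += num_threads) {
                copy_one_file(states[i]);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return states;
}
```

Calling copy_files_parallel({"a.sst", "b.sst"}, 2) returns one CopyState per file after all workers have joined. The sketch only covers the client-side bookkeeping; it says nothing about whether the serving side can handle concurrent reads.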
@PFZheng
Took a stab at this but ran into concurrency issues with FileServiceImpl::get_file, which seems to have trouble serving multiple files in parallel. Specifically, we got an error here when we tried to download multiple snapshot files in parallel:
https://github.com/baidu/braft/blob/master/src/braft/file_service.cpp#L70
Here's the rough diff of the changes we've attempted: https://github.com/baidu/braft/compare/master...krunal1313:braft:master
Any guidance here is appreciated.
cc @chenzhangyi
@PFZheng @chenzhangyi
Sorry to follow up again: I would really appreciate any pointers you can provide here.
@kishorenc Which kind of error did you get?
This is the error we get with these changes.
W0123 12:03:47.506687 80182 external/com_github_brpc_braft/src/braft/snapshot.cpp:786]
Fail to copy, error_code 22 error_msg [E22][10.13.87.194:8107][E22]Fail to read from path=/var/lib/app/state/snapshot/snapshot_00000000000000000369 filename=db_snapshot/000444.sst :
Invalid argument writer path /var/lib/app/state/snapshot/temp
It seems like, on the remote end, multiple files cannot be accessed at the same time.
I have a cluster whose nodes are far apart geographically, so there is high network latency between them. In this setup, I find snapshot installation from the leader to a follower to be pretty slow for large datasets.
When I increased the FLAGS_raft_max_byte_count_per_rpc value, the transfer became much faster.
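A rough way to see why a larger per-RPC byte count helps on a high-latency link is sketched below. It assumes each chunk is requested only after the previous one has returned, so every RPC pays one full round trip. The helper names and all numbers in the usage note are illustrative, not taken from braft.

```cpp
#include <cassert>
#include <cstdint>

// Number of sequential RPCs needed to move total_bytes in chunks of
// chunk_bytes (ceiling division).
inline std::int64_t rpc_count(std::int64_t total_bytes, std::int64_t chunk_bytes) {
    assert(total_bytes >= 0);
    assert(chunk_bytes > 0);
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}

// Lower bound, in milliseconds, on the time spent waiting on round trips
// alone, ignoring bandwidth and disk time.
inline std::int64_t latency_cost_ms(std::int64_t total_bytes,
                                    std::int64_t chunk_bytes,
                                    std::int64_t rtt_ms) {
    return rpc_count(total_bytes, chunk_bytes) * rtt_ms;
}
```

For example, moving 1 GiB over a link with a 100 ms round trip takes 8192 RPCs (about 819 s of pure waiting) at 128 KiB per RPC, but only 256 RPCs (about 26 s) at 4 MiB per RPC. Under this model the latency cost shrinks in proportion to the chunk size, which matches the speedup described above.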