Open heyciao opened 3 years ago
The issue happened again today while synchronizing my 1st-level backup with my 2nd-level backup (expected, as mentioned above) => before the running rsync could modify it, I took a backup of the affected file from both the source and the destination server => I think that I can now reproduce the problem whenever I want, yuhu.
I initially tried to focus only on "the problem zone", but unfortunately that didn't work. I originally saw that rsync was hanging at ~35%, so I first extracted from the source & destination files only the byte range between 33% and 37%; when running rsync directly against those small files it ran fine. I then did the same for the range 27%-37% => no problem. Then for the range 10%-40% => mostly smooth, but I think it stuttered (hung for a few seconds) somewhere in the middle. Then for the range 0%-40% => this one has now been hanging (at 89% of the total file size) for ~50 minutes ("hanging" meaning "extremely slow" - the console output does update from time to time, but the stated throughput is ~12kB/s). This would mean that the data present in the file before the slow area does matter?
I can provide both the source & destination files to reproduce the problem, but they're quite big (uncompressed size 20GiB each) and they contain private information (an ssh key & DB passwords => I would have to change them), so I cannot just upload them here.
I can make them available for download from my private Nextcloud instance, but would you actually be willing to look at this? I believe that fixing this would help a lot of people (many might be affected, maybe only in tiny parts of their files, and might think by mistake that their slow sync speed is caused by disk/network/CPU). If yes, can I somehow send you the link to my files privately? (Maybe it's obvious how to send private messages on GitHub, but I never had to do it so far and don't know how.)
Cheers, Stefano
See the other issues about multiple -v options. Do you have the same problem when you only use a single "v" for verbosity? We have problems when we use three v's; it seems to be well-defined for one or two v's only.
Yes - I always use only one "v" option anyway, and just now I reproduced the problem using no "v" option at all. The command that I now use to reproduce the problem is:
rsync \
--inplace --partial \
--delete \
-av \
--numeric-ids \
--progress \
-h \
<source_path> \
<server_name>:<target_path>
If I remember correctly the sync does work fine if I do a local sync without going through the network (meaning without "
Rsync is doing checksum searching, which can be slow on a large file. You can use --whole-file (-W) to turn off the rsync algorithm when transferring large files (the faster your network is, the more likely that whole-file is faster).
(Hi Wayne - btw. this is my first post here so thanks a lot for your efforts maintaining & improving this great program!! :+1: )
Rsync is doing checksum searching, which can be slow on a large file.
Yep, after something like 50 hours of tests & investigation & scanning forums & desperation I ended up understanding that, hehe; but in my case, at specific positions in some specific files, it becomes ridiculously slow (something like ~1KiB/s, for hours) :P
You can use --whole-file...
So far my workaround was to first manually delete the affected destination files, thereby triggering a full transfer, and I am planning to basically automate that deletion by using the option --delete-before (but so far I have not tested it).
My understanding is that using --whole-file would have the same result (just transferring the full contents of the source files to the destination), or am I wrong?
About using workarounds: in general, transferring XXX GBs of data is not a big deal for me, because I'm very lucky to have an unlimited FTTH connection with symmetric 1Gbps (though with rsync latency is involved, so realistically in my case that goes down to ~30MiB/s), but other people are surely more limited... .
Additionally, the behaviour/problem is quite hard to understand/identify, so affected people might lose their minds trying to debug their network/OS/disks/whatever, thinking that that's the problem instead of rsync itself (if it really is rsync's problem; I'm ~almost sure, but I admit that I cannot exclude that I might still be overlooking something). Over the long term this problem might backfire by spreading statements like "rsync is terrible/slow/doesn't work/hangs/etc..." all over the Internet (not completely wrong, but not completely right either, as in cases where users don't strictly monitor their disks/network/etc. the issue might really be caused by their disks/network/etc.).
Finally, with the introduction of the new hashing & compression algorithms (xxHash & zstd), rsync'ing big files over the network (with checksum searching) can be eeextremely fast (the max I achieved so far was ~500MiB/s) => for me it would be a pity if that were rendered useless by the hash search / delta-transfer algorithm hanging in certain situations :P
Next steps: I think that having a stable set of files that can reproduce the problem would be useful. Therefore, since yesterday I've been trying to anonymize a set of 2 files that exhibit the problem, so that I can (compress them efficiently and then) make them publicly available.
I therefore wrote a tiny Python script which reads the original files' contents and writes new ones by replacing single bytes with something else.
Yesterday I ran it leaving all bytes with a uint8 value of 0 untouched and replacing all other bytes with the uint8 value 1 => when I then ran the test by rsync'ing these anonymized files there was a slowdown (at the usual 98% completion), but nothing dramatic (the sync rate dropped from ~160MiB/s to ~30MiB/s) => failure.
Today I'm still leaving all bytes with a uint8 value of 0 untouched, but I'm replacing all other values with ~random values (random uint8 values between 97 and 122). My script is definitely not great performance-wise, so it will need a few more hours to finish writing the new test files; then I'll run a test against them as usual.
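For what it's worth, the byte-rewriting pass described above can be sketched like this (a minimal sketch following the scheme just described; the function name and streaming chunk size are mine, not the actual script's):

```python
import random

def anonymize(src: str, dst: str, chunk: int = 1 << 20) -> None:
    """Copy src to dst, leaving zero bytes untouched and scrambling every
    other byte into the uint8 range 97..122 ('a'..'z'), as described above."""
    rng = random.Random()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(chunk)  # stream in 1 MiB pieces to keep memory flat
            if not block:
                break
            fout.write(bytes(0 if b == 0 else rng.randint(97, 122)
                             for b in block))
```

The per-byte Python loop is of course slow on 20GiB files, which matches the "few more hours" runtime mentioned above.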
If that still cannot reproduce the problem then I'll try to think about something else / some other variant.
Cheers
Using --whole-file would work like your manually removing the files you want to update, as long as the mtime and/or size don't match. It avoids all the checksum record matching. The algorithm doesn't hang, it just gets really, really slow when the number of hashed records gets to be huge.
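To illustrate why the number of hashed records matters: assuming the commonly described heuristic that rsync picks a block size around the square root of the file size, capped on recent protocols at 128 KiB (these constants are assumptions for illustration, not rsync's exact code), the record count grows linearly with file size once the cap is hit:

```python
import math

BLOCK_CAP = 128 * 1024  # assumed max block size on recent protocols (128 KiB)

def block_stats(file_size: int) -> tuple[int, int]:
    """Return (block_size, block_count) under the assumed sqrt heuristic."""
    block = min(BLOCK_CAP, max(700, math.isqrt(file_size)))
    count = -(-file_size // block)  # ceiling division
    return block, count

for gib in (1, 20, 200, 4096):
    block, count = block_stats(gib * 2**30)
    print(f"{gib:>5} GiB -> block size {block // 1024:>4} KiB, "
          f"{count:>12,} checksum records")
```

Under these assumptions a 4 TiB file produces tens of millions of checksum records, which is where the matching bogs down.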
I have the same problem rsyncing qcow2 images.
https://bugzilla.redhat.com/show_bug.cgi?id=2021853
The --whole-file option is not really interesting in my case, because my bandwidth is constrained and the destination file is seeded with the previously rsynced qcow2 image from the same VM. So it would be great if rsync could benefit from this a-priori knowledge and only transfer the modified chunks of the disk image.
This is surprising, because for some qemu images it works without noticeable slowdown:
receiving incremental file list
xxx1.qcow2
9,205,055,488 100% 74.25MB/s 0:01:58 (xfr#1, to-chk=0/1)
sent 761,839 bytes received 773,487,430 bytes 5,395,465.29 bytes/sec
total size is 9,205,055,488 speedup is 11.89
receiving incremental file list
xxx2.qcow2
86,285,549,568 100% 110.90MB/s 0:12:21 (xfr#1, to-chk=0/1)
sent 5,846,304 bytes received 2,030,685,184 bytes 2,091,968.66 bytes/sec
total size is 86,285,549,568 speedup is 42.37
receiving incremental file list
xxx3.qcow2
20,404,830,208 100% 73.33MB/s 0:04:25 (xfr#1, to-chk=0/1)
sent 1,211,787 bytes received 2,045,979,510 bytes 6,427,602.19 bytes/sec
total size is 20,404,830,208 speedup is 9.97
receiving incremental file list
xxx4.qcow2
1,075,334,021,120 100% 123.25MB/s 2:18:40 (xfr#1, to-chk=0/1)
sent 73,832,996 bytes received 1,922,166,800 bytes 158,784.44 bytes/sec
total size is 1,075,334,021,120 speedup is 538.74
receiving incremental file list
xxx5.qcow2
67,859,972,096 28% 0.58kB/s 83687:32:54
Is there a way to limit the number of matching records?
I agree with what fbellet wrote. A small update about my attempts to anonymize one of my images so I can make it available to reproduce the problem (sorry for the long period of inactivity - I was relocating and I'm finally almost done...):
Next:
If this doesn't work as well then I'll think again about something else... .
Yuhu! It looks like that last attempt managed to anonymize the files while keeping their ability to replicate the issue :)))
Instructions: btw. to host all files you'll need ~52 GiB of free space (2x ~6 GiB compressed + 2x 20 GiB uncompressed).
1) Go here (my own server)... https://share.segfault.digital/index.php/s/maK3dykYr6aDptw ...and download the 2 files that are available (total download size 10.7 GiB!).
The server should be able to transfer with at least ~25MiB/s. My server's upload/download rates are not limited but please still try not to download them multiple times :)
2) Create somewhere the directories to be used for the test sync, for example:
mkdir -p /tmp/test_slow_rsync/source
mkdir -p /tmp/test_slow_rsync/target
3) Assuming that the downloaded files are located in "~/Downloads", extract the files into the test directories:
gunzip -c ~/Downloads/testfile-0_to_40pct-2.anon-source.gz > /tmp/test_slow_rsync/source/testfile-0_to_40pct-2.anon
gunzip -c ~/Downloads/testfile-0_to_40pct-2.anon-target.gz > /tmp/test_slow_rsync/target/testfile-0_to_40pct-2.anon
Warning: each resulting file will be ~20GiB big.
4) Result:
myuser@MYPC /tmp/test_slow_rsync $ tree -as
.
├── [ 4096] source
│ └── [21474836480] testfile-0_to_40pct-2.anon
└── [ 4096] target
└── [21474836480] testfile-0_to_40pct-2.anon
5) Finally, try to rsync them!
rsync -av --numeric-ids --inplace --partial --progress -h /tmp/test_slow_rsync/source/ MYPC:/tmp/test_slow_rsync/target/
When rsync reports 89% completion, the sync rate should suddenly crash down to a few KB/s, and rsync should be using mostly CPU (with only a few KiB/s of HDD activity):
testfile-0_to_40pct-2.anon
19.22G 89% 16.02kB/s 39:07:22
Concerning the target of the sync: it works as well when using "localhost" instead of your hostname (which is what I'm referring to above with "MYPC"), but a target host is in any case needed to replicate the issue.
(replicated by using rsync version 3.2.3-r4 on Gentoo Linux on a laptop + workstation)
Btw. @fbellet maybe you could try to confirm that my 2 files can reproduce the issue as well on your system?
Cheers
Forcing a protocol version downgrade with --remote-option=--protocol=29 seems to improve things a bit, by handling larger blocks as suggested. The block size is ~500KB with protocol 29, instead of 128KB with protocol 31, on a 200GB file to sync.
However, even with these 4x bigger blocks, this is still very slow on huge files (4TB).
Maybe the block size should not have a fixed maximum value; instead, the number of blocks per file should?
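To put the blocks-per-file idea in numbers - the 131072-block limit below is purely illustrative, not an actual rsync constant:

```python
MAX_BLOCKS = 131072  # illustrative cap on blocks per file, not an rsync constant

def capped_block_size(file_size: int) -> int:
    """Smallest block size keeping the per-file block count <= MAX_BLOCKS."""
    return -(-file_size // MAX_BLOCKS)  # ceiling division

for label, size in (("200 GB", 200 * 10**9), ("4 TB", 4 * 10**12)):
    block = capped_block_size(size)
    print(f"{label}: block size ~{block // 1024} KiB keeps the count "
          f"at most {MAX_BLOCKS:,} blocks")
```

Under such a cap the block size would scale with the file (roughly 1.5 MiB for 200GB, ~29 MiB for 4TB), so the checksum table would stay bounded regardless of file size.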
Forcing a protocol version downgrade with --remote-option=--protocol=29 seems to improve things a bit, by handling larger blocks as suggested. Block size is ~500KB with protocol 29, instead of 128KB with protocol 31 on a 200GB file to sync. However, even with these 4x bigger blocks, this is still very slow on huge files (4TB). Maybe the block size should not have a fixed max value, but instead the number of blocks per file should?
You're right!
When using --remote-option=--protocol=29 with my test files, starting from 89% it still syncs at ~14MB/s instead of the usual few KB/s.
At least this ~workaround keeps it from getting stuck forever..., thx!
In the future I will probably change the sender to segment really large files into chunks, so the transfer will be as efficient as if the large file had been broken up into several more-manageable files, without having to spend I/O on the breakup and reconstruction. I.e., it won't be able to use a checksum match from a late chunk of the file for an early chunk, but it also won't bog down the checksum matching algorithm. The nice thing is that this change should be totally contained in the sender logic; the receiving side doesn't need to know that the sender is limiting its matching to sub-file buckets.
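The proposed sender-side segmentation could be sketched roughly like this (illustrative Python with invented names and an invented segment size, not rsync's actual sender logic):

```python
SEGMENT = 2 * 2**30  # invented example: confine matching to 2 GiB windows

def segments(file_size: int, segment: int = SEGMENT):
    """Yield (start, end) byte ranges; checksum matching would be confined
    to each range, so the hash table stays small even for a multi-TB file,
    at the cost of not matching data that moved across a segment boundary."""
    start = 0
    while start < file_size:
        end = min(start + segment, file_size)
        yield start, end
        start = end

# A 4 TiB file would be matched as 2048 independent 2 GiB segments.
print(len(list(segments(4 * 2**40))))  # 2048
```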
Thanks a lot for thinking about this :) Assuming I understood you correctly: personally, I would like users to be able to optionally set the value of that threshold (e.g. "max_contiguous_chunk/segment_to_sync_at_once"), instead of it being enforced only by the app's code.
Reasons:
Summarized: the usual problem of a globally hardcoded value, which will definitely make somebody unhappy for some reason and others happy for some other reason, or a mix of the two depending on the types of files they're synchronizing (respectively, it is impossible to reliably define a value that covers all use cases).
Btw., about terminology: maybe "chunk" is already used by other parts/options of the program -> "segment" and "bucket" sound good; I'm spontaneously throwing "slice" and "portion" into the mix as candidates too. ("portion" is the one that sounds most high-level to me, but maybe it sounds weird, not sure?)
A disadvantage of the proposed solution is that the transfer can no longer detect blocks that have moved substantially, e.g. for a VM disk image this can happen when running filesystem maintenance inside the VM, such as btrfs balance or defrag.
Wouldn't that be a corner case anyway? The last time I defragged a filesystem was probably 20 years ago, and that's probably a big change at the FS level anyway (meaning that a high cost for resynchronizing the data at a later stage would be justified). Just my personal opinion :)
On some Linux distributions filesystem maintenance is automatic using https://github.com/kdave/btrfsmaintenance and active by default.
Any updates/news/tests on this issue? I'm pretty interested in syncing big images too.
While we wait for the resolution of rsync's large file handling issue, I've developed an intermediate solution called QuickChunk for my local LAN backup needs. QuickChunk is designed to efficiently handle the synchronization of large files.
You can find more about QuickChunk and its utility here: https://github.com/ch-f/QuickChunk
Hey there,
we are currently implementing a backup system for our Virtual Machine PLOOP images using rsync. Unfortunately, we also seem to be encountering this issue with our larger images (1-5TB).
Initially the transfer starts off smoothly and quickly, but as it progresses it slows down, and the CPU load of the rsync process on the source server gradually approaches 100%. We are using the latest rsync with all possible optimizations activated, except for asm-roll, which causes rsync to crash:
# ./rsync -V
rsync version 3.2.7 protocol version 31
Copyright (C) 1996-2022 by Andrew Tridgell, Wayne Davison, and others.
Web site: https://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, symlinks, symtimes, hardlinks, hardlink-specials,
hardlink-symlinks, IPv6, atimes, batchfiles, inplace, append, ACLs,
xattrs, optional secluded-args, iconv, prealloc, stop-at, no crtimes
Optimizations:
SIMD-roll, no asm-roll, openssl-crypto, asm-MD5
Checksum list:
xxh128 xxh3 xxh64 (xxhash) md5 md4 sha1 none
Compress list:
zstd lz4 zlibx zlib none
Daemon auth list:
sha512 sha256 sha1 md5 md4
@WayneD - Any chance that the "chunking" will arrive sometime soon? 🤔
I guess/hope this would help big times! 😊
Thank you, bye from Austria Andreas
Hi
Summary / Overview: I have been using rsync for a long time to sync many ~small individual files (e.g. all the contents of an OS) and a few big files (e.g. the "images" of some OSs that I run through QEMU/KVM).
My problem is that rsync hangs when processing some (not all) of the big files of my OS images; when that happens, the process consumes 100% of one of the available CPU cores for hours and does nothing else (at least nothing related to disk or network that I can see).
This has been happening for many years, respectively with previous versions.
I did find posts on the Internet from people stating that they had the same problem, but often those posts were indirectly related to other problems (e.g. the ssh connection being cut, problems with the filesystem being used, missing free space, whatever...), so there seems to be a lot of "noise" in this context => still, personally I think that ~5% of those posts are about the same problem that I have.
I remember that in the past (1 to 3 or more years ago), when this happened, a few times I let it run throughout the night, and in the morning I saw that it had managed to finish or at least make progress, so apparently the problem is not that it hangs forever, just that it takes ages to do (whatever it wants to do).
As a workaround I usually just delete the already existing files first (then it always works), but that's not very nice, as I then have to transfer all the contents again... .
~Recently rsync (as a program/project) had some activity (e.g. integrated zstd support - just great & fantastic!!! Thank you :) ) and I was hoping that the problem would be solved, but during the last syncs all of this happened again and again, which is why I decided to ask here for help... .
Today's example
1) Today I executed rsync as follows (to sync the big OS images used by my remote server running QEMU/KVM):
I then executed on the remote server...
strace -t -p 5719 > rsync-strace-no_vvvv-02.txt 2>&1
...(5719 was the PID of the rsync process that was running).
2) It synced the first file (25GiB) successfully. Then it quickly & successfully synced the first ~34% of the second file (50GiB), and then it hung => as usual, on the remote server 1 CPU core was at 100% and I could not see any obvious disk ("nmon" running on the remote server) or network ("gkrellm" running on my local server) activity.
3) The output of "strace" on the remote server against the rsync process (which was using 100% of 1 CPU core) showed a lot of different stuff before rsync started hanging; once it started hanging, it showed only "read" syscalls (I'm not a "pro" - I only read that strace apparently shows only "syscalls"?)...
...and then it went on like that for a looong time...
...and then I killed rsync.
4) Before that, I tried running rsync with "-vvvv" (redirecting its stdout/stderr to a file) => that generated a 25GiB file (before I killed the rsync process) => looking at that output, this area seems to show the transition from rsync's "normal" activity to what I posted above:
...and so on for a few kilometers.
Remark about "sum=00000000": IF "sum=00000000" is related to some kind of checksum of the data being synced (and this is a wild guess / pure speculation, I have absolutely no clue), then that might make sense, as I usually create the initial raw image files that are then used to install OSs, for example as follows:
dd if=/dev/zero of=my_image_file.img bs=1M count=10240
Therefore, all OS images that I'm synchronizing are likely to contain long sequences of bytes all set to 0 (where whatever is running in that OS has not overwritten them with something else in the meantime).
Mixed facts/observations and thoughts
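The "sum=00000000" observation is in fact consistent with a sum-based rolling checksum: for an all-zero block, every term of such a sum is zero, so the weak checksum comes out as 0 for every zero-filled block. A simplified sketch (in the style of rsync's weak checksum, but not its exact implementation):

```python
def weak_checksum(block: bytes) -> int:
    """Simplified sum-based rolling checksum in the style rsync uses
    (two 16-bit running sums; a sketch, not rsync's exact code)."""
    s1 = s2 = 0
    n = len(block)
    for i, b in enumerate(block):
        s1 = (s1 + b) & 0xFFFF
        s2 = (s2 + (n - i) * b) & 0xFFFF
    return (s2 << 16) | s1

# One 128 KiB block of zeros, as produced by dd if=/dev/zero:
print(f"sum={weak_checksum(bytes(128 * 1024)):08x}")  # sum=00000000
```

If every zero-filled block produces the same weak checksum, they would all land in the same hash-table bucket, and the matcher would have to check a huge number of candidate positions against that one bucket - which could explain the CPU-bound ~1KiB/s phase on images created from /dev/zero.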
Would it please be possible to look into this? I have exhausted all rsync parameters that could potentially offer a workaround for this problem, and doing a full download of XX GBs each time I take a backup makes me feel sad, hehe (but it's true!).
Does anybody else feel like having the same problem? (if yes then pls. post here, but only after being 99% sure that what I wrote matches your own problem, thank you :P )
Cheers, from Switzerland! :)