Open heyciao opened 3 years ago
The issue happened again today while synchronizing my 1st-level backup with my 2nd-level backup (expected, as mentioned above) => before the running rsync could modify it, I took a backup of the affected file from both the source and the destination server => I think that I can now reproduce the problem whenever I want, yuhu.
I initially tried to focus only on "the problem zone", but unfortunately that didn't work. I originally saw that rsync was hanging at ~35%, so I first extracted from the source & destination files only the byte range between 33% and 37%; when running rsync directly against those small files it ran fine. I then did the same for the range 27%-37% => no problem. Then for the range 10%-40% => mostly smooth, but I think it stuttered (hung for a few seconds) somewhere in the middle. Then for the range 0%-40% => this one has now been hanging (at 89% of the total file size) for ~50 minutes ("hanging" meaning "extremely slow" - the console output does update from time to time, but the stated throughput is ~12kB/s). This would mean that the data present in the file before the slow area does matter?
I can provide both the source & destination files to reproduce the problem, but they're quite big (uncompressed size 20GiB each) and they contain private information (an ssh key & DB passwords => I would have to change them), so I cannot just upload them here.
I can make them available for download from my private Nextcloud instance, but would you actually be willing to look at this? I believe that fixing this would help a lot of people (many might be affected, maybe only in tiny parts of their files, and might think by mistake that their slow sync speed is caused by disk/network/CPU). If yes, can I somehow send you the link to my files privately? (Maybe it's obvious how to send private messages on GitHub, but I never had to do it so far and don't know how.)
Cheers, Stefano
See the other issues about multiple -v options. Do you have the same problem when you only use a single "v" for verbosity? We have problems when we use three v's; it seems to be well-defined for one or two v's only.
Yes - I always use only one "v" option anyway, and just now I reproduced the problem using no "v" option at all. The command that I now use to reproduce the problem is:
rsync \
--inplace --partial \
--delete \
-av \
--numeric-ids \
--progress \
-h \
<source_path> \
<server_name>:<target_path>
If I remember correctly the sync does work fine if I do a local sync without going through the network (meaning without "
Rsync is doing checksum searching, which can be slow on a large file. You can use --whole-file (-W) to turn off the rsync algorithm when transferring large files (the faster your network is, the more likely that whole-file is faster).
(Hi Wayne - btw. this is my first post here so thanks a lot for your efforts maintaining & improving this great program!! :+1: )
Rsync is doing checksum searching, which can be slow on a large file.
Yep, after something like 50 hours of tests & investigation & scanning forums & desperation I ended up understanding that, hehe; but in my case, at specific positions in some specific files, it becomes ridiculously slow (something like ~1KiB/s, for hours) :P
You can use --whole-file...
So far my workaround was to first manually delete the affected destination files, thereby triggering a full transfer, and I am planning to basically automate that deletion by using the option --delete-before (but so far I have not tested it).
My understanding is that using --whole-file would have the same result (just transferring the full contents of the source files to the destination), or am I wrong?
About using workarounds: in general, transferring XXX GBs of data is not a big deal for me, because I'm very lucky to have an unlimited FTTH connection with symmetric 1Gbps (though with rsync latency is involved, so realistically in my case that goes down to ~30MiB/s), but other people are surely more limited... .
Additionally, the behaviour/problem is quite hard to understand/identify, so affected people might lose their minds trying to debug their network/OS/disks/whatever, thinking that that's the problem instead of rsync itself (if it really is rsync's problem; I'm ~almost sure, but I admit that I cannot exclude that I might still be overlooking something). Over the long term this problem might backfire by spreading statements like "rsync is terrible/slow/doesn't work/hangs/etc..." all over the Internet (not completely wrong, but not completely right either, as in cases where users don't strictly monitor their disks/network/etc. the issue might really be caused by their disks/network/etc.).
Finally, with the introduction of the new hashing & compression algorithms (xxHash & zstd), rsync'ing big files over the network (with checksum searching) can be eeextremely fast (the max I achieved so far was ~500MiB/s) => for me it would be a pity if that were rendered useless by the hash search / delta-transfer algorithm hanging in certain situations :P
Next steps: I think that having a stable set of files that can reproduce the problem would be useful. Therefore, since yesterday I've been trying to anonymize a set of 2 files that exhibit the problem, so that I can (compress them efficiently and then) make them publicly available.
I therefore wrote a tiny Python script which reads the original files' contents and writes new ones by replacing single bytes with something else.
Yesterday I ran it leaving all bytes with a uint8 value of 0 untouched and replacing all other bytes with the uint8 value 1 => when I then ran the test by rsync'ing these anonymized files there was a slowdown (at the usual 98% completion), but nothing dramatic (the sync rate dropped from ~160MiB/s to ~30MiB/s) => failure.
Today I'm still leaving all bytes with a uint8 value of 0 untouched, but I'm replacing all other values with ~random values (random uint8 values between 97 and 122). My script is definitely not great performance-wise, so it will need a few more hours to finish writing the new test files; then I'll run a test against them as usual.
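For what it's worth, the byte-rewriting pass described above can be sketched like this (a minimal sketch following the scheme just described; the function name and streaming chunk size are mine, not the actual script's):

```python
import random

def anonymize(src: str, dst: str, chunk: int = 1 << 20) -> None:
    """Copy src to dst, leaving zero bytes untouched and scrambling every
    other byte into the uint8 range 97..122 ('a'..'z'), as described above."""
    rng = random.Random()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(chunk)  # stream in 1 MiB pieces to keep memory flat
            if not block:
                break
            fout.write(bytes(0 if b == 0 else rng.randint(97, 122)
                             for b in block))
```

The per-byte Python loop is of course slow on 20GiB files, which matches the "few more hours" runtime mentioned above.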
If that still cannot reproduce the problem then I'll try to think about something else / some other variant.
Cheers
Using --whole-file would work like your manually removing the files you want to update, as long as the mtime and/or size don't match. It avoids all the checksum record matching. The algorithm doesn't hang, it just gets really, really slow when the number of hashed records gets to be huge.
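To illustrate why the number of hashed records matters: assuming the commonly described heuristic that rsync picks a block size around the square root of the file size, capped on recent protocols at 128 KiB (these constants are assumptions for illustration, not rsync's exact code), the record count grows linearly with file size once the cap is hit:

```python
import math

BLOCK_CAP = 128 * 1024  # assumed max block size on recent protocols (128 KiB)

def block_stats(file_size: int) -> tuple[int, int]:
    """Return (block_size, block_count) under the assumed sqrt heuristic."""
    block = min(BLOCK_CAP, max(700, math.isqrt(file_size)))
    count = -(-file_size // block)  # ceiling division
    return block, count

for gib in (1, 20, 200, 4096):
    block, count = block_stats(gib * 2**30)
    print(f"{gib:>5} GiB -> block size {block // 1024:>4} KiB, "
          f"{count:>12,} checksum records")
```

Under these assumptions a 4 TiB file produces tens of millions of checksum records, which is where the matching bogs down.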
I have the same problem rsyncing qcow2 images.
https://bugzilla.redhat.com/show_bug.cgi?id=2021853
The --whole-file option is not really interesting in my case, because my bandwidth is constrained and the destination file is seeded with the previously rsynced qcow2 image from the same VM. So it would be great if rsync could benefit from this a-priori knowledge and only transfer the modified chunks of the disk image.
This is surprising, because for some qemu images it works without noticeable slowdown:
receiving incremental file list
xxx1.qcow2
9,205,055,488 100% 74.25MB/s 0:01:58 (xfr#1, to-chk=0/1)
sent 761,839 bytes received 773,487,430 bytes 5,395,465.29 bytes/sec
total size is 9,205,055,488 speedup is 11.89
receiving incremental file list
xxx2.qcow2
86,285,549,568 100% 110.90MB/s 0:12:21 (xfr#1, to-chk=0/1)
sent 5,846,304 bytes received 2,030,685,184 bytes 2,091,968.66 bytes/sec
total size is 86,285,549,568 speedup is 42.37
receiving incremental file list
xxx3.qcow2
20,404,830,208 100% 73.33MB/s 0:04:25 (xfr#1, to-chk=0/1)
sent 1,211,787 bytes received 2,045,979,510 bytes 6,427,602.19 bytes/sec
total size is 20,404,830,208 speedup is 9.97
receiving incremental file list
xxx4.qcow2
1,075,334,021,120 100% 123.25MB/s 2:18:40 (xfr#1, to-chk=0/1)
sent 73,832,996 bytes received 1,922,166,800 bytes 158,784.44 bytes/sec
total size is 1,075,334,021,120 speedup is 538.74
receiving incremental file list
xxx5.qcow2
67,859,972,096 28% 0.58kB/s 83687:32:54
Is there a way to limit the number of matching records?
I agree with what fbellet wrote. A small update about my attempts to anonymize one of my images so I can make it available to reproduce the problem (sorry for the long period of inactivity - I was relocating and I'm finally almost done...):
Next:
If this doesn't work as well then I'll think again about something else... .
Yuhu! It looks like that last attempt managed to anonymize the files while keeping their ability to replicate the issue :)))
Instructions: btw. to host all files you'll need ~52 GiB of free space (2x ~6 GiB compressed + 2x 20 GiB uncompressed).
1) Go here (my own server)... https://share.segfault.digital/index.php/s/maK3dykYr6aDptw ...and download the 2 files that are available (total download size 10.7 GiB!).
The server should be able to transfer with at least ~25MiB/s. My server's upload/download rates are not limited but please still try not to download them multiple times :)
2) Create somewhere the directories to be used for the test sync, for example:
mkdir -p /tmp/test_slow_rsync/source
mkdir -p /tmp/test_slow_rsync/target
3) Assuming that the downloaded files are located in "~/Downloads", extract the files into the test directories:
gunzip -c ~/Downloads/testfile-0_to_40pct-2.anon-source.gz > /tmp/test_slow_rsync/source/testfile-0_to_40pct-2.anon
gunzip -c ~/Downloads/testfile-0_to_40pct-2.anon-target.gz > /tmp/test_slow_rsync/target/testfile-0_to_40pct-2.anon
Warning: each resulting file will be ~20GiB big.
4) Result:
myuser@MYPC /tmp/test_slow_rsync $ tree -as
.
├── [ 4096] source
│ └── [21474836480] testfile-0_to_40pct-2.anon
└── [ 4096] target
└── [21474836480] testfile-0_to_40pct-2.anon
5) Finally, try to rsync them!
rsync -av --numeric-ids --inplace --partial --progress -h /tmp/test_slow_rsync/source/ MYPC:/tmp/test_slow_rsync/target/
When rsync reports 89% completion, the sync rate should suddenly crash down to a few KB/s, and rsync should be using mostly CPU (with only a few KiB/s of HDD activity):
testfile-0_to_40pct-2.anon
19.22G 89% 16.02kB/s 39:07:22
Concerning the target of the sync: it works as well when using "localhost" instead of your hostname (which is what I'm referring to above with "MYPC"), but a target host is in any case needed to replicate the issue.
(replicated by using rsync version 3.2.3-r4 on Gentoo Linux on a laptop + workstation)
Btw. @fbellet maybe you could try to confirm that my 2 files can reproduce the issue as well on your system?
Cheers
Forcing a protocol version downgrade with --remote-option=--protocol=29 seems to improve things a bit, by handling larger blocks as suggested. The block size is ~500KB with protocol 29, instead of 128KB with protocol 31, on a 200GB file to sync.
However, even with these 4x bigger blocks, this is still very slow on huge files (4TB).
Maybe the block size should not have a fixed maximum value; instead, the number of blocks per file should?
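To put the blocks-per-file idea in numbers - the 131072-block limit below is purely illustrative, not an actual rsync constant:

```python
MAX_BLOCKS = 131072  # illustrative cap on blocks per file, not an rsync constant

def capped_block_size(file_size: int) -> int:
    """Smallest block size keeping the per-file block count <= MAX_BLOCKS."""
    return -(-file_size // MAX_BLOCKS)  # ceiling division

for label, size in (("200 GB", 200 * 10**9), ("4 TB", 4 * 10**12)):
    block = capped_block_size(size)
    print(f"{label}: block size ~{block // 1024} KiB keeps the count "
          f"at most {MAX_BLOCKS:,} blocks")
```

Under such a cap the block size would scale with the file (roughly 1.5 MiB for 200GB, ~29 MiB for 4TB), so the checksum table would stay bounded regardless of file size.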
Forcing a protocol version downgrade with --remote-option=--protocol=29 seems to improve things a bit, by handling larger blocks as suggested. Block size is ~500KB with protocol 29, instead of 128KB with protocol 31 on a 200GB file to sync. However, even with these 4x bigger blocks, this is still very slow on huge files (4TB). Maybe the block size should not have a fixed max value, but instead the number of blocks per file should?
You're right!
When using --remote-option=--protocol=29 with my test files, starting from 89% it still syncs at ~14MB/s instead of the usual few KB/s.
At least this ~workaround keeps it from getting stuck forever..., thx!
In the future I will probably change the sender to segment really large files into chunks, so the transfer will be as efficient as if the large file had been broken up into several more-manageable files, without having to spend I/O on the breakup and reconstruction. I.e., it won't be able to use a checksum match from a late chunk of the file for an early chunk, but it also won't bog down the checksum matching algorithm. The nice thing is that this change should be totally contained in the sender logic; the receiving side doesn't need to know that the sender is limiting its matching to sub-file buckets.
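The proposed sender-side segmentation could be sketched roughly like this (illustrative Python with invented names and an invented segment size, not rsync's actual sender logic):

```python
SEGMENT = 2 * 2**30  # invented example: confine matching to 2 GiB windows

def segments(file_size: int, segment: int = SEGMENT):
    """Yield (start, end) byte ranges; checksum matching would be confined
    to each range, so the hash table stays small even for a multi-TB file,
    at the cost of not matching data that moved across a segment boundary."""
    start = 0
    while start < file_size:
        end = min(start + segment, file_size)
        yield start, end
        start = end

# A 4 TiB file would be matched as 2048 independent 2 GiB segments.
print(len(list(segments(4 * 2**40))))  # 2048
```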
Thanks a lot for thinking about this :) Assuming I understood you correctly: personally, I would like users to be able to optionally set the value of that threshold (e.g. "max_contiguous_chunk/segment_to_sync_at_once"), instead of it being enforced only by the app's code.
Reasons:
Summarized: the usual problem of a globally hardcoded value, which will definitely make somebody unhappy for some reason and others happy for some other reason, or a mix of the two depending on the types of files they're synchronizing (respectively, it is impossible to reliably define a value that covers all use cases).
Btw., about terminology: maybe "chunk" is already used by other parts/options of the program -> "segment" and "bucket" sound good; I'm spontaneously throwing "slice" and "portion" into the mix as candidates too. ("portion" is the one that sounds most high-level to me, but maybe it sounds weird, not sure?)
A disadvantage of the proposed solution is that the transfer can no longer detect blocks that have moved substantially, e.g. for a VM disk image this can happen when running filesystem maintenance inside the VM, such as btrfs balance or defrag.
Wouldn't that be a corner case anyway? The last time I defragged a filesystem was probably 20 years ago, and that's probably a big change at the FS level anyway (meaning that a high cost for resynchronizing the data at a later stage would be justified). Just my personal opinion :)
On some Linux distributions filesystem maintenance is automatic using https://github.com/kdave/btrfsmaintenance and active by default.
Any updates/news/tests on this issue? I'm pretty interested in syncing big images too.
While we wait for the resolution of rsync's large file handling issue, I've developed an intermediate solution called QuickChunk for my local LAN backup needs. QuickChunk is designed to efficiently handle the synchronization of large files.
You can find more about QuickChunk and its utility here: https://github.com/ch-f/QuickChunk
Hey there,
we are currently implementing a backup system for our Virtual Machine PLOOP images using rsync. Unfortunately, we also seem to be encountering this issue with our larger images (1-5TB).
Initially the transfer starts off smoothly and quickly, but as it progresses it slows down, and the CPU load of the rsync process on the source server gradually approaches 100%. We are using the latest rsync with all possible optimizations activated, except for asm-roll, which causes rsync to crash:
# ./rsync -V
rsync version 3.2.7 protocol version 31
Copyright (C) 1996-2022 by Andrew Tridgell, Wayne Davison, and others.
Web site: https://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, symlinks, symtimes, hardlinks, hardlink-specials,
hardlink-symlinks, IPv6, atimes, batchfiles, inplace, append, ACLs,
xattrs, optional secluded-args, iconv, prealloc, stop-at, no crtimes
Optimizations:
SIMD-roll, no asm-roll, openssl-crypto, asm-MD5
Checksum list:
xxh128 xxh3 xxh64 (xxhash) md5 md4 sha1 none
Compress list:
zstd lz4 zlibx zlib none
Daemon auth list:
sha512 sha256 sha1 md5 md4
@WayneD - Any chance that the "chunking" will arrive sometime soon? 🤔
I guess/hope this would help big times! 😊
Thank you, bye from Austria Andreas
Hi
Summary / Overview: I have been using rsync for a long time to sync many ~small individual files (e.g. all the contents of an OS) and a few big files (e.g. the "images" of some OSs that I run through QEMU/KVM).
My problem is that rsync hangs when processing some (not all) of the big files of my OS images; when that happens, the process consumes 100% of one of the available CPU cores for hours and does nothing else (at least nothing related to disk or network that I can see).
This has been happening for many years, respectively with previous versions.
I did find posts on the Internet from people stating that they had the same problem, but often those posts were indirectly related to other problems (e.g. the ssh connection being cut, problems with the filesystem being used, missing free space, whatever...), so there seems to be a lot of "noise" in this context => still, personally I think that ~5% of those posts are about the same problem that I have.
I remember that in the past (1 to 3 or more years ago), when this happened, a few times I let it run throughout the night, and in the morning I saw that it had managed to finish or at least make progress, so apparently the problem is not that it hangs forever, just that it takes ages to do (whatever it wants to do).
As a workaround I usually just delete the already existing files first (then it always works), but that's not very nice, as I then have to transfer all the contents again... .
~Recently rsync (as a program/project) had some activity (e.g. integrated zstd support - just great & fantastic!!! Thank you :) ) and I was hoping that the problem would be solved, but during the last syncs all of this happened again and again, which is why I decided to ask here for help... .
Today's example
1) Today I executed rsync as follows (to sync the big OS images used by my remote server running QEMU/KVM):
I then executed on the remote server...
strace -t -p 5719 > rsync-strace-no_vvvv-02.txt 2>&1
...(5719 was the PID of the rsync process that was running).
2) It synced the first file (25GiB) successfully. Then it quickly & successfully synced the first ~34% of the second file (50GiB), and then it hung => as usual, on the remote server 1 CPU core was at 100% and I could not see any obvious disk ("nmon" running on the remote server) or network ("gkrellm" running on my local server) activity.
3) The output of "strace" on the remote server against the rsync process (which was using 100% of 1 CPU core) showed a lot of different stuff before rsync started hanging; once it started hanging, it showed only "read" syscalls (I'm not a "pro" - I only read that strace apparently shows only "syscalls"?)...
...and then it went on like that for a looong time...
...and then I killed rsync.
4) Before that, I tried running rsync with "-vvvv" (redirecting its stdout/stderr to a file) => that generated a 25GiB file (before I killed the rsync process) => looking at that output, this area seems to show the transition from rsync's "normal" activity to what I posted above:
...and so on for a few kilometers.
Remark about "sum=00000000": IF "sum=00000000" is related to some kind of checksum of the data being synced (and this is a wild guess / pure speculation, I have absolutely no clue), then that might make sense, as I usually create the initial raw image files that are then used to install OSs, for example as follows:
dd if=/dev/zero of=my_image_file.img bs=1M count=10240
Therefore, all OS images that I'm synchronizing are likely to contain long sequences of bytes all set to 0 (where whatever is running in that OS has not overwritten them with something else in the meantime).
Mixed facts/observations and thoughts
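The "sum=00000000" observation is in fact consistent with a sum-based rolling checksum: for an all-zero block, every term of such a sum is zero, so the weak checksum comes out as 0 for every zero-filled block. A simplified sketch (in the style of rsync's weak checksum, but not its exact implementation):

```python
def weak_checksum(block: bytes) -> int:
    """Simplified sum-based rolling checksum in the style rsync uses
    (two 16-bit running sums; a sketch, not rsync's exact code)."""
    s1 = s2 = 0
    n = len(block)
    for i, b in enumerate(block):
        s1 = (s1 + b) & 0xFFFF
        s2 = (s2 + (n - i) * b) & 0xFFFF
    return (s2 << 16) | s1

# One 128 KiB block of zeros, as produced by dd if=/dev/zero:
print(f"sum={weak_checksum(bytes(128 * 1024)):08x}")  # sum=00000000
```

If every zero-filled block produces the same weak checksum, they would all land in the same hash-table bucket, and the matcher would have to check a huge number of candidate positions against that one bucket - which could explain the CPU-bound ~1KiB/s phase on images created from /dev/zero.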
Would it please be possible to look into this? I have exhausted all rsync parameters that could potentially offer a workaround for this problem, and doing a full download of XX GBs each time I take a backup makes me feel sad, hehe (but it's true!).
Does anybody else feel like having the same problem? (if yes then pls. post here, but only after being 99% sure that what I wrote matches your own problem, thank you :P )
Cheers, from Switzerland! :)