bcpierce00 / unison

Unison file synchronizer
GNU General Public License v3.0

propagation (of new files in a directory?) uses 6 stack frames per file, rather than iterating (NOT ABOUT ocaml 5.x!!) #990

Closed gdt closed 3 months ago

gdt commented 5 months ago

As reported by @olafhering on unison-users@ after discussion, propagation seems to use excessive stack depth. The repro recipe is roughly:

Looking at the stack trace below, it smells like there is a recursive function that does not use tail recursion when it should, but that is speculation. The bug is that while 350K files might reasonably take 350K * A bytes of RAM for some constant A, they should not result in stack usage proportional to the number of files.
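To illustrate the suspicion above (this is my own minimal sketch, not Unison code): a non-tail-recursive walk over N items keeps N frames live at once, while the tail-recursive form runs in constant stack space.

```ocaml
(* Non-tail-recursive: the addition happens *after* the recursive
   call returns, so every list element keeps a stack frame alive. *)
let rec sum_non_tail = function
  | [] -> 0
  | x :: rest -> x + sum_non_tail rest

(* Tail-recursive: the recursive call is the last thing the function
   does, so the compiler can reuse the current frame. *)
let sum_tail lst =
  let rec go acc = function
    | [] -> acc
    | x :: rest -> go (acc + x) rest
  in
  go 0 lst

let () =
  (* Scale comparable to the report: 350K items. sum_non_tail would
     need ~350K frames here; sum_tail uses a constant number. *)
  let big = List.init 350_000 (fun _ -> 1) in
  Printf.printf "%d\n" (sum_tail big)
```

If the propagation code had this shape, stack depth proportional to the file count would follow directly.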

#0  caml_hash (count=<optimized out>, limit=<optimized out>, seed=<optimized out>, obj=15) at hash.c:196
#1  0x000055bdf7673810 in camlStdlib__Hashtbl__key_index_1273 () at hashtbl.ml:504
#2  0x000055bdf7673ac2 in camlStdlib__Hashtbl__find_1304 () at hashtbl.ml:538
#3  0x000055bdf75efcdc in camlAbort__errorCountCell_421 () at src/abort.ml:40
#4  0x000055bdf75eff98 in camlAbort__check_547 () at src/abort.ml:46
#5  0x000055bdf760aed2 in camlFiles__fun_4454 () at src/files.ml:665
#6  0x000055bdf7627853 in camlLwt_util__fun_607 () at src/lwt/lwt_util.ml:77
#7  0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#8  0x000055bdf76277a2 in camlLwt_util__run_in_region_1_449 () at src/lwt/lwt.ml:95
#9  0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#10 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#11 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#12 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#13 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#14 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
...
#285323 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285324 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285325 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285326 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
#285327 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285328 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285329 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285330 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285331 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285332 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
#285333 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285334 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285335 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285336 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285337 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285338 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
#285339 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285340 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285341 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285342 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285343 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285344 0x000055bdf76277e8 in camlLwt_util__fun_616 () at src/lwt/lwt_util.ml:77
#285345 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285346 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285347 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285348 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285349 0x000055bdf7627da3 in camlLwt_unix_impl__restart_threads_621 () at src/lwt/lwt.ml:69
#285350 0x000055bdf762838c in camlLwt_unix_impl__run_715 () at src/lwt/generic/lwt_unix_impl.ml:125
#285351 0x000055bdf76140ae in camlUicommon__transportItems_1958 () at src/uicommon.ml:736
#285352 0x000055bdf74e2707 in camlDune__exe__Uigtk3__synchronize_8710 () at src/uigtk3.ml:3736
#285353 0x000055bdf74b4482 in camlDune__exe__Uigtk3__getLock_945 () at src/uigtk3.ml:138
#285354 0x000055bdf74f53d1 in camlGtkSignal__safe_call_inner_801 () at src/gtkSignal.ml:61
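The repeating restart / List.iter / apply frames in the trace suggest a chain of promise completions, where each waiter callback synchronously restarts the next promise before returning. A minimal sketch (my own illustration under that assumption, not Lwt's actual implementation):

```ocaml
(* A toy "promise" whose waiters are run synchronously on restart.
   If a waiter itself restarts the next promise, each completion
   nests another restart/List.iter pair on the stack, mirroring the
   repeating frame pattern in the gdb trace above. *)
type promise = { mutable waiters : (unit -> unit) list }

let restart p =
  let ws = p.waiters in
  p.waiters <- [];
  (* List.iter itself is not the problem; the nesting comes from each
     callback calling restart before this iteration returns. *)
  List.iter (fun w -> w ()) ws

(* Chain n promises so completing the first completes them all,
   counting how many callbacks fired. *)
let chain_and_run n =
  let ps = Array.init n (fun _ -> { waiters = [] }) in
  let hits = ref 0 in
  for i = 0 to n - 2 do
    let next = ps.(i + 1) in
    ps.(i).waiters <- [ (fun () -> incr hits; restart next) ]
  done;
  ps.(n - 1).waiters <- [ (fun () -> incr hits) ];
  restart ps.(0);  (* builds roughly n nested frames before unwinding *)
  !hits

let () = Printf.printf "%d\n" (chain_and_run 10_000)
```

With one chained promise per file, this shape would produce a stack depth proportional to the number of files, consistent with ~285K frames for ~350K files.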
tleedjarv commented 5 months ago

I am unable to reproduce this stack growth.

@olafhering could you please report the full version you were using, including the compiler version. Are you using a different version on the remote end? Please also report any non-default preferences that might affect the syncing process (you can leave out prefs such as ignore, paths, etc.).

Is the remote (in this case, the receiving) end significantly slower than the sending side, or is there high latency in the network?

If you can reproduce the crash with the GUI, could you try and see if the stack also starts growing with the text UI?

If you can reproduce the crash then some things to try:

olafhering commented 5 months ago

Both unison binaries are identical. They are built with dune, using OCaml 4.14.1. Both systems are in the same LAN, connected via 1G.

While looking at the .prf file, I realized there are ignore = Path {a/large-dir} lines. I forgot about them while renaming a to b. As a result the new initial sync processed b/large-dir. The only "relevant/non-standard" entry in .prf seems to be times=true.
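For reference, the preferences mentioned would look like this in the .prf file (the lines are taken from the ones quoted above; `#` starts a comment in Unison profiles):

```
# The rename of a to b is the relevant detail: this ignore
# no longer matches, so b/large-dir is synced in full.
ignore = Path {a/large-dir}
times = true
```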

I will see if I can reproduce it again with this new finding.

tleedjarv commented 5 months ago

The ignore preferences should not play any role here (neither should times).

I guess you already answered it, but just in case: have you used any preference like maxthreads or stream?

If you manage to reliably reproduce the issue, could you then try with version 2.53.0 and see if it works there?

tleedjarv commented 5 months ago

I've done some more debugging and can see the same code being executed, yet the stack is not growing (and I'm testing with far larger directory trees). I see this as unlikely, but it could be that your binary is not properly optimized. If you manage to reproduce the issue, could you try with a binary built in a different environment (for example, downloaded from GH)? More precisely, you should try with a binary built without opam, or with a completely clean opam env. I know this is far-fetched, but it could be that you have some compiler settings that produce a non-optimized or differently optimized binary. (Even if that turns out to be the case, it could still be a bug to fix.)

olafhering commented 5 months ago

The binaries come from https://build.opensuse.org/package/show/Archiving:unison/unison and will always be a clean build using https://build.opensuse.org/project/show/Archiving:unison:buildrequires.

It will take a while until I find the time to reproduce it.

tleedjarv commented 5 months ago

I can easily reproduce with your build and also with the 2.53.3 build from GH. I was myself testing with the latest code (unreleased, but available on GH), and this worked even with a tiny stack. It is thus likely that the next release fixes this issue, but it would be great if you could test the latest code already now.

gdt commented 4 months ago

I'm going to close this, as I believe it is fixed in the current code base. If you can repro with 2.53.4 (RC2 today, release tomorrow), please post and I'm happy to reopen.

norbusan commented 4 months ago

I was told on unison-users to report here a similar case:

Version 2.53.4 (ocaml 5.1.0) OS Arch Linux

I got reproducible crashes syncing my laptop against my desktop, both running Linux. The crash happened while the sync was going on; no user interaction was necessary. I could repeat it three times, and all three times the backtrace in gdb showed caml_try_realloc_stack as the top entry.

I can also confirm that using the text ui unison worked without any problem.

I attach three different traces I got from the three crashes.

gdb-trace1.log gdb-trace2.log gdb-trace3.log

gdt commented 3 months ago

We now believe @norbusan's comment is the same thing as #1006. I am therefore closing #990. @norbusan: please follow up in #1006 if useful/appropriate.

tleedjarv commented 3 months ago

I have prepared a tentative fix which I believe may be common to both this ticket and #1006, even though Arch Linux seems to be hit harder and might still be a different issue.

@olafhering if you are interested in testing then see details at https://github.com/bcpierce00/unison/issues/1006#issuecomment-2028792202