gdt closed this issue 3 months ago.
I am unable to reproduce this stack growth.
@olafhering could you please report the full version you were using, including the version of the compiler used. Are you using a different version on the remote end? Please also report any non-default preferences that might impact the syncing process (you can leave out prefs such as ignore, paths, etc.).
Is the remote (in this case, the receiving) end significantly slower than the sending side, or is there high latency in the network?
If you can reproduce the crash with the GUI, could you check whether the stack also starts growing with the text UI?
If you can reproduce the crash, here are some things to try:
Both unison binaries are identical. They are built with dune, using OCaml 4.14.1. Both systems are on the same LAN, connected via a 1 Gbit/s link.
While looking at the `.prf` file, I realized there are `ignore = Path {a/large-dir}` lines. I forgot about them while renaming `a` to `b`. As a result, the new initial sync processed `b/large-dir`. The only "relevant/non-standard" entry in `.prf` seems to be `times=true`.
I will see if I can reproduce it again with this new finding.
The `ignore` preferences should not play any role here (neither should `times`).
I think you already answered this, but just in case: have you set any preferences such as `maxthreads` or `stream`?
If you can reproduce the issue reliably, could you then try version 2.53.0 and see whether it works there?
I've done some more debugging and can see the same code being executed, yet the stack is not growing (and I'm testing with far larger directory trees). I think it is unlikely, but your binary might not be properly optimized. If you manage to reproduce the issue, could you try a binary built in a different environment (for example, one downloaded from GH)? More precisely, you should try a binary built without `opam` or in a completely clean `opam` env. I know this is far-fetched, but you might have some compiler settings that produce a non-optimized or differently optimized binary. (Even if that turns out to be the case, it could still be a bug that needs fixing.)
The binaries come from https://build.opensuse.org/package/show/Archiving:unison/unison and are always clean builds using https://build.opensuse.org/project/show/Archiving:unison:buildrequires.
It will be a while before I find the time to reproduce it.
I can easily reproduce this with your build and also with the 2.53.3 build from GH. I had been testing with the latest code (unreleased, but available on GH), which worked even with a tiny stack. So the next release will likely fix this issue, but it would be great if you could test the latest code now.
I'm going to close this, as I believe it is fixed in the current code base. If you can repro with 2.53.4 (RC2 today, release tomorrow), please post here and I'll happily reopen.
I was told on unison-users to report a similar case here:
Version: 2.53.4 (OCaml 5.1.0), OS: Arch Linux
I got reproducible crashes syncing my laptop against my desktop, both running Linux. The crash happened while the sync was running; no user interaction was necessary. I could repeat it three times, and all three times the backtrace in gdb showed `caml_try_realloc_stack` as the top entry.
I can also confirm that unison worked without any problem using the text UI.
I have attached the three traces I got from the three crashes.
We now believe @norbusan's report is the same issue as #1006. I am therefore closing #900. @norbusan: please follow up in #1006 if useful/appropriate.
I have prepared a tentative fix that I believe may address both this ticket and #1006, although Arch Linux seems to be hit harder and might still be a different issue.
@olafhering, if you are interested in testing it, see the details at https://github.com/bcpierce00/unison/issues/1006#issuecomment-2028792202
As reported by @olafhering on unison-users@ after discussion, propagation seems to use excessive stack depth. The repro recipe is roughly:
Looking at the stack trace below, it smells like there is a recursive function that is not tail-recursive when it should be, but that is speculation. The bug is that while 350K files might reasonably take 350K * A bytes of RAM for some constant A, they should not result in stack usage proportional to the number of files.
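The distinction this speculation relies on can be sketched in OCaml, Unison's implementation language. This is a generic illustration under the assumption that the culprit is some per-file recursion over a list. The function names are hypothetical, and this is not the code path from the trace.

```ocaml
(* Hypothetical illustration only: [sum_sizes] and [sum_sizes_tr] are
   made-up names, not functions from the Unison code base. *)

(* Not tail-recursive: the addition runs after the recursive call
   returns, so every list element keeps a stack frame alive until the
   end of the list is reached. Stack depth grows with the list length. *)
let rec sum_sizes = function
  | [] -> 0
  | size :: rest -> size + sum_sizes rest

(* Tail-recursive: the recursive call to [go] is the last thing it does,
   so it reuses the current frame and the stack depth stays constant;
   the running total lives in the accumulator [acc]. *)
let sum_sizes_tr sizes =
  let rec go acc = function
    | [] -> acc
    | size :: rest -> go (acc + size) rest
  in
  go 0 sizes

let () =
  (* 350_000 entries, mirroring the file count in the report; only the
     tail-recursive version is called on the large list. *)
  let sizes = List.init 350_000 (fun _ -> 1) in
  Printf.printf "%d\n" (sum_sizes_tr sizes)
```

The standard library's `List.fold_left ( + ) 0 sizes` computes the same sum and is also tail-recursive. With the first shape, per-file state ends up on the stack rather than the heap, which is the symptom described above.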