Closed raphi011 closed 4 years ago
Thank you for reporting. Can you attach a trace, maybe run valgrind kdb ls /
? Some information on how to reproduce it would also be helpful.
==110109== Memcheck, a memory error detector
==110109== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==110109== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==110109== Command: kdb ls /
==110109==
==110109== Invalid read of size 8
==110109== at 0x52778E8: updatePointers (mmapstorage.c:948)
==110109== by 0x52778E8: libelektra_mmapstorage_LTX_elektraPluginget (mmapstorage.c:1148)
==110109== by 0x5270711: elektraCacheGet (cache.c:321)
==110109== by 0x48E6849: elektraCacheLoad (kdb.c:825)
==110109== by 0x48E6849: kdbGet (kdb.c:1038)
==110109== by 0x174EA0: get (kdb.hpp:185)
==110109== by 0x174EA0: get (kdb.hpp:168)
==110109== by 0x174EA0: Cmdline::Cmdline(int, char**, Command*) (cmdline.cpp:310)
==110109== by 0x1696F6: main (main.cpp:170)
==110109== Address 0x5285d10 is not stack'd, malloc'd or (recently) free'd
==110109==
Sorry, I crashed by the signal SIGSEGV
This should not have happened!
Please report the issue at https://issues.libelektra.org/
==110109==
==110109== Process terminating with default action of signal 6 (SIGABRT): dumping core
==110109== at 0x4CDEF25: raise (in /usr/lib/libc-2.30.so)
==110109== by 0x4CC8896: abort (in /usr/lib/libc-2.30.so)
==110109== by 0x1DF04B: catchSignal(int) (main.cpp:110)
==110109== by 0x4CDEFAF: ??? (in /usr/lib/libc-2.30.so)
==110109== by 0x52778E7: updatePointers (mmapstorage.c:946)
==110109== by 0x52778E7: libelektra_mmapstorage_LTX_elektraPluginget (mmapstorage.c:1148)
==110109== by 0x5270711: elektraCacheGet (cache.c:321)
==110109== by 0x48E6849: elektraCacheLoad (kdb.c:825)
==110109== by 0x48E6849: kdbGet (kdb.c:1038)
==110109== by 0x174EA0: get (kdb.hpp:185)
==110109== by 0x174EA0: get (kdb.hpp:168)
==110109== by 0x174EA0: Cmdline::Cmdline(int, char**, Command*) (cmdline.cpp:310)
==110109== by 0x1696F6: main (main.cpp:170)
==110109==
==110109== HEAP SUMMARY:
==110109== in use at exit: 173,723 bytes in 1,005 blocks
==110109== total heap usage: 2,915 allocs, 1,910 frees, 437,331 bytes allocated
==110109==
==110109== LEAK SUMMARY:
==110109== definitely lost: 0 bytes in 0 blocks
==110109== indirectly lost: 0 bytes in 0 blocks
==110109== possibly lost: 0 bytes in 0 blocks
==110109== still reachable: 173,723 bytes in 1,005 blocks
==110109== suppressed: 0 bytes in 0 blocks
==110109== Rerun with --leak-check=full to see details of leaked memory
==110109==
==110109== For lists of detected and suppressed errors, rerun with: -s
==110109== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
fish: 'valgrind kdb ls /' terminated by signal SIGABRT (Abort)
Seems to be an issue with mmap
. After running rm -rf ~/.cache/elektra
it works again
Yes, the crash definitely happened because of cache
/ mmap
. Can you explain what you did to trigger this?
I ran my elektrad benchmarks, after clearing the cache and re-running it didn't break again. If this occurs again I will mention it in this issue.
@mpranj a big thanks that you take care of this issue.
Why are there casts to char*
in the segfaulting code?
Why are there casts to char* in the segfaulting code?
Pretty sure the cast is not the problem. It's used for byte-wise pointer arithmetic. Before storage we subtract the old base address, making all pointers relative to the base address of the mmap region. After mmaping again, we update the pointers to the new base mmap address by adding the new address to all pointers.
Ok. Do you have an idea how this problem was caused?
Do you have an idea how this problem was caused?
Sorry but currently no. This code is basically executed on each kdbGet now, so it's not like a seldom taken code path.
Corrupting the cache file in exactly the right spot might cause something like that (like a wrong ks->size or similar).
Are the operations on the cache files atomic? Did you test aborting the process while writing the cache?
There are no testcases for such a scenario, but the cache file is written to a temp file and only rename()
d on success.
EDIT: Additionally, the magic footer of the mmap file woud prevent us from reading an unfinished file. The footer should be the last write. I could enforce it via an additional msync()
, but it might slow things down.
Additionally, the magic footer of the mmap file woud prevent us from reading an unfinished file.
So most probably it was something else.
I could enforce it via an additional msync()
The sync/msync would be the correct thing to do (also on the folder where you rename, see sync plugin).
but it might slow things down.
As it only impacts writing, you do not need to worry.
Just to clarify: msync
is done already. This would just mean another msync
to enforce order of the writes. Let's do it and see what it does in the benchmarks.
Still: It's unclear that this is the reason for the segfault here.
@raphi011 you do not have the cache file anymore to reproduce this issue? Did you do some upgrade of Elektra or similar?
@raphi011 you do not have the cache file anymore to reproduce this issue? Did you do some upgrade of Elektra or similar?
Unfortunately not ..
The commits referencing this issue are a shot in the dark. The assumption was that the OS reordered some writes from mmap and mmapstorage
did not correctly detect incompletely written data, which was possible before those commits.
Even having the cache files would probably not help as there is no way to know whether some data was not completely written.
What is the actual lifetime of the keys in mmap storage? While benchmarking the memory of my server is growing and growing even though I'm ksClear
ing all keysets after it's usage - until it get's killed by the OS because it uses too much memory.
When is memory that is allocated with mmap freed?
When is memory that is allocated with mmap freed?
The answer is never. This was a design decision discussed with @markus2330. The memory pages are shared and the OS implements COW via mmap. This means depending on the usage it is possible to have a KeySet open which is larger than the available main memory.
The alternative back then was to implement reference counting to unmap the mmaped files which needed some tricky implementation in the core. (Think of another KeySet referencing the keys from a mmaped KeySet, it is tricky to know when it can be really unmapped and basically out of scope for my thesis)
Maybe this is the reason why you had incomplete mmap cache files. 🤔
@markus2330 this is probably not the right place to discuss this .. but how is this supposed to work with elektrad that has long lived handles? After a session expires a new handle + keyset is created and the old one is closed - but the memory is not freed which results in the OS killing the process after a while (which happens pretty fast in the benchmarks).
This probably means I have to disable mmap somehow for elektrad .. is this possible?
The OS can free up memory from mmap that has not been changed, so we "only" leak the pages that are changed COW. But you're right, we might need to re-evaluate this decision.
This probably means I have to disable mmap somehow for elektrad .. is this possible?
Easy: kdb cache disable
. The cache is stored on a per-system-user basis, so this needs to be also done on a per-user basis.
If you need to completely disable it then compile without the cache
plugin
(e.g. -DPLUGINS="ALL;-cache"
).
The alternative back then was to implement reference counting to unmap the mmaped files which needed some tricky implementation in the core. (Think of another KeySet referencing the keys from a mmaped KeySet, it is tricky to know when it can be really unmapped and basically out of scope for my thesis)
@markus2330 One problem was/is that unmapping would have to be implemented in the core, and mmap()
is not part of the C standards, but let's assume this is no problem. Then unmapping can be done somewhat easily in the core if ksAppend()
really copies the keys and does not simply reference them. Referencing mmaped Keys was afair the biggest problem with this unmapping approach. I'm not sure if we can/may change the semantics of ksAppend()
though.
I have disabled the cache plugin completely - as described above via -DPLUGINS="ALL;-cache"
and somehow memory is STILL not freed after calling ksDel
If you compiled Elektra cleanly with the cmake option -DPLUGINS="ALL;-cache"
(verify by cmake output "Exclude Plugin cache because explicitly excluded"), then I see no way how mmap could affect your results, except if you explicitly enabled mmapstorage as default storage (again, can be excluded by -DPLUGINS="ALL;-cache;-mmapstorage"
).
If you're sure we're leaking somewhere, then you can always run valgrind --leak-check=full /your/kdb/tests
and see which source file has leaked memory (make sure you have debug symbols, which is default anyway: CMAKE_BUILD_TYPE: RelWithDebInfo).
@mpranj wrote:
and basically out of scope for my thesis
I fully agree. For now we only find ways of how to disable the cache when it does not work. I created #3155 and #3156
@raphi011 wrote:
this is probably not the right place to discuss this
The issue tracker is the right place. But please always make sure to either rename the title (if the previous discussion is finished) or open new issues.
which happens pretty fast in the benchmarks
Can you tell us about how many KDB handles (sessions) we are speaking here? Maybe this is only a synthetic (benchmark) problem anyway. In real-world scenarios not more than a few handles are expected to be alive at the same time. (elektrad is there to configure a single system.)
even though I'm ksClearing all keysets after it's usage I have disabled the cache plugin completely - as described above via -DPLUGINS="ALL;-cache" and somehow memory is STILL not freed after calling ksDel
ksClear is internal and afaik there are no tests if it is safe to reuse the keyset afterwards. So maybe you already lost pointers to allocated memory within ksClear. If this is the case, please open a new issue. (Lets keep this issue to discuss the problems with mmap and too many handles.)
I will close this issue since a fix has been added. If this happens again please feel free to reopen it.
While running my benchmarks I somehow broke elektra.
When I run a kdb command like
kdb ls /
I get the following message (also when calling the elektra library directly)
I've already run
kdb reset
andkdb reset-elektra
and re-compiled and uninstalled + installed kdb. Any ideas how I could fix / debug this?