MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0
160 stars 30 forks source link

"buffer overflow detected" when trying to install SLiM on Linux #437

Closed alfroids closed 5 months ago

alfroids commented 5 months ago

I've been having trouble when trying to install SLiM from the AUR using paru in Arch. The package seems to be downloaded and built correctly, but when installation tests start (specifically -testSLiM), I get a *** buffer overflow detected *** and the installation is aborted. The installation log ends like this:

[100%] Built target SLiMgui
==> Starting check()...
Running 'slim -testEidos'...

SUCCESS count: 6953
Running 'slim -testSLiM'...
*** buffer overflow detected ***: terminated
/home/alfroids/.cache/paru/clone/slim-simulator/PKGBUILD: line 42:  8003 Aborted                 (core dumped) ./slim -testSLiM
SLiM tests failed
==> ERROR: A failure occurred in check().
    Aborting...
error: failed to build 'slim-simulator-4.2.r0.g6d00ad37-1':
error: packages failed to build: slim-simulator-4.2.r0.g6d00ad37-1
bhaller commented 5 months ago

Hi @alfroids. I'm going to ping @grahamgower since he maintains the Arch installer, and is an Arch user himself. I don't know what might be going on here. The text "buffer overflow detected" does not occur in the SLiM code base, I believe, so I don't think this is a SLiM error. So I'm not sure whose error it is! Somebody's buffer is overflowing and they're complaining about it, but I'm not sure whether it has anything to do with SLiM or not.

A couple of questions. (1) Have you tried doing this multiple times, and it always fails the same way each time? (2) Does the failure of the installer test mean that SLiM gets removed, or does it remain installed on your machine? If it remains installed on your machine, can you run slim -testSLiM yourself, and does it succeed? (3) Can you look at /home/alfroids/.cache/paru/clone/slim-simulator/PKGBUILD and see what is at line 42?

It would be nice to track this down, but I guess nobody else on Arch (including @grahamgower himself) has hit this error, so it might be something specific to your machine. Anything unusual about it? Is it a real Arch box, or is it a virtual machine running under something-or-other? Does it have a full, standard Arch install? What version, etc.?

alfroids commented 5 months ago

(1) Yes, I tried multiple times today and yesterday and always got the same result. (2) The files and scripts are still there in the same paru/clone/slim-simulator directory and I can run them directly. SLiMgui runs and the default recipe works fine, but when I run ./slim -testSLiM I get the same message as before, just shorter. (3) Line 42 is the start of the check() function definition:

42 check() {
43         cd build
44         echo "Running 'slim -testEidos'..."
45         ./slim -testEidos || (echo "Eidos tests failed"; exit 1)
46         echo "Running 'slim -testSLiM'..."
47         ./slim -testSLiM || (echo "SLiM tests failed"; exit 1)
48 }

It's a fresh Arch install using the archinstall script directly on my laptop (not a VM). This is my first time using Arch and I am beginning to set things up, so it's possible I missed something basic that's resulting on this overflow.

bhaller commented 5 months ago

OK. Good to know it happens at the command line too; that eliminates some possible causes. I really wonder who is printing out that "buffer overflow detected" message. That is what it says when run at the command line, too? Super weird. If SLiM were raising an error itself, I think you'd see that error message. Whose buffer is overflowing? Hmm, googling that phrase brings up this:

https://stackoverflow.com/questions/70570379/whats-the-source-of-this-enigmatic-buffer-overflow-detected-terminated

Same error message; but nothing in that discussion has anything whatsoever to do with SLiM, as far as I can see. I also found this:

https://bugzilla.redhat.com/show_bug.cgi?id=2165653#c3

Again, no idea whether it's related to this problem or not, but it looks like there was a bug in Fedora recently that caused this error, and it has been found and fixed. Could this be biting you, conceivably?

Hmm, you said it did a core dump when it crashed. I wonder if we could get a backtrace from the core dump? I have no idea how to do that on Linux. Maybe Google can help, or maybe @grahamgower can help with that. If we can get a good backtrace for the point at which this occurs, it should shed a great deal of light on the problem. I think that's the next step.

alfroids commented 5 months ago

Hi @bhaller. I just tried building SLiM from source expecting I would get the same overflow problem, but weirdly I didn't. I was able to build SLiM and SLiMgui and run both tests successfully. I tried looking into this "buffer overflow detected" message, but didn't really understand what causes it and how to avoid or fix it. I don't know what might be happening there as it seems to be a problem specifically between my OS and the AUR installation.

bhaller commented 5 months ago

Yes, I suspected that building from sources might work. So it's some kind of side effect of having built SLiM through the Arch installer. @grahamgower is there anything you want to pursue here, or shall we just shrug and move on? I can't imagine how we'd figure out what the problem is here. It's a bit alarming, but what can one do?

bhaller commented 5 months ago

(It'd still be great to get a backtrace from the crash log, if anybody can figure out how to do that...)

grahamgower commented 5 months ago

This appears to be a bug in SLiM that is exposed when building with -D_FORTIFY_SOURCE=3 and recent compilers (gcc>=12, not sure which clang) and glibc (>=2.34), which enables some aggressive memory checks. It seems that Arch have recently changed their defaults so that packages are built with -D_FORTIFY_SOURCE=3 in CFLAGS and CXXFLAGS (configurable in /etc/makepkg.conf). When updating Arch, config files like makepkg.conf are not overridden, but the new config files are placed in a *.pacnew file instead - so folks with fresh installs are more likely to run into this than me, as I don't usually notice the *.pacnew files for months (years?).

So, yes, I can reproduce this. Here's the backtrace.

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
    no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7aab393 in __pthread_kill_internal (signo=6, threadid=<optimized out>)
    at pthread_kill.c:78
#2  0x00007ffff7a5a6c8 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7a424b8 in __GI_abort () at abort.c:79
#4  0x00007ffff7a43395 in __libc_message_impl (
    fmt=fmt@entry=0x7ffff7bbb161 "*** %s ***: terminated\n")
    at ../sysdeps/posix/libc_fatal.c:132
#5  0x00007ffff7b3273b in __GI___fortify_fail (
    msg=msg@entry=0x7ffff7bbb148 "buffer overflow detected") at fortify_fail.c:24
#6  0x00007ffff7b320e6 in __GI___chk_fail () at chk_fail.c:28
#7  0x00007ffff7b33945 in ___snprintf_chk (s=s@entry=0x7fffffffcb02 "",
    maxlen=maxlen@entry=65, flag=flag@entry=2, slen=slen@entry=63,
    format=format@entry=0x555555abf7a6 "%02x") at snprintf_chk.c:29
#8  0x00005555557fa9f7 in snprintf (__fmt=<optimized out>, __n=<optimized out>,
    __s=<optimized out>, __s=<optimized out>, __n=<optimized out>,
    __fmt=<optimized out>) at /usr/include/bits/stdio2.h:54
#9  Eidos_hash_to_string (
    hash=0x7fffffffcf70 "g\220]\023j8\315\001`\v\340\031\006\210%4$y\376\nb-\002\356k}ǖ\303(dbh\364\001VUU", string=0x7fffffffcb02 "")
    at /usr/src/debug/slim-simulator/SLiM/eidos/eidos_globals.cpp:3803
#10 Species::WriteProvenanceTable (this=<optimized out>, p_tables=<optimized out>,
    p_use_newlines=<optimized out>, p_include_model=<optimized out>)
    at /usr/src/debug/slim-simulator/SLiM/core/species.cpp:5761
#11 0x0000555555803243 in Species::WriteTreeSequence (this=0x555556023ef0,
    p_recording_tree_path=..., p_binary=false, p_simplify=<optimized out>,
    p_include_model=true, p_metadata_dict=0x0)
    at /usr/src/debug/slim-simulator/SLiM/core/species.cpp:6321
#12 0x000055555582adf3 in Species::ExecuteMethod_treeSeqOutput (this=0x555556023ef0,
    p_method_id=<optimized out>, p_arguments=..., p_interpreter=...)
    at /usr/src/debug/slim-simulator/SLiM/core/species_eidos.cpp:3303
#13 0x000055555582699e in Species::ExecuteInstanceMethod (this=<optimized out>,
    p_method_id=<optimized out>, p_arguments=..., p_interpreter=...)
    at /usr/src/debug/slim-simulator/SLiM/core/species_eidos.cpp:1707
#14 0x00005555559ef3f8 in EidosValue_Object::ExecuteMethodCall (this=0x5555565d6180,
    p_method_id=327, p_method_signature=0x555555d16070,
    p_arguments=std::vector of length 5, capacity 8 = {...}, p_interpreter=...)
    at /usr/src/debug/slim-simulator/SLiM/eidos/eidos_value.cpp:2370
#15 0x0000555555927ea2 in EidosInterpreter::Evaluate_Call (this=0x7fffffffdca0,
    p_node=0x7ffff787cd10)
    at /usr/src/debug/slim-simulator/SLiM/eidos/eidos_interpreter.cpp:1627
#16 0x000055555591ad2b in EidosInterpreter::FastEvaluateNode (
    p_node=<optimized out>, this=<optimized out>, this=<optimized out>, p_node=<optimized out>) at /usr/src/debug/slim-simulator/SLiM/eidos/eidos_interpreter.h:318
#17 EidosInterpreter::Evaluate_CompoundStatement (this=0x7fffffffdca0, p_node=<optimized out>) at /usr/src/debug/slim-simulator/SLiM/eidos/eidos_interpreter.cpp:901
#18 0x000055555591f5b5 in EidosInterpreter::FastEvaluateNode (p_node=<optimized out>, this=<optimized out>, this=<optimized out>, p_node=<optimized out>)
    at /usr/src/debug/slim-simulator/SLiM/eidos/eidos_interpreter.h:318
#19 EidosInterpreter::EvaluateInternalBlock (this=0x7fffffffdca0, p_script_for_block=<optimized out>) at /usr/src/debug/slim-simulator/SLiM/eidos/eidos_interpreter.cpp:188
#20 0x000055555568a339 in Community::ExecuteEidosEvent (this=this@entry=0x55555653e9f0, p_script_block=0x555555dd2e80)
    at /usr/src/debug/slim-simulator/SLiM/core/community.cpp:2146
#21 0x000055555568a7c5 in Community::_RunOneTickWF (this=0x55555653e9f0) at /usr/src/debug/slim-simulator/SLiM/core/community.cpp:2312
#22 0x0000555555742021 in SLiMAssertScriptStop (
    p_script_string="initialize() { initializeTreeSeq(); } initialize() { initializeMutationRate(1e-7); initializeMutationType('m1', 0.5, 'f', 0.0); initializeGenomicElementType('g1', m1, 1.0); initializeGenomicElement(g1"..., p_lineNumber=2101) at /usr/src/debug/slim-simulator/SLiM/core/slim_test.cpp:239
#23 0x00005555557be483 in _RunTreeSeqTests (temp_path=...) at /usr/src/debug/slim-simulator/SLiM/core/slim_test_other.cpp:2101
#24 0x0000555555744773 in RunSLiMTests () at /usr/src/debug/slim-simulator/SLiM/core/slim_test.cpp:384
#25 0x0000555555668b63 in main (argc=<optimized out>, argv=0x7fffffffe6a8) at /usr/src/debug/slim-simulator/SLiM/core/main.cpp:366
bhaller commented 5 months ago

Ah, great detective work @grahamgower! The code is technically safe – it would never cause an actual buffer overrun as written – but it uses snprintf() in a way that looks unsafe, and I can see why FORTIFY_SOURCE would dislike it. I've never heard of FORTIFY_SOURCE, and I'm not sure whether Clang on macOS supports it – I turned it on in my build, but the self-check still runs fine for me. So, hmm. I've just pushed a fix that I think ought to resolve the problem, but I couldn't test it on my machine since FORTIFY_SOURCE doesn't seem to work on my machine. @grahamgower can you test the fix I pushed?

Assuming it works, I guess the next question is what to do about it. I don't really want to roll a new SLiM release just for this, because:

So I think I'll let it go until the next SLiM release, whenever that is. Once it has been confirmed that my push fixes this issue, I'll close it. Thanks folks!

grahamgower commented 5 months ago

I love your work @bhaller - your fix works and no other issues are apparent. No need for you to roll out another release. I've just bumped the PKGBUILD file to point at the latest commit, which automagically increments the package version from 4.2.r0.g6d00ad37 to 4.2.r1.g5b110b92.

Also, thanks very much to @alfroids for reporting the problem! You should be able to install from the updated PKGBUILD. I'm not familiar with paru, but I guess this will be automatic when you next try, or there will be an explicit way to refresh/resync with the upstream package sources.

bhaller commented 5 months ago

I love your work @bhaller - your fix works and no other issues are apparent. No need for you to roll out another release. I've just bumped the PKGBUILD file to point at the latest commit, which automagically increments the package version from 4.2.r0.g6d00ad37 to 4.2.r1.g5b110b92.

Thanks @grahamgower, that sounds perfect. This is the first commit since the 4.2 release, so this is a clean update to the release that patches just this one problem. :->

Also, thanks very much to @alfroids for reporting the problem! You should be able to install from the updated PKGBUILD. I'm not familiar with paru, but I guess this will be automatic when you next try, or there will be an explicit way to refresh/resync with the upstream package sources.

Sounds great. @alfroids, let me know if you encounter any other issues, otherwise I think you're enrolled in a workshop, so I'll see you there! :->

alfroids commented 5 months ago

Thank you very much guys! See you there @bhaller!

jshelly commented 5 months ago

I am getting this same buffer overflow issue with the slim -testSLiM test. I have created a fedora container using Singularity and the following def file (plus some %help and %labels text, excluded here):

BootStrap: docker
From: fedora:39

%post
    dnf clean all && rm -r /var/cache/dnf
    dnf -y install dnf-plugins-core
    dnf -y copr enable bacarson/SLiM-Selection_on_Linked_Mutations
    dnf -y remove mesa-dri-drivers mesa-filesystem mesa-libEGL mesa-libGL mesa-libgbm mesa-libglapi mesa-va-drivers
    dnf -y install mesa-dri-drivers-23.2.1-2.fc39.x86_64
    dnf -y install mesa-filesystem-23.2.1-2.fc39.x86_64
    dnf -y install mesa-libEGL-23.2.1-2.fc39.x86_64
    dnf -y install mesa-libGL-23.2.1-2.fc39.x86_64
    dnf -y install mesa-libgbm-23.2.1-2.fc39.x86_64
    dnf -y install mesa-libglapi-23.2.1-2.fc39.x86_64
    dnf -y install mesa-va-drivers-23.2.1-2.fc39.x86_64
    dnf -y install SLiM-4.2-1.fc39
    mkdir -p /tmp /gpfs1 /gpfs2 /gpfs3
    export LANGUAGE=en_US.UTF-8
    export LANG=en_US.UTF-8

As an aside, the mesa downgrade was required with fedora 39.

I run the tests and get:

$ ./slim -testEidos

SUCCESS count: 6953

$ ./slim -testSLiM
*** buffer overflow detected ***: terminated
Aborted

I also ran this in a writable singularity sandbox and using the singularity %test block, with the same result. I went back and checked a prior container that I had made for SLiM version 4.1 and found the same issue exists there as well.

I created the containers using sylab's cloud service. My host machine is running RHEL 7.9, and I am using Singularity 3.7.1.

Do you have any ideas or suggestions for me?

Thank you, Shelly Johnson UVM

bhaller commented 5 months ago

I am getting this same buffer overflow issue with the slim -testSLiM test.... Do you have any ideas or suggestions for me?

Hi @jshelly (and anybody else reading this). If you need to use treeSeqOutput() in your model's script, then you need the bug fix; in that case, you will need to make your own build from sources. I would recommend using the sources from GitHub at the point of the bug fix. That is this commit, SHA 5b110b92470d7cb4847a66309547984ca58d5ee0. Chapter 2 of the SLiM manual has instructions on building from sources.

If you do not need to use treeSeqOutput() in your model's script, then the self-test error is the only problem you should encounter. In that case, you can simply use the installed version of slim and ignore the self-test error.

Now that I know this is biting some people who are not on Arch, and thus has a broader impact than I realized, I will consider doing a SLiM 4.2.1 release to fix this problem. If that happens, it would probably be in 1-2 weeks.

bhaller commented 5 months ago

@bryce-carson I could use a bit of advice regarding handling this problem. @grahamgower shifted the Arch installer to incorporate the fix for this problem. It now appears that it occurs for Fedora as well, at least in the specific environment described above. Basically, any Linux platform that has changed its install/build system to turn on FORTIFY_SOURCE (only at the full level 3, I think?) will hit this issue, I guess. Do you have a sense of how many users are likely to be affected by this problem? And would it be easy for you to shift over the installers for affected platforms, the way Graham did, without doing a new release? I'd prefer to avoid doing a 4.2.1 release for this; a silent patch of the 4.2 release would be preferable I think. Or if you think very few users will be affected by this, we can just ignore the problem and point them to the previous comment when they come to us for help. Thoughts?

jshelly commented 5 months ago

Thanks for the quick response @bhaller! This is being provided centrally as a module for users on our HPC cluster. So, it is important to provide a bug free installation, since I cannot be sure what functions will be used/needed. I will check with the known users and see if they don't mind waiting a week or two for the RPM fix. Otherwise I'll take a look at the instructions to build from source. Thanks again!

bhaller commented 5 months ago

Thanks for the quick response @bhaller! This is being provided centrally as a module for users on our HPC cluster. So, it is important to provide a bug free installation, since I cannot be sure what functions will be used/needed. I will check with the known users and see if they don't mind waiting a week or two for the RPM fix. Otherwise I'll take a look at the instructions to build from source. Thanks again!

OK. Note that at this time it is not clear that there will be an RPM fix; that is to be determined. Building from source is actually very easy, especially if you don't need to build SLiMgui (which, on an HPC cluster, you would presumably not need to do).

jshelly commented 5 months ago

Thanks for the clarification. Actually, our users do want and need to use SLiMgui. They use it through our Open OnDemand Desktop interface. It's quite an easy way for them to use it in a cluster environment. I'm more concerned about having all of the right dependencies, which are easier to deal with via RPM in the container. With an older version of RedHat and also a some administrative hurdles to get things like Qt installed system wide via yum install, it could prove a bit more tedious than it otherwise would be. I would have to crosscheck what dependencies might need to be installed. It might be "easier" to install from source into a container rather than directly on the host.

bryce-carson commented 5 months ago

Thanks for the clarification. Actually, our users do want and need to use SLiMgui. They use it through our Open OnDemand Desktop interface. It's quite an easy way for them to use it in a cluster environment. I'm more concerned about having all of the right dependencies, which are easier to deal with via RPM in the container. With an older version of RedHat and also a some administrative hurdles to get things like Qt installed system wide via yum install, it could prove a bit more tedious than it otherwise would be. I would have to crosscheck what dependencies might need to be installed. It might be "easier" to install from source into a container rather than directly on the host.

Hi @jshelly, thanks for joining the issue discussion and informing us about that the issue is also present on RedHat systems. I'll bump the release in the package and initiate a new build on COPR. You should be able to update the packages on your system in an hour or so.

justbennet commented 5 months ago

That's super, and a huge thank you, @bryce-carson !

bhaller commented 5 months ago

I'll reopen this issue for the present moment. I'll close it again when @bryce-carson tells me the COPR build is all good, and announce the fix on slim-announce and slim-discuss. Thanks everybody!

justbennet commented 5 months ago

@bhaller Was just wondering if it's within the rules to suggest a change to the Issue description. As it turned out, the problem was bigger than Arch Linux, and I'm wondering if changing the description to something like

slim -testSLiM: buffer overflow detected

might make life easier for anyone else who comes across this? Just a thought....

bhaller commented 5 months ago

@bhaller Was just wondering if it's within the rules to suggest a change to the Issue description.

Good idea, done!

jshelly commented 5 months ago

Excellent news. Thanks for your quick work on this, @bryce-carson and @bhaller!!

bryce-carson commented 5 months ago

Excellent news. Thanks for your quick work on this, @bryce-carson and @bhaller!!

You're welcome, @jshelly. The patched RPM is now available from the Copr repository. Upgrade the package with dnf (sudo dnf upgrade SLiM), then all should be well. If you'd like, you may run the tests to confirm everything is well.

@bhaller, if you can write a new section in the Makefile to run the tests, so that make test can be called, that'd be useful for the packaging process to detect issues like these in the future (it's not needed for the patched RPM; that's already live). I tried to locate the binaries and call the tests myself, but I couldn't find them where I expected them to be. It's easier to hand the work off to CMake since it magically knows where the binaries are.

grahamgower commented 5 months ago

I'd like to provide a bit more clarity about when this bug should be visible... It requires that SLiM was built from source with -D_FORTIFY_SOURCE=3 in the CXXFLAGS, not that this is set somewhere at runtime. Clearly, this could happen if the the user builds from source manually. However, if folks are seeing this after installing a binary package, that means that the build environment where the packages were built (e.g. the Fedora rpm) contain this as well. For this not to have been noticed at build time means that the test cases weren't run after building the package—if that is the case then I highly recommend that gets fixed and that we double check that the tests are run on all the prebuilt packages as part of their build process!

FYI, it seems that several of the major Linux distributions are setting -D_FORTIFY_SOURCE=3 when building all of their release packages (e.g. Redhat, SUSE, Arch). This means that this setting is likely to be inherited in a lot of environments, such as those where third-party packages like SLiM get built.

bryce-carson commented 5 months ago

Indeed. The tests were not run in the RPM package, which I'm really wondering why now. I know I previously had tests run, but I must've removed it when I refactored the spec file.

I tried adding them back in, but there was some difficulty in finding the binaries. I'll try again next Saturday.

jshelly commented 5 months ago

You're welcome, @jshelly. The patched RPM is now available from the Copr repository. Upgrade the package with dnf (sudo dnf upgrade SLiM), then all should be well. If you'd like, you may run the tests to confirm everything is well.

Thanks, @bryce-carson. I've installed the update. Ended up creating a new container with dnf -y install SLiM-4.2-2.fc39 and all is well. slim -testSLiM now completes with SUCCESS.

bryce-carson commented 5 months ago

You're welcome, @jshelly. The patched RPM is now available from the Copr repository. Upgrade the package with dnf (sudo dnf upgrade SLiM), then all should be well. If you'd like, you may run the tests to confirm everything is well.

Thanks, @bryce-carson. I've installed the update. Ended up creating a new container with dnf -y install SLiM-4.2-2.fc39 and all is well. slim -testSLiM now completes with SUCCESS.

Glad to hear!

bhaller commented 5 months ago

Just made new issue #438 to represent your request, @bryce-carson. Announced the problem and the fix on slim-discuss and slim-announce. I think we're done here. I hope folks had a nice eclipse experience, where possible! :->