MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0

LTO-related compilation problems #33

Open bhaller opened 5 years ago

bhaller commented 5 years ago

Hi @molpopgen. The LTO stuff you contributed to SLiM is now out in the 3.2.1 release (thanks!), and a user sent me the following:

I had a little trouble compiling on my ubuntu 14.04 machine.

cmake seemed to work fine. But there were many compilation errors similar to: /usr/bin/ranlib: pow_int.c.o: plugin needed to handle lto object

I added the following lines to CMakeLists.txt:

SET(CMAKE_AR "gcc-ar")
SET(CMAKE_C_ARCHIVE_CREATE " qcs ")
SET(CMAKE_C_ARCHIVE_FINISH true)
SET(CMAKE_CXX_ARCHIVE_CREATE " qcs ")
SET(CMAKE_CXX_ARCHIVE_FINISH true)

It then compiled, and all tests passed. Thought I would pass this on, in case it might be of use to others (though possibly it is an unusual issue, or easily solved by someone that compiles a lot of c++).

I have no idea what this all means; I'm hoping you do. :-> I'm not sure where in CMakeLists.txt to add that, much less what it means or what the problem is. If you grok this, could you possibly submit a pull request that fixes it? Thanks!

molpopgen commented 5 years ago

Not enough info. That is an old distro, so we need the compiler version. You may also be able to test on Travis?

bhaller commented 5 years ago

OK. I've alerted the user who reported the problem to the existence of this Github issue; hopefully he will reply here with more details. Thanks.

jasongbragg commented 5 years ago

This might be an obscure issue that will affect very few. If so, sorry to be a pest! It affected one machine I tried, and did not affect another.

The machine where SLiM 3 did not compile was ubuntu 16.04 (my error, sorry), with the following compiler info:

$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

On another machine, running ubuntu 14.04 and c++ 4.8.4, everything compiled.

Some of the compiler errors that were observed:

/usr/bin/ar: CMakeFiles/gsl.dir/gsl/cblas/xerbla.c.o: plugin needed to handle lto object
/usr/bin/ar: CMakeFiles/gsl.dir/gsl/cblas/dgemv.c.o: plugin needed to handle lto object
/usr/bin/ar: CMakeFiles/gsl.dir/gsl/cblas/dtrmv.c.o: plugin needed to handle lto object

/usr/bin/ranlib: xerbla.c.o: plugin needed to handle lto object
/usr/bin/ranlib: dgemv.c.o: plugin needed to handle lto object
/usr/bin/ranlib: dtrmv.c.o: plugin needed to handle lto object

The following post seemed to replicate the errors, and suggested a solution, which seemed to work: https://stackoverflow.com/questions/39236917/using-gccs-link-time-optimization-with-static-linked-libraries

bhaller commented 5 years ago

@molpopgen It's interesting that the errors seem to be in the GSL code. SLiM includes the GSL files that it needs within its own project; it does not have an external link dependency on the GSL. However, perhaps on Jason's 16.04 machine it was trying to link against his installed GSL somehow, and that didn't have the LTO compilation support that was needed? That's just a wild guess, I don't really understand any of this. :-> Anyway, since Jason is the only person who has reported this problem, and it's on a machine that he says is unusually configured, punting might be reasonable. If the fix makes sense to you, though, and seems harmless, then possibly it would make sense to take it...?

molpopgen commented 5 years ago

I worry that the fix is incorrect when GCC is not the compiler. The 'gcc-ar' bit is a red flag.

The SO post talks about static libs, but slim can't be making one, as osx doesn't support them.

I'm tempted to say close the issue because it isn't reproducible.

bhaller commented 5 years ago

OK, I'll do that for now; it can always be reopened if someone else encounters this issue.

bhaller commented 5 years ago

Hi @molpopgen. Another user has reported the same issue, so I'm reopening this issue. This user is on Red Hat Enterprise Linux Server release 6.6 (Santiago), using g++ 7.3.0 – a different platform and a different compiler version than the previous user. His compile produced the same "plugin needed to handle lto object" errors as for the other user, and ultimately failed. He reports that more or less the same fix works for him:

SET(CMAKE_AR "gcc-ar")
SET(CMAKE_C_ARCHIVE_CREATE " qcs ")
SET(CMAKE_C_ARCHIVE_FINISH true)

But I think you're right that this fix seems gcc-specific. Is there a way to enclose the fix lines inside something that says "do this only if the compiler is gcc"?

I don't know why this bites only certain people; I think the LTO stuff is working fine for most people. Nevertheless, it is proving to be a hassle. I've done some timing tests, and it looks like the LTO fix is producing a measurable speedup, but an extremely small one – well under 1% for most models. So if there's not a simple fix here I'm tempted to pull the LTO change.
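For reference, a compiler-conditional version of the workaround might look like the sketch below. It assumes it runs after project(), so that CMAKE_C_COMPILER_ID is set. GCC_AR_WRAPPER is a hypothetical variable name. The `<CMAKE_AR> qcs <TARGET> <LINK_FLAGS> <OBJECTS>` strings use CMake's standard rule-variable placeholders; that the lines quoted in this thread originally contained them (and lost them in transit) is an assumption, not something the reports confirm.

```cmake
# Sketch only: apply the archiver workaround when the C compiler is GCC.
# Assumes project() has already run, so CMAKE_C_COMPILER_ID is defined.
if(CMAKE_C_COMPILER_ID STREQUAL "GNU")
    # gcc-ar is GCC's wrapper around ar that passes the LTO plugin to ar.
    # GCC_AR_WRAPPER is a hypothetical cache variable name.
    find_program(GCC_AR_WRAPPER NAMES gcc-ar)
    if(GCC_AR_WRAPPER)
        set(CMAKE_AR "${GCC_AR_WRAPPER}")
        # The 's' modifier writes the archive index, so the separate
        # ranlib step is replaced by the no-op command 'true'.
        set(CMAKE_C_ARCHIVE_CREATE   "<CMAKE_AR> qcs <TARGET> <LINK_FLAGS> <OBJECTS>")
        set(CMAKE_C_ARCHIVE_FINISH   true)
        set(CMAKE_CXX_ARCHIVE_CREATE "<CMAKE_AR> qcs <TARGET> <LINK_FLAGS> <OBJECTS>")
        set(CMAKE_CXX_ARCHIVE_FINISH true)
    endif()
endif()
```

With a guard like this, builds using clang or another compiler would keep CMake's default archive rules, which addresses the concern about the fix being gcc-specific. It has not been tested against the SLiM build.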

molpopgen commented 5 years ago

I'm sure a conditional application is possible, but I'm not sure how. You'll have to Google that one. I am much more familiar with GNU autotools than I am with cmake. Perhaps @petrelharp knows, or can ask his local cmake expert.

bhaller commented 5 years ago

OK. Indeed, @petrelharp & co. are the ones who set up cmake for SLiM in the first place; I have only a vague understanding of it.

petrelharp commented 5 years ago

Can you point me at what these LTO changes were?

bhaller commented 5 years ago

@petrelharp, here: https://github.com/MesserLab/SLiM/pull/28

douglasgscofield commented 5 years ago

Here to say, found the same thing in a large compute cluster compiling version 3.2.1. We're running 'CentOS Linux release 7.6.1810 (Core)' and I used cmake/3.13.2 and gcc/7.4.0 for compilation.

I got around it two different ways, either:

I'm sure these solutions are specific to using gcc, and I don't know CMake so I can't suggest alternative code for a pull request.

bhaller commented 5 years ago

@petrelharp, is anybody on your end working on this issue? If not, I think I might just remove the LTO stuff from the cmake file; the performance difference it makes is not large, and too many people are running into this issue. If we don't have a general fix for this, I think we should just pull LTO until such time as we do. @molpopgen, thoughts?

molpopgen commented 5 years ago

Fine to comment it out for the time being. There must be a way to apply this "fix" if the compiler is GCC, though, and presumably there is CI in place to make sure that clang is not affected by any changes?

bhaller commented 5 years ago

I have just commented out the LTO stuff for now. I plan to leave this issue open until such time as somebody figures out how to put the LTO stuff back in without breaking people's builds; it would be nice to have it enabled.
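One possible route for re-enabling it, sketched here under assumptions, is CMake's own CheckIPOSupported module (available from CMake 3.9), which probes whether the toolchain can actually build with interprocedural optimization before turning it on. The project name, target name, and source file below are hypothetical, and this has not been tried on the SLiM build or on the affected machines.

```cmake
cmake_minimum_required(VERSION 3.9)   # CheckIPOSupported was added in CMake 3.9
project(lto_sketch C CXX)             # hypothetical project name

# Ask CMake to test whether the current compiler and tools support IPO/LTO.
include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_ok OUTPUT ipo_msg)

add_executable(demo main.cpp)         # hypothetical target and source file

if(ipo_ok)
    # Enable LTO for this target only; CMake picks the flags for the compiler.
    set_property(TARGET demo PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
else()
    # Fall back to a normal build and report why LTO was skipped.
    message(STATUS "LTO/IPO not enabled: ${ipo_msg}")
endif()
```

The appeal of this design is that a toolchain that fails the probe simply builds without LTO instead of failing, and the logic is not tied to GCC. Whether the probe catches the specific "plugin needed to handle lto object" failure on the systems reported here would need to be checked.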

gshamov commented 5 years ago

First: I do have a similar problem building on CentOS7 with GCC 7.3.0 . LTO is detected by CMake, but then for every GSL and Table object I get messages about it. Like,

ranlib: vector.c.o: plugin needed to handle lto object

Second: why would you think that using a canned GSL with BLAS built from C sources is a good idea? Most HPC systems (like ours) do provide (m)any versions of GSL, and we have vendor-specified linear algebra libraries like OpenBLAS or MKL which will be easily an order of magnitude faster.

molpopgen commented 5 years ago

@bhaller It is interesting to note that this issue seems only to be reported from CentOS users so far?

bhaller commented 5 years ago

@bhaller It is interesting to note that this issue seems only to be reported from CentOS users so far?

@molpopgen and ubuntu, as with the original reporter (see top), right? But I'm not a Linux person at all, so for all I know those are the same thing. :->

bhaller commented 5 years ago

First: I do have a similar problem building on CentOS7 with GCC 7.3.0 . LTO is detected by CMake, but then for every GSL and Table object I get messages about it. Like,

ranlib: vector.c.o: plugin needed to handle lto object

OK, thanks for the report. The LTO stuff is now disabled in the GitHub head version of SLiM, so that should build fine for you. That will be released as SLiM 3.3 soon.

Second: why would you think that using a canned GSL with BLAS built from C sources is a good idea? Most HPC systems (like ours) do provide (m)any versions of GSL, and we have vendor-specified linear algebra libraries like OpenBLAS or MKL which will be easily an order of magnitude faster.

I'm happy to discuss this – interesting question! – but it doesn't belong in this thread; perhaps you could open a new issue?

molpopgen commented 5 years ago

That initial report on Ubuntu 14 isn't really reproducible. That version is now 5 years old and the tool chain is ancient. Everything works just fine on all current versions including continuous integration services. It would be useful to know if people reporting problems are using tool chains from the distribution or if they have been upgraded after the fact.

molpopgen commented 5 years ago

Looking back on the first report: 14.04 worked, but 16.04 didn't. That's odd, IMO. But, things happen. My development box where fwdpy11 is tested is 16.04 with GCC 5.5 and everything works fine there. One key difference is that I am not trying to compile GCC on that system ever.

bhaller commented 5 years ago

Also someone on Red Hat Enterprise Linux Server release 6.6 (Santiago), above. Seems like a pretty mixed bag. I don't know why it bites particular people and not others.

molpopgen commented 5 years ago

RHEL and centos are "the same". The former is the Enterprise version of the latter.

molpopgen commented 5 years ago

Also someone on Red Hat Enterprise Linux Server release 6.6 (Santiago), above. Seems like a pretty mixed bag. I don't know why it bites particular people and not others.

What may be happening is that more errors are being seen on CentOS/RHEL because they are commonly deployed on clusters. The Debian family (including Ubuntu and Pop OS) seem to find their home primarily on desktop/laptop/workstation setups, which may be more rare.

bhaller commented 5 years ago

What may be happening is that more errors are being seen on CentOS/RHEL because they are commonly deployed on clusters. The Debian family (including Ubuntu and Pop OS) seem to find their home primarily on desktop/laptop/workstation setups, which may be more rare.

That makes sense, especially since many people using SLiM on a desktop will be using macOS, and thus the double-click installer or Xcode, rather than building at the command line.

molpopgen commented 5 years ago

One option for a short-term fix is to allow opt-in for FLTO. The default would be OFF, and then invoking something like cmake . -DUSE_FLTO=1 would enable it.
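A minimal sketch of such an opt-in switch, assuming LTO is enabled simply by adding -flto to the compile flags (how SLiM's CMakeLists.txt actually did it is not shown in this thread). USE_FLTO is the option name from the suggestion above; nothing by that name is known to exist in the project.

```cmake
# Hypothetical opt-in switch; LTO stays off unless the user asks for it.
option(USE_FLTO "Build with link-time optimization (-flto)" OFF)

if(USE_FLTO)
    # Assumption: appending -flto to the C and C++ flags is enough here.
    set(CMAKE_C_FLAGS   "${CMAKE_C_FLAGS} -flto")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -flto")
    message(STATUS "LTO enabled via USE_FLTO")
endif()
```

A user would then run something like cmake . -DUSE_FLTO=ON to enable it, and a plain cmake . would give the default non-LTO build.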

bhaller commented 5 years ago

I'm inclined to leave it as is until a complete fix comes along. The performance gains I measured were quite small, so it's not really worth complicating the story for users; if it can work automatically that's great, but if it requires a switch, documentation, etc., then meh. Thanks for thinking about it, though, I appreciate it.

signalogic commented 1 year ago

For anyone who arrives here after hours of scouring Stack Overflow and other forums: on Ubuntu 16.04 with ldd 2.23 or 2.24 we had to turn off LTO (i.e., not use -flto in the compile flags), or else we would see linker messages like "hidden symbol `our_sym' in /tmp/ccuYFfn5.ltrans4.ltrans.o is referenced by DSO". We tried changing the link order, applying __attribute__ ((visibility ("default"))) to specific symbols, -Wl,-export_symbol, etc. We continuously test on gcc versions from 4.6.4 to 11.3, and so far only these versions of ldd have this issue (it happened with gcc 6.2, 6.5, and 7.4).