Closed bryce-carson closed 2 months ago
Whether the patch is "upstreamed," and included in the official SLiM sources, or is only distributed in the RPM as a patch is up to you, @bhaller. It depends if you're able to successfully detect the version of the Linux distribution that SLiM is building on in the CMakeLists.txt file.
You can get a glimpse of what distribution is used with the command uname -a
, but that's a fragile way to do it. I don't know of betters ways myself, but I'm sure there's a CMake plugin or script somewhere that does the job somewhat reliably.
Gah. I'm not good at CMake stuff myself; the original CMake was put together by someone else, I only patch it up. Even if we could detect the distro reliably in CMake, what would we actually do about it? It sounds like a bug somewhere in the system that got fixed later; not clear how we'd work around it.
I guess it looks like Rupert is the only person bitten by this, since it has been dead for a long time and he's the first to complain, right? He says this RHEL 8 thing is supported until '29 or something; but how many people are actually using it? Simplest seems to me to just say it isn't supported, and tell those folks (i.e., Rupert) to build from sources. Do you disagree?
In fact, pondering this further – if they claim to support this platform until '29, then they should fix this issue themselves, upstream of us. Programs like ours shouldn't have to work around bugs in the distro, if it's still supported – especially since it looks like the bug was found and fixed already, since it doesn't manifest in RHEL 9. Anyhow, I feel like this has a huge Somebody Else's Problem field around it.
Gah. I'm not good at CMake stuff myself; the original CMake was put together by someone else, I only patch it up. Even if we could detect the distro reliably in CMake, what would we actually do about it? It sounds like a bug somewhere in the system that got fixed later; not clear how we'd work around it.
Indeed! I suppose if we did work around the issue it would be moving the binaries or changing directory before linking, or renaming the directories containing the object files before linking. :shrug:
I guess it looks like Rupert is the only person bitten by this, since it has been dead for a long time and he's the first to complain, right? He says this RHEL 8 thing is supported until '29 or something; but how many people are actually using it? Simplest seems to me to just say it isn't supported, and tell those folks (i.e., Rupert) to build from sources. Do you disagree?
Nah, we should keep supporting it. RHEL 8 is indeed supported until 2029. Some sites might upgrade from 7 to 8, and rather than directly to 9, and 7 is only just going EOL in a couple months (7 is unsupported by us, for sure).
There may be other users as well.
Given we previously supported the os, dumping it when it would otherwise work fine because of a regression (in an enterprise os nonetheless!) seems inappropriate.
In fact, pondering this further – if they claim to support this platform until '29, then they should fix this issue themselves, upstream of us. Programs like ours shouldn't have to work around bugs in the distro, if it's still supported – especially since it looks like the bug was found and fixed already, since it doesn't manifest in RHEL 9. Anyhow, I feel like this has a huge Somebody Else's Problem field around it.
That's an excellent point! The problem isn't ours; it is an issue with RHEL 8 indeed. I'll open a ticket on the RHEL Bugzilla (if I can) and have the issue resolved.
I was thinking there may have been some interim change that we'd need to deal with, but given the binaries linked correctly on RHEL 8 in an earlier version of SLiM, but now doesn't, means there's a regression.
... Anyhow, I feel like this has a huge Somebody Else's Problem field around it.
That's an excellent point! The problem isn't ours; it is an issue with RHEL 8 indeed. I'll open a ticket on the RHEL Bugzilla (if I can) and have the issue resolved.
I was thinking there may have been some interim change that we'd need to deal with, but given the binaries linked correctly on RHEL 8 in an earlier version of SLiM, but now doesn't, means there's a regression.
Yeah, seems that way to me. Thanks for opening a ticket. Let's see what happens with it. :->
Hi @bryce-carson. Can you put a link to the ticket you opened here? And, is there any sign that they intend to do anything about it? How much of a problem is this in practice – I guess only one person has complained about it so far? RHEL is on 9 now; if they don't want to fix the bug in RHEL 8, I guess we can just tell people that they should update to RHEL 9 if they want the bug fixed? Do you think this issue should remain open here – is there anything further that you plan to do with it? I'm comfortable marking it wontfix and closing it, especially once the link to the RHEL ticket you opened is added here.
Hi @bryce-carson. Can you put a link to the ticket you opened here?
I'll need to find it. I didn't bookmark it, if I indeed created the ticket (I really hope I did, given this issue is three months old now). I'm looking through my browser history and I logged into BugZilla with one of my accounts to search through my tickets; I may have other BugZilla accounts, I'm unsure. I'll also search BugZilla rather than looking through my open tickets on various accounts.
@bhaller Oh dear. It appears I let creating the ticket slide off of my TODO list without actually opening the ticket; I'm sorry. I'm opening the ticket now.
Do you think this issue should remain open here – is there anything further that you plan to do with it?
Hmm... labelling it wontfix is okay, but since the issue still affects us and may affect a user, who might not be clever and search through closed issues as well as open issues, maybe we should leave it open so it is easier to find? It's up to you; an issue we cannot fix, and which we've already addressed as much as possible isn't an open issue, it is a shut case.
Closing the issue only has the benefit of hiding the issue by default and reducing the open issue counter; labelling it WONT-FIX and leaving it open has more benefit, I think. The user-base might open a duplicate issue or email us if they cannot find it immediately.
I'm comfortable marking it wontfix and closing it, especially once the link to the RHEL ticket you opened is added here.
Indeed. I'm working on opening the ticket now.
I guess we can just tell people that they should update to RHEL 9 if they want the bug fixed?
For now we aren't building/shipping RPM packages for RHEL 8, or derivative distributions with the EPEL 8 repository enabled (Extra Packages for Enterprise Linux 8), so yes.
For reference, this is the build of the latest RPM package version on COPR, and I have disabled the RHEL 8 and EPEL 8 builders. If you happen to play around and look at the other builds you'll see that I resubmitted a build for just RHEL 8 and EPEL 8; that's for an older RPM package version, and only exists so that I can let it fail. The older builds from three months ago were automatically deleted, so I'm just regenerating logs to attach with the bug report. Having a failed build won't affect anyone using the COPR repo.
Thanks @bryce-carson. Leaving it open but marked wontfix, as you recommended. We can close it when they fix their ticket, or when we decide nobody cares about RHEL 8 any more, I guess.
A Red Hat engineer has looked at the issue and helped me identify that a possible reason this is failing is not an issue with ld
itself, but our CMake build system and rpmbuild. The version of rpmbuild used on RHEL 8 on the COPR builders is suspect, Nick Clifton suggested; that or CMake is behaving strangely.
https://issues.redhat.com/browse/RHEL-49714
I'm just checking in; this comment probably isn't that helpful to read.
Sounds like a conflict between different uses of the name of eidos
, and probably a CMake bug, yes. Hmm. A subdirectory named eidos
is being made to hold intermediate build products or something, but CMake also wants to use that name for a library it builds and links to with ld
. Normally the library would be named eidos.o
or eidos.dylib
or something, I think, and so there would not be a conflict? Anyhow, I'm just stabbing in the dark based on skimming that issue. If it is a CMake bug, we might or might not be able to find a workaround for it. At a minimum, we could have the CMake script check the CMake version number, and if it is the offending version, throw an error saying "This version of CMake has a bug; you need to update your CMake", I suppose. But I wonder why we're being bitten by this problem and (apparently) other people are using CMake on RHEL 8 happily. There must be something unusual about our usage pattern, such as having both a build product named eidos
(the command-line executable) and also an internal target named eidos
(the library that is a component of slim
and SLiMgui
). Renaming that internal target to something like eidoslib
would probably avoid the problem, I would guess. Do you have a way to test this now?
There must be something unusual about our usage pattern, such as having both a build product named
eidos
(the command-line executable) and also an internal target namedeidos
(the library that is a component ofslim
andSLiMgui
). Renaming that internal target to something likeeidoslib
would probably avoid the problem, I would guess. Do you have a way to test this now?
I don't have a testing methodology thought of, but I'll make note of it and see if I can get around to it some time.
This might be useful, though: https://discourse.cmake.org/t/how-to-use-the-same-name-for-a-target-and-a-library/5044/2.
I don't have a testing methodology thought of, but I'll make note of it and see if I can get around to it some time.
Great. I think I might know how to fix this issue, but I want a test first, so I can see the test go from red to green; otherwise who knows if I've actually fixed it? :-> So, ping me when you're able to test this, and then I'll proceed. Thanks!
This might be useful, though: https://discourse.cmake.org/t/how-to-use-the-same-name-for-a-target-and-a-library/5044/2.
It might indeed!
@bhaller, I don't have the time to invest into a test right now. I'm not sure what that'd look like anyways, other than a build on COPR. If you still have access to your COPR account you can try submitting a build.
@bhaller, I don't have the time to invest into a test right now. I'm not sure what that'd look like anyways, other than a build on COPR. If you still have access to your COPR account you can try submitting a build.
Oh, I wasn't suggesting an automated unit test or something. Just "a way to test that the problem is fixed". If a build on COPR suffices, then I'll fix this before the next SLiM release (whenever that happens), and then we'll see whether the fix worked in the COPR build. Does that work?
@bhaller, I don't have the time to invest into a test right now. I'm not sure what that'd look like anyways, other than a build on COPR. If you still have access to your COPR account you can try submitting a build.
Oh, I wasn't suggesting an automated unit test or something. Just "a way to test that the problem is fixed". If a build on COPR suffices, then I'll fix this before the next SLiM release (whenever that happens), and then we'll see whether the fix worked in the COPR build. Does that work?
Yep! The build will either work or it won't (it will say "Failed" or not).
OK, I've looked into this a bit more. As you wrote:
The object files being linked together are in the directory
CMakeFiles/eidos.dir/eidos/
; if the output binaryeidos
is emitted in the directory containing the path just mentioned, it is unclear why there is a conflict.
I think that is correct, and so I think the conflict is actually not between the directory CMakeFiles/eidos.dir/eidos/
and the output binary eidos; it is between the source directory eidos
and the output binary eidos
. That's my new hypothesis: the build process for these platforms attempts to do the build inside the source directory (which is a really bad idea precisely because of this kind of problem), and so there is a conflict with the source directory named eidos
.
If that is true, there are four possible solutions: change the name of the source directory (conceivable, but rather a pain, and would screw up the Git history), change the name of the target (unacceptable), change the build process to do the build outside of the source directory (strongly preferable), or change the location where the final target goes (presumably putting it in a subfolder called targets
or something, to avoid this conflict – conceivable, but will break everybody in the universe, but we could probably do it only in the situation where a name conflict is detected, or something).
Before I bark up this tree further, can you talk to your contact further, @bryce-carson? Can we find out whether I am correct in this hypothesis – that they are doing their builds directly inside the source directory itself, not even inside a subfolder that they create? Thanks!
OK, I've looked into this a bit more. As you wrote:
The object files being linked together are in the directory
CMakeFiles/eidos.dir/eidos/
; if the output binaryeidos
is emitted in the directory containing the path just mentioned, it is unclear why there is a conflict.I think that is correct, and so I think the conflict is actually not between the directory
CMakeFiles/eidos.dir/eidos/
and the output binary eidos; it is between the source directoryeidos
and the output binaryeidos
. That's my new hypothesis: the build process for these platforms attempts to do the build inside the source directory (which is a really bad idea precisely because of this kind of problem), and so there is a conflict with the source directory namedeidos
.If that is true, there are four possible solutions: change the name of the source directory (conceivable, but rather a pain, and would screw up the Git history), change the name of the target (unacceptable), change the build process to do the build outside of the source directory (strongly preferable), or change the location where the final target goes (presumably putting it in a subfolder called
targets
or something, to avoid this conflict – conceivable, but will break everybody in the universe, but we could probably do it only in the situation where a name conflict is detected, or something).
By default, CMake in RPM builds and on Fedora are out of source according to this documentation. When building an RPM on Fedora, you must use a documented by unsupported macro %__cmake_in_source_build
to achieve an in-source-directory build.
Knowing this is the default, and hasn't been a problem until this issue arose when something (nebulous, yes) changed in CMake shipped in the RHEL 8 repositories.
Before I bark up this tree further, can you talk to your contact further, @bryce-carson? Can we find out whether I am correct in this hypothesis – that they are doing their builds directly inside the source directory itself, not even inside a subfolder that they create? Thanks!
By default, the out of source builds are located in the following directory, with redefinition possible using the below macros (Fedora documentation converted to Markdown here for your convenience, Ben).
Build systems that support defining source/build directories via RPM macros:
Macros
%_vpath_srcdir
Path (relative to the RPM build directory) where the sources to be built are located (default: .
).%_vpath_builddir
Path (relative to the RPM build directory) where the built objects are located (default: %{_target_platform}
).%{_target_platform}
is where the built objects should be located, and apparently some directory eidos is conflicting with the target binary there, somehow?
This has been fixed by @bryce-carson in https://github.com/MesserLab/SLiM/pull/469 (yay!). Closing.
RHEL 8 uses the binutils package version 2.30-123.el8. The version of
ld
, the GNU linker, provided by this package is unable to resolve a conflict between the names of compiled binaries and directories in the build directory (note to self: which is created by CMake under the direction of mock and thereby rpmbuild):EPEL 8 link error (the EPEL 8 builder is RHEL 8 with the EPEL 8 repository enabled)
RHEL 8 link error
The object files being linked together are in the directory
CMakeFiles/eidos.dir/eidos/
; if the output binaryeidos
is emitted in the directory containing the path just mentioned, it is unclear why there is a conflict.There is a difference in the errors encountered by the EPEL 8 builder and the RHEL 8 builder: the EPEL 8 builder continues with the build and finds that it cannot link eidos or slimgui because of the name conflict, while RHEL 8 stops after the name conflict with eidos.
Whatever version of ld used on RHEL 8, there is difficulty in reproducing the build issue locally because I do not have a subscription to RHEL 8, and I cannot use mock locally to reproduce the build, preserve the build directory, and then examine the contents thereof. Exactly where
eidos
orslimgui
are emitted needs to be determined so the name conflict can be resolved with a patch.ld
; we're working on it."