jakirkham opened 3 years ago
This came up in a numpy issue uncovered by testing the 1.21.0 release candidates for conda-forge; in particular, a test fails due to a bug in glibc 2.12 (no longer present in 2.17).
There is a patch to work around the bug, but @rgommers asked:
CentOS 6 is >6 months past EOL. Official EOL really should be the last time to drop a constraint. The patch for working around this problem seems fine to accept, but exercises like these are a massive waste of time. For manylinux we're also fine to move to glibc 2.17 (= manylinux2014), for the same reason. What's the problem here exactly?
I brought this comment into the feedstock PR, where @xhochy noted:
I think we should in general move conda-forge to
cos7
but here is probably not the right place to discuss this. Probably we already have an issue for that.
Hence moving it here.
Also xref #1432
Was this issue discussed further at recent core meetings (I've occasionally seen public hackmd notes, but no idea where to find a collection of them)?
Any statistics or arguments that go against doing this?
Assuming it should be done, this probably needs a migrator (for adding `sysroot_linux-64 2.17` everywhere?). I'd be willing to help, but would need some more pointers.
It was. Nothing conclusive yet. We collect the meeting notes here
Informally we know there are still some CentOS 6 users (the long tail of support). That said, we do lack statistics either way, so this is something we discussed, namely how best to collect them.
Yeah, I think we need to decide this is something we want to do first, which we haven't done yet.
I understand that some people are stuck on EOL'd OSes, but IMO the case to hold back based on that is really tenuous. If you're on an EOL OS, you eventually get no software updates anymore - why should conda-forge go out of its way to still service those users?
I have to agree with @rgommers' statement (I quoted) above - stuff like https://github.com/numpy/numpy/issues/19192 has a real cost. It probably bound 10-20h of maintainer (resp. core contributor) time in total, and would have been completely avoided without an ancient glibc.
Another datapoint: I now have a staged-recipes PR that cannot build because the GPU-build only has glibc 2.12 (pytorch >=1.8 needs 2.17), and the CentOS7 build doesn't start: https://github.com/conda-forge/staged-recipes/pull/16306
That's not a datapoint. We've documented this in our docs on how to use CentOS7.
I know how to do it per-feedstock, but the above package cannot currently make it through staged-recipes, or at least I'll need help to pull it off. Someone could also merge it and I'd fix things once the feedstock is created. But it's um... suboptimal... and definitely related to CentOS 6, so I'd still call it a datapoint.
I know how to do it per-feedstock, but the above package cannot currently make it through staged-recipes, or at least I'll need help to pull it off.
Have you tried doing the same in staged-recipes? It should work.
It does work on staged-recipes, see here for an example (CentOS 6 fails as expected, but the CentOS 7-based job passes and the feedstock is generated correctly thanks to the `conda-forge.yml` in the recipe directory.)
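For context, the mechanism referenced here is, as far as I understand it, a `conda-forge.yml` placed alongside the recipe; a minimal sketch (path illustrative, keys per the conda-forge docs of the time):

```yaml
# recipes/<package>/conda-forge.yml
# Requests the CentOS 7 image/sysroot for the Linux build of this recipe.
os_version:
  linux_64: cos7
```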
That said, I am noticing more and more places where CentOS 6 issues are appearing, and moving a feedstock to CentOS 7 forces downstream feedstocks to change as well, requiring yet more manual intervention.
In the last few weeks, I've probably spent upwards of 15h chasing down bugs that ended up being resolved by moving to CentOS 7. This is a real cost. The same goes for less experienced contributors who run into cryptic resolution errors when trying to package something that (now) needs a 2.17 sysroot, and end up abandoning their recipes.
@jakirkham: Informally we know there are still some CentOS 6 users (the long tail of support).
Can we quantify this? CentOS 6 is EOL for a year now. Why are we so beholden to that long tail? Are those parties contributing to conda-forge somehow (infra costs or packaging effort)? If not, why are we providing free support longer than even RedHat? More to the point: why do we accept them externalizing their costs for not dealing with 10+ year old software to conda-forge?
That said, we do lack statistics either way, so this is something we discussed, namely how best to collect them.
If it takes X months to collect those statistics, that is a bad trade-off IMO.
@conda-forge/core Does anyone have any objections to changing the default sysroot to CentOS 7? If not, I'll make PRs to change it early next week.
I know of users this will impact.
What exactly is the problem with our current setup?
I also know users whom this will affect, including myself. I also know people using CentOS 5-like systems with conda who will continue to do so for at least the next decade, so we can't wait until nobody is using CentOS 6 anymore.
What exactly is the problem with our current setup?
(`#define`s as they were available when GCC itself was built.) Over the last 6 months, hundreds of hours must have been spent dealing with these issues, and I'm not convinced hundreds more should be spent over the next six months. For people really stuck on CentOS 6, we could add a global label (like `gcc7` and `cf202003`), or they can go around forcing the old sysroot using the same mechanism as we currently use for upgrading to CentOS 7, if they really need to.
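The per-feedstock mechanism alluded to is, roughly, declaring the newer sysroot in the build requirements; a sketch (package name and selector syntax as used by conda-forge at the time):

```yaml
# meta.yaml fragment -- opt a single feedstock into the 2.17 (CentOS 7) sysroot
requirements:
  build:
    - "{{ compiler('c') }}"
    - sysroot_linux-64 2.17  # [linux64]
```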
Global labels don't get repodata patching, which at this point would likely render the channel wrong.
100% agree with what @chrisburr wrote. There are also some pretty gnarly bugs in the trigonometry functions of glibc < 2.17 that have bitten me at least 3 times already.
@beckermr: I know of users this will impact.
And they can keep using old packages, or use paid support for their ancient platforms. I empathise that there are some people between a rock and a hard place, but again:
why do we accept them externalizing their costs for not dealing with 10+ year old software to conda-forge?
Those 100s of hours Chris is mentioning might be "free" but they come at the cost of other things not being improved or fixed or packaged, and barring strong countervailing reasons, that's IMO a horrible trade-off to make against the ecosystem in favour of an unspecified handful of people who cannot manage to run less-than-decade-old software, yet need the newest packages.
Many folks stuck on an older centos are not there by choice. They are constrained by the lack of upgrades on big systems run by government labs, etc. The idea that they can simply pay for support is a non-starter to anyone who works in or understands how those organizations work.
I am bringing this up because the remedies for using cos6 that folks keep bringing up here are not really available to the people that need cos6.
We are making a choice to leave them behind when a majority of the software we build does not require cos6 at all.
I suspect a much better path would be to further improve our support for cos7 in smithy or our bots.
Many folks stuck on an older centos are not there by choice. They are constrained by the lack of upgrades on big systems run by government labs, etc.
If you are referring to DOE labs, last time I heard the BES office demanded a thorough upgrade of its facilities due to cybersecurity concerns (cc: @mrakitin), and I assume similar mandates have also been issued by other offices.
@beckermr the legacy software on the legacy systems will keep running even if conda-forge starts building on CentOS 7. CentOS 6 was released literally 10 years ago. Government labs running inefficient HW and SW stacks is not something anyone should encourage or promote. That hurts the economy, research, and the environment. Those systems cost everyone time and money (along with conda-forge people and contributors). My understanding is that both build performance and the performance of the built libs differ between CentOS 6 and 7, isn't this true?
Thanks for the responses everyone!
I don't see anyone addressing directly the points I raised. The cost here is the time for folks who need cos7 and don't know it when they are building a package. They see an odd error and it costs them time to track down. I 100% agree that this cost is real.
Moving the default to cos7 is one way to reduce this cost. However it is not the only way. My premise is that given the headache this will cause for cos6 users in general, and that fact that cos7 is not required the majority of the time, we're better off improving the tooling around cos7 so that maintainers can better use it.
Global labels don't get repodata patching, which at this point would likely render the channel wrong.
Good point, I forgot about this. Hopefully the `__glibc` constraint can be good enough to allow people to keep using the channel.
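As a sketch of how the `__glibc` virtual package could be used here (assuming conda's virtual-package support of the time), a package built against the newer sysroot could declare:

```yaml
# meta.yaml fragment -- keep the package off systems with an older glibc;
# __glibc is a conda virtual package detected from the running system
requirements:
  run_constrained:
    - __glibc >=2.17  # [linux]
```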
I suspect a much better path would be to further improve our support for cos7 in smithy or our bots.
This might be an option, but I'm not sure it's easy to do the "right" thing, and it might not even be possible. How do you see this working? I have two ideas, and I think I would lean towards option 1 for simplicity.
1. The bot automatically migrates downstream feedstocks as soon as an upstream feedstock moves to be CentOS 7-only.
2. Try to be smarter and use solvability as a constraint, i.e. `Y` depends on `X`; `X=1.1` was built with CentOS 6 and `X=1.2` was built with CentOS 7, so which `X` should `Y` build with? Use `run_exports` to guide the process? I'm not sure how stable it will be, and I suspect there are a lot of unstable edge cases. In particular, what happens if both CentOS 6 and CentOS 7 are unsolvable?
3. Change the default docker image to be cos7 for all feedstocks, but keep the sysroot at cos6. This would remove the solver errors.
Option 1 is not really correct. If python itself was cos7 only, we wouldn't need cos7 constraints/builds of noarch python packages.
Option 2 I am not following.
The core issue here as I understand it is that people are having trouble recognizing errors as being due to not having cos7 and when they do, they are having trouble enabling the cos7 build.
I think we could add an admin command that would convert a recipe to cos7 automatically in a pr. This should work all the time except for gnarly cases around cuda. It would more or less solve the issues around provisioning cos7 on feedstocks.
For recognizing errors as being due to cos7, that is a much tougher problem.
I like option 3, provided we trust the majority of packages to be using the correct sysroot? (i.e. from `$BUILD_PREFIX` and not `/`)
Good point @chrisburr. We might get another class of errors if the docker image doesn't match and the builds are wrong. Otoh those builds are actually wrong and so likely should be fixed anyways.
@chrisburr, there's no sysroot in `/` in our docker images (unless there's a `yum_requirements.txt` file)
Option 3 has been working well for us at Bioconda since March/April or so.
(I.e., using `quay.io/condaforge/linux-anvil-cos7-x86_64` as the base but not requiring `sysroot_linux-64=2.17` by default.)
One minor thing to keep in mind is then we don't automatically test our packages on CentOS 6 user space anymore -- which isn't necessarily a bad thing because we then test on CentOS 7's which we didn't before. (The corner cases where testing on COS 6 would matter I expect to be very few.)
Two things I'd like to know:
1. `sysroot_linux-64>=2.17`?
2. (E.g., I encountered `O_PATH` not being defined a couple of times; `man 2 openat | grep 'O_PATH.*since'` says `O_PATH (since Linux 2.6.39)`.)
))(I guess 2. is just me being too protective and we probably/hopefully don't need it. I'd just like to avoid having maintainers having to argue around fulfilling the needs of a small/shrinking user base.)
The idea that they can simply pay for support is a non-starter to anyone who works in or understands how those organizations work.
I understand how these organisations work, but the point is why does it fall to conda-forge to bridge that gap, and more importantly, who decides that supporting these users is worth all the lost time and opportunity cost for everyone else.
Personally, I find it ~~preposterous~~ comically bizarre for a small band of volunteers to try to provide free maintenance longer than a behemoth enterprise like RedHat.
I don't see anyone addressing directly the points I raised.
I have trouble with that statement, because neither are you acknowledging any of the costs beyond the cryptic resolver errors. Here are some examples (1 2 3 4) of things that consumed hours of debugging and upstream maintainer time just because the feedstocks were on cos6 (i.e. the errors simply went away with cos7). I'm sure other people are hitting such issues as well - how many will not find the magic answer and just give up?
For example, how would anyone guess that the following is due to using an old glibc (for me it was sheer luck and persistence)?
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 2 / 108 (1.85%)
E Max absolute difference: 1.
E Max relative difference: 0.04761905
E x: array([[[ 0., 1., 2., 3., 4., 5.],
E [ 6., 7., 8., 9., 10., 11.],
E [12., 13., 14., 15., 16., 17.],...
E y: array([[[ 0, 1, 2, 3, 4, 5],
E [ 6, 7, 8, 9, 10, 11],
E [12, 13, 14, 15, 16, 17],...
Maintaining feedstocks and debugging weird errors is enough work as it is, and accepting the possibility of such insanely hard-to-fix bugs is a huge cost. Vague gesturing that some organisations cannot move on is IMO not nearly a compelling enough argument for that, and there should be more transparency in that decision-making process.
I understand how these organisations work, but the point is why does it fall to conda-forge to bridge that gap, and more importantly, who decides that supporting these users is worth all the lost time and opportunity cost for everyone else.
Personally, I find it preposterous for a small band of volunteers to try to provide free maintenance longer than a behemoth enterprise like RedHat.
That's because core maintainers like @beckermr care about older sysroots for their dayjob, and I care about ppc64le for my dayjob. Unless you would like to lose core maintainers, let's not call it preposterous.
That's because core maintainers like @beckermr care about older sysroots for their dayjob, and I care about ppc64le for my dayjob.
That's a very different kind of argument than appeared so far.
Unless you would like to lose core maintainers, let's not call it preposterous.
Certainly not! (though how one implies the other is really not obvious either). In any case, I've replaced that word - which was aimed at the sheer ambition, not anyone individually. Apologies if that was ambiguous.
Thanks for the comments @h-vetinari!
I want to reiterate that I fully acknowledge the frustrations you and others are having. When I said, "They see an odd error and it costs them time to track down," I meant more than solver errors, but also missing symbols/defines, and any other errors (like the one above).
Nobody, including me, is asking you or anyone else to go around and maintain cos6 compatible software through patches etc. Feedstock maintainers are free to turn on cos7 and move on should they desire. What I am asking of everyone here is to not force cos7 on the full ecosystem (i.e., other feedstocks) and to at least be understanding of the desires of others who are working within real external constraints around cos6. If someone shows up to your feedstock and is willing to maintain cos6 stuff because they need it, I would hope that we would all be generous and not actively prevent them from doing so.
I want to say also that I, with @isuruf, added cos7 to start with. One of the motivations there was to help support users that I work with in my day job. Isuru is right also that people I work with in my day job depend on cos6, including me. So in fact I depend on both! Shipping widely compatible software is really hard it turns out. :)
I've already had to explain to users of a base Anaconda install why only some of their environments work on cos6 (the fully conda-forge ones) and others (the ones pulling from Anaconda) complain about `GLIBC`. So whatever we do, we shouldn't cause breakage to those users; they should just slowly stop seeing updates.
Thanks for the response @beckermr!
If someone shows up to your feedstock and is willing to maintain cos6 stuff because they need it, I would hope that we would all be generous and not actively prevent them from doing so.
Absolutely, I would never stand in the way of that. In fact, I go out of my way to keep base packages like numpy/scipy on cos6, because having those on cos7 would spread "virally" to a lot of the ecosystem.
What I am asking of everyone here is to not force cos7 on the full ecosystem (i.e., other feedstocks) and to at least be understanding of the desires of others who are working within real external constraints around cos6.
My first reflex would be to suggest to default to cos7, but allow feedstocks to opt-in to cos6. However, I realise that this is probably not feasible for that segment of users to go hunting down all feedstocks they need and their dependencies...
With that in mind, how about adding a migrator such that the CI runs with both cos6 & cos7 by default. That way, we would solve the debugging issue (if a maintainer sees that cos7 passes where cos6 fails with the exact same recipe, they can either drop cos6 or know where to start debugging). Yes, this would increase the CI usage, but at least that way the burden of bridging those gaps would fall on machines and not on humans (and to my knowledge, c-f is not exposed to volume-dependent pricing on azure pipelines).
What are your thoughts on option 3 above. With that no more solver issues. Runtime bugs like the numpy issue will not be there. Only compile errors.
What are your thoughts on option 3 above. With that no more solver issues. Runtime bugs like the numpy issue will not be there. Only compile errors.
I guess I had not fully appreciated the beauty of this option. If indeed that means we compile with ~cos6~ old glibc but test with cos7 in the same CI job, then I'm all for it.
Regarding resolver errors, I'm not sure I understand though - however we "keep the sysroot at cos6" - if a lower bound of a build dependency moves up such that only cos7 builds are available anymore, isn't that going to lead to the same cryptic errors?
Regarding resolver errors, I'm not sure I understand though - however we "keep the sysroot at cos6" - if a lower bound of a build dependency moves up such that only cos7 builds are available anymore, isn't that going to lead to the same cryptic errors?
Nope.
I went to make sure I understand the answer to this question. Cos6-built code is ABI-compatible with cos7, but not the other way around. So if a host dependency needed for linking is cos7-only, then as long as we are using a cos7 container, we can create the host environment. At compile time, we get the cos6 ABI in the package we build. Then at run/test time, we are again in a cos7 container, so everything links properly (because cos6 code can link against cos7 system packages).
This is a pretty clever build setup!
Looks correct. The only issue is with static builds: you'll need a cos7 sysroot for all downstream projects if we are using static libraries.
Is there a mutex-like package we should be attaching to static libraries to ensure this happens or at least throws an error? Or given cfeps we have on static libs are we going to declare this to be outside our realm of support?
For static libraries, you can add `sysroot_linux-64 2.17` as a `run_constrained`, but this is a very rare case that we don't have to bother with.
Ahhhhh yes. Use the sysroot as the mutex itself. We should at least document this as a way to do things even if we don't go around making it happen or actively supporting it.
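Documenting the sysroot-as-mutex idea, a package shipping static libraries built against the 2.17 sysroot might carry something like this (a sketch, not an officially supported pattern):

```yaml
# meta.yaml fragment -- the sysroot package itself acts as the mutex,
# steering downstream consumers of the static lib onto the same sysroot
requirements:
  run_constrained:
    - sysroot_linux-64 2.17  # [linux64]
```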
There seems to be a consensus so I've (hopefully) implemented "Option 3" in https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/2241/files
I'm running into a problem with llvm openmp 16 that looks like it might be testing the limits of what our current setup can handle.
openmp needs a glibc newer than 2.12 for its assumptions about what's in `#include <inttypes.h>`, but even if I switch to the 2.17 sysroot, I then get: `undefined reference to 'memcpy@GLIBC_2.14'`. I think it is due to this, and it sounds to me like that change of memcpy behaviour might be deep enough that we really need to have everything compiled against 2.17?
Perhaps @isuruf has another ace up his sleeve though? Just wanted to note that openmp >= 16 currently looks unbuildable both with and without `sysroot_linux-64 =2.17`.
I tried to push some changes that force cos7 for openmp 16.
FWIW, opencv just moved to COS7 with the release of 4.7.0. https://github.com/conda-forge/opencv-feedstock/pull/346
I think we are well beyond the life cycle of COS6, and many packages I've seen attempt to use newer features more and more: https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux#Version_history_and_timeline
We've been putting this off for a long time. I'd advocate we continue to do so and not switch until absolutely necessary. We should understand what the exact issue is here before we proceed.
I would like to ask for guidance on what to do about `clock_gettime`. It seems that it is provided by glibc in 2.17. However, since we want to support COS6, we shouldn't really "update" to COS7. Should we add the `-lrt` flag?
https://github.com/conda-forge/zstd-feedstock/pull/67
@beckermr zstd seems to be hitting the need to update to cos7 -- https://github.com/conda-forge/zstd-feedstock/pull/71
While we could likely patch things away, it seems like busy work.
On my present environment, the following packages depend on `zstd`:
$ mamba repoquery whoneeds zstd
Name Version Build Depends Channel
─────────────────────────────────────────────────────────────────────────────
blosc 1.21.4 h0f2a231_0 zstd >=1.5.2,<1.6.0a0 conda-forge
boost-cpp 1.78.0 h6582d0a_3 zstd >=1.5.2,<1.6.0a0 conda-forge
c-blosc2 2.9.3 hb4ffafa_0 zstd >=1.5.2,<1.6.0a0 conda-forge
curl 8.1.2 h409715c_0 zstd >=1.5.2,<1.6.0a0 conda-forge
imagecodecs 2023.1.23 py39h9e8eca3_2 zstd >=1.5.2,<1.6.0a0 conda-forge
libcurl 8.1.2 h409715c_0 zstd >=1.5.2,<1.6.0a0 conda-forge
libllvm15 15.0.7 h5cf9203_2 zstd >=1.5.2,<1.6.0a0 conda-forge
libnetcdf 4.9.2 nompi_h0f3d0bb_105 zstd >=1.5.2,<1.6.0a0 conda-forge
libsystemd0 253 h8c4010b_1 zstd >=1.5.2,<1.6.0a0 conda-forge
libtiff 4.5.1 h8b53f26_0 zstd >=1.5.2,<1.6.0a0 conda-forge
llvm-openmp 16.0.6 h4dfa4b3_0 zstd >=1.5.2,<1.6.0a0 conda-forge
mysql-libs 8.0.33 hca2cd23_0 zstd >=1.5.2,<1.6.0a0 conda-forge
notably, llvm seems like it would get bumped to cos7...
Do we feel like it is finally time?
This may be the end indeed. Let's talk it over at the next dev meeting.
I'm all for bumping to cos7, but the `zstd` issue seems to be an update where the existing workaround at https://github.com/regro-cf-autotick-bot/zstd-feedstock/blob/1.5.5_hd39c66/recipe/install.sh#L7-L10 doesn't seem to work anymore. It's easy to patch by adding a `target_link_libraries(target -lrt)` in the CMake file.
Raising this issue to track and discuss when we want to drop CentOS 6 and move to CentOS 7 as the new default. This also came up in the core meeting this week
cc @conda-forge/core