conda-forge / conda-forge.github.io

The conda-forge website.
https://conda-forge.org

Dropping CentOS 6 & Moving to CentOS 7 #1436

Open jakirkham opened 3 years ago

jakirkham commented 3 years ago

Raising this issue to track and discuss when we want to drop CentOS 6 and move to CentOS 7 as the new default. This also came up in the core meeting this week

cc @conda-forge/core

h-vetinari commented 3 years ago

This came up in a numpy issue uncovered by testing the rc's of 1.21.0 for conda-forge - in particular, a test fails due to a bug in glibc 2.12 (not present anymore in 2.17).

There would be a patch to work around the bug, but @rgommers asked:

CentOS 6 is >6 months past EOL. Official EOL really should be the last time to drop a constraint. The patch for working around this problem seems fine to accept, but exercises like these are a massive waste of time. For manylinux we're also fine to move to glibc 2.17 (= manylinux2014), for the same reason. What's the problem here exactly?

I brought this comment into the feedstock PR, where @xhochy noted:

I think we should in general move conda-forge to cos7 but here is probably not the right place to discuss this. Probably we already have an issue for that.

Hence moving it here.

h-vetinari commented 3 years ago

Also xref #1432

h-vetinari commented 3 years ago

Was this issue discussed further at recent core meetings (I've occasionally seen public hackmd notes, but no idea where to find a collection of them)?

Any statistics or arguments that go against doing this?

Assuming it should be done, this probably needs a migrator (for adding sysroot_linux-64 2.17 everywhere?). I'd be willing to help, but would need some more pointers.
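
For concreteness: my understanding (from the current docs) is that the per-feedstock opt-in such a migrator would automate is roughly the following addition to recipe/conda_build_config.yaml, usually paired with a CentOS 7 docker image via conda-forge.yml -- a sketch, not a definitive recipe:

# recipe/conda_build_config.yaml -- request the CentOS 7 (glibc 2.17) sysroot on linux-64
sysroot_linux-64:   # [linux64]
  - "2.17"          # [linux64]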

jakirkham commented 3 years ago

It was. Nothing conclusive yet. We collect the meeting notes here

Informally we know there are still some CentOS 6 users (the long tail of support). That said, we do lack statistics either way. So this is something we discussed, namely how best to collect them.

Yeah I think we need to decide this is something we want to do first, which we haven’t done yet

h-vetinari commented 3 years ago

I understand that some people are stuck on EOL'd OSes, but IMO the case to hold back based on that is really tenuous. If you're on an EOL OS, you eventually get no software updates anymore - why should conda-forge go out of its way to still service those users?

I have to agree with @rgommers' statement (quoted above) - stuff like https://github.com/numpy/numpy/issues/19192 has a real cost. It probably tied up 10-20h of maintainer (resp. core contributor) time in total, and would have been completely avoided without an ancient glibc.

h-vetinari commented 2 years ago

Another datapoint: I now have a staged-recipes PR that cannot build because the GPU-build only has glibc 2.12 (pytorch >=1.8 needs 2.17), and the CentOS7 build doesn't start: https://github.com/conda-forge/staged-recipes/pull/16306

isuruf commented 2 years ago

That's not a datapoint. We've documented this in our docs on how to use CentOS7.

h-vetinari commented 2 years ago

That's not a datapoint. We've documented this in our docs on how to use CentOS7.

I know how to do it per-feedstock, but the above package cannot currently make it through staged-recipes, or at least I'll need help to pull it off. Someone could also merge it and I'd fix things once the feedstock is created. But it's um... suboptimal... and definitely related to CentOS6, so I'd still call it a datapoint.

isuruf commented 2 years ago

I know how to do it per-feedstock, but the above package cannot currently make it through staged-recipes, or at least I'll need help to pull it off.

Have you tried doing the same in staged-recipes? It should work.

chrisburr commented 2 years ago

It does work on staged-recipes, see here for an example (CentOS 6 fails as expected but the CentOS 7 based job passes and the feedstock is generated correctly thanks to the conda-forge.yml in the recipe directory.)
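
For reference, the recipe-local conda-forge.yml that achieves this is tiny -- something along these lines (a sketch; the path is illustrative, and os_version is the documented key that gets carried over into the generated feedstock):

# recipes/<package-name>/conda-forge.yml
os_version:
  linux_64: cos7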

That said, I am noticing more and more places where CentOS 6 issues are appearing, and moving a feedstock to CentOS 7 means the downstream feedstocks also need to be changed, requiring yet more manual intervention.

h-vetinari commented 2 years ago

In the last few weeks, I've probably spent upwards of 15h chasing down bugs that ended up being resolved by moving to CentOS 7. This is a real cost. The same goes for less experienced contributors who run into cryptic resolution errors when trying to package something that (now) needs a 2.17 sysroot, and end up abandoning their recipes.

@jakirkham: Informally we know there are still some CentOS 6 users (the long tail of support).

Can we quantify this? CentOS 6 has been EOL for a year now. Why are we so beholden to that long tail? Are those parties contributing to conda-forge somehow (infra costs or packaging effort)? If not, why are we providing free support longer than even RedHat? More to the point: why do we accept them externalizing their costs for not dealing with 10+ year old software to conda-forge?

That said, we do lack statistics either way. So this is something we discussed. Namely how best to collect them

If it takes X months to collect those statistics, that is a bad trade-off IMO.

chrisburr commented 2 years ago

@conda-forge/core Does anyone have any objections to changing the default sysroot to CentOS 7? If not, I'll make PRs to change it early next week.

beckermr commented 2 years ago

I know of users this will impact.

What exactly is the problem with our current setup?

chrisburr commented 2 years ago

I also know users who this will affect, including myself. I also know people using CentOS 5-like systems with conda, who will continue to do so for at least the next decade, so we can't wait until nobody is using CentOS 6 anymore.

What exactly is the problem with our current setup?

Over the last 6 months, hundreds of hours must have been spent dealing with these issues, and I'm not convinced hundreds more should be spent over the next six months. For people really stuck on CentOS 6 we could add a global label (like gcc7 and cf202003), or they can go around forcing the old sysroot using the same mechanism as we currently use for upgrading to CentOS 7, if they really need to.

beckermr commented 2 years ago

Global labels don't get repodata patching, which at this point would likely render the channel wrong.

h-vetinari commented 2 years ago

100% agree with what @chrisburr wrote. There are also some pretty gnarly bugs in the trigonometry functions of glibc < 2.17 that have bitten me at least 3 times already.

@beckermr: I know of users this will impact.

And they can keep using old packages, or use paid support for their ancient platforms. I empathise that there are some people between a rock and a hard place, but again:

why do we accept them externalizing their costs for not dealing with 10+ year old software to conda-forge?

Those 100s of hours Chris is mentioning might be "free" but they come at the cost of other things not being improved or fixed or packaged, and barring strong countervailing reasons, that's IMO a horrible trade-off to make against the ecosystem in favour of an unspecified handful of people who cannot manage to run less-than-decade-old software, yet need the newest packages.

beckermr commented 2 years ago

Many folks stuck on an older centos are not there by choice. They are constrained by the lack of upgrades on big systems run by government labs, etc. The idea that they can simply pay for support is a non-starter to anyone who works in or understands how those organizations work.

I am bringing this up because the remedies for using cos6 that folks keep bringing up here are not really available to the people that need cos6.

We are making a choice to leave them behind when a majority of the software we build does not require cos7 at all.

I suspect a much better path would be to further improve our support for cos7 in smithy or our bots.

leofang commented 2 years ago

Many folks stuck on an older centos are not there by choice. They are constrained by the lack of upgrades on big systems run by government labs, etc.

If you are referring to DOE labs, last time I heard the BES office demanded a thorough upgrade from its facilities due to cybersecurity concerns (cc: @mrakitin), and I assume similar mandates have also been issued by other offices.

alippai commented 2 years ago

@beckermr the legacy software on the legacy systems will keep running even if conda-forge starts building on CentOS 7. CentOS 6 was released literally 10 years ago. Government labs running inefficient HW and SW stacks is not something anyone should encourage or promote. That hurts the economy, research and the environment. Those systems cost everyone time and money (along with conda-forge people and contributors). My understanding is that both build performance and the performance of the built libs differ between CentOS 6 and 7 - isn't that true?

beckermr commented 2 years ago

Thanks for the responses everyone!

I don't see anyone addressing directly the points I raised. The cost here is the time for folks who need cos7 and don't know it when they are building a package. They see an odd error and it costs them time to track down. I 100% agree that this cost is real.

Moving the default to cos7 is one way to reduce this cost. However, it is not the only way. My premise is that, given the headache this will cause for cos6 users in general, and the fact that cos7 is not required the majority of the time, we're better off improving the tooling around cos7 so that maintainers can better use it.

chrisburr commented 2 years ago

Global labels don't get repodata patching, which at this point would likely render the channel wrong.

Good point, I forgot about this. Hopefully the __glibc constraint can be good enough to allow people to keep using the channel. 🀞

I suspect a much better path would be to further improve our support for cos7 in smithy or our bots.

This might be an option, but I'm not sure it's easy to do the "right" thing, and it might not even be possible. How do you see this working? I have two ideas, and I think I would lean towards option 1 for simplicity.

Option 1

The bot automatically migrates downstream feedstocks as soon as an upstream feedstock moves to be CentOS 7-only.

Option 2

Try to be smarter and use solvability as a constraint i.e.

I'm not sure how stable it would be, and I suspect there are a lot of unstable edge cases. In particular, what happens if both CentOS 6 and CentOS 7 are unsolvable?

isuruf commented 2 years ago

Option 3

Change the default docker image to cos7 for all feedstocks, but keep the sysroot at cos6. This would remove the solver errors.
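
Concretely, this would amount to roughly the following in the global pinning (a rough sketch only; the real conda_build_config.yaml uses per-platform selectors and covers more images/architectures):

# conda-forge-pinning conda_build_config.yaml (sketch)
docker_image:
  - quay.io/condaforge/linux-anvil-cos7-x86_64   # build and test inside a CentOS 7 userspace
sysroot_linux-64:
  - "2.12"                                       # but still compile against the CentOS 6 glibc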

beckermr commented 2 years ago

Option 1 is not really correct. If python itself was cos7 only, we wouldn't need cos7 constraints/builds of noarch python packages.

Option 2 I am not following.

The core issue here as I understand it is that people are having trouble recognizing errors as being due to not having cos7 and when they do, they are having trouble enabling the cos7 build.

beckermr commented 2 years ago

I think we could add an admin command that would convert a recipe to cos7 automatically in a pr. This should work all the time except for gnarly cases around cuda. It would more or less solve the issues around provisioning cos7 on feedstocks.

For recognizing errors as being due to cos7, that is a much tougher problem.

chrisburr commented 2 years ago

I like option 3, provided we can trust the majority of packages to be using the correct sysroot (i.e. the one from $BUILD_PREFIX and not /)?

beckermr commented 2 years ago

Good point @chrisburr. We might get another class of errors if the docker image doesn't match and the builds are wrong. Otoh those builds are actually wrong and so likely should be fixed anyways.

isuruf commented 2 years ago

@chrisburr, there's no sysroot in / in our docker images (unless there's a yum_requirements.txt file)

mbargull commented 2 years ago

Option 3 has been working well for us at Bioconda since March/April or so. (I.e., using quay.io/condaforge/linux-anvil-cos7-x86_64 as the base but not requiring sysroot_linux-64=2.17 by default.)


One minor thing to keep in mind is that we then don't automatically test our packages on a CentOS 6 user space anymore -- which isn't necessarily a bad thing, because we then test on CentOS 7's user space, which we didn't before. (The corner cases where testing on COS 6 would matter I expect to be very few.)


Two things I'd like to know:

  1. Do we want the docs to carry a (reasonably small) list of common compile errors, to give people pointers for when they'd need sysroot_linux-64>=2.17? (E.g., I encountered O_PATH not being defined a couple of times; man 2 openat|grep 'O_PATH.*since' says O_PATH (since Linux 2.6.39).)
  2. Should we add something to the docs to avoid package maintainers being lured into having to maintain patches for old system support? (I.e., if one demands such support then they should be ready to offer maintenance work themselves.)

(I guess 2. is just me being too protective and we probably/hopefully don't need it. I'd just like to avoid maintainers having to argue about fulfilling the needs of a small/shrinking user base.)

h-vetinari commented 2 years ago

The idea that they can simply pay for support is a non-starter to anyone who works in or understands how those organizations work.

I understand how these organisations work, but the point is why does it fall to conda-forge to bridge that gap, and more importantly, who decides that supporting these users is worth all the lost time and opportunity cost for everyone else.

Personally, I find it ~preposterous~ comically bizarre for a small band of volunteers to try to provide free maintenance longer than a behemoth enterprise like RedHat.

I don't see anyone addressing directly the points I raised.

I have trouble with that statement, because neither are you acknowledging any of the costs beyond the cryptic resolver errors. Here are some examples (1 2 3 4) of things that consumed hours of debugging and upstream maintainer time just because the feedstocks were on cos6 (i.e. the errors simply went away with cos7). I'm sure other people are hitting such issues as well - how many will not find the magic answer and just give up?

For example, how would anyone guess that the following is due to using an old glibc (for me it was sheer luck and persistence)?

E       AssertionError:
E       Arrays are not equal
E
E       Mismatched elements: 2 / 108 (1.85%)
E       Max absolute difference: 1.
E       Max relative difference: 0.04761905
E        x: array([[[ 0.,  1.,  2.,  3.,  4.,  5.],
E               [ 6.,  7.,  8.,  9., 10., 11.],
E               [12., 13., 14., 15., 16., 17.],...
E        y: array([[[ 0,  1,  2,  3,  4,  5],
E               [ 6,  7,  8,  9, 10, 11],
E               [12, 13, 14, 15, 16, 17],...

Maintaining feedstocks and debugging weird errors is enough work as it is, and accepting the possibility of such insanely hard-to-fix bugs is a huge cost. Vague gesturing that some organisations cannot move on is IMO not nearly a compelling enough argument for that, and there should be more transparency in that decision-making process.

isuruf commented 2 years ago

I understand how these organisations work, but the point is why does it fall to conda-forge to bridge that gap, and more importantly, who decides that supporting these users is worth all the lost time and opportunity cost for everyone else.

Personally, I find it preposterous for a small band of volunteers to try to provide free maintenance longer than a behemoth enterprise like RedHat.

That's because core maintainers like @beckermr care about the older sysroot for their day job and I care about ppc64le for my day job. Unless you would like to lose core maintainers, let's not call it preposterous.

h-vetinari commented 2 years ago

That's because core maintainers like @beckermr care about the older sysroot for their day job and I care about ppc64le for my day job.

That's a very different kind of argument than appeared so far.

Unless you would like to lose core maintainers, let's not call it preposterous.

Certainly not! (though how one implies the other is really not obvious either). In any case, I've replaced that word - which was aimed at the sheer ambition, not anyone individually. Apologies if that was ambiguous.

beckermr commented 2 years ago

Thanks for the comments @h-vetinari!

I want to reiterate that I fully acknowledge the frustrations you and others are having. When I said, "They see an odd error and it costs them time to track down," I meant more than solver errors, but also missing symbols/defines, and any other errors (like the one above).

Nobody, including me, is asking you or anyone else to go around and maintain cos6 compatible software through patches etc. Feedstock maintainers are free to turn on cos7 and move on should they desire. What I am asking of everyone here is to not force cos7 on the full ecosystem (i.e., other feedstocks) and to at least be understanding of the desires of others who are working within real external constraints around cos6. If someone shows up to your feedstock and is willing to maintain cos6 stuff because they need it, I would hope that we would all be generous and not actively prevent them from doing so.

I want to say also that I, with @isuruf, added cos7 to start with. One of the motivations there was to help support users that I work with in my day job. Isuru is right also that people I work with in my day job depend on cos6, including me. So in fact I depend on both! Shipping widely compatible software is really hard it turns out. :)

dopplershift commented 2 years ago

I've already had to explain to users of a base Anaconda install why only some of their environments work on cos6 (the fully conda-forge ones) and others (the ones pulling from Anaconda) complain about GLIBC. So whatever we do, we shouldn't cause breakage to those users, they should just slowly stop seeing updates.

h-vetinari commented 2 years ago

Thanks for the response @beckermr!

If someone shows up to your feedstock and is willing to maintain cos6 stuff because they need it, I would hope that we would all be generous and not actively prevent them from doing so.

Absolutely, I would never stand in the way of that. In fact, I go out of my way to keep base packages like numpy/scipy on cos6, because having those on cos7 would spread "virally" to a lot of the ecosystem.

What I am asking of everyone here is to not force cos7 on the full ecosystem (i.e., other feedstocks) and to at least be understanding of the desires of others who are working within real external constraints around cos6.

My first reflex would be to suggest to default to cos7, but allow feedstocks to opt-in to cos6. However, I realise that this is probably not feasible for that segment of users to go hunting down all feedstocks they need and their dependencies...

With that in mind, how about adding a migrator such that the CI runs with both cos6 & cos7 by default? That way, we would solve the debugging issue (if a maintainer sees that cos7 passes where cos6 fails with the exact same recipe, they can either drop cos6 or know where to start debugging). Yes, this would increase the CI usage, but at least that way the burden of bridging those gaps would fall on machines and not on humans (and to my knowledge, c-f is not exposed to volume-dependent pricing on azure pipelines).

isuruf commented 2 years ago

What are your thoughts on option 3 above? With that, there are no more solver issues. Runtime bugs like the numpy issue will not be there. Only compile errors.

h-vetinari commented 2 years ago

What are your thoughts on option 3 above? With that, there are no more solver issues. Runtime bugs like the numpy issue will not be there. Only compile errors.

I guess I had not fully appreciated the beauty of this option. If indeed that means we compile with ~cos6~ old glibc but test with cos7 in the same CI job, then I'm all for it.

Regarding resolver errors, I'm not sure I understand though - however we "keep the sysroot at cos6" - if a lower bound of a build dependency moves up such that only cos7 builds are available anymore, isn't that going to lead to the same cryptic errors?

isuruf commented 2 years ago

Regarding resolver errors, I'm not sure I understand though - however we "keep the sysroot at cos6" - if a lower bound of a build dependency moves up such that only cos7 builds are available anymore, isn't that going to lead to the same cryptic errors?

Nope.

beckermr commented 2 years ago

I went to make sure I understand the answer to this question. Cos6-built code is ABI compatible with cos7, but not the other way around. So if a host dependency needed for linking is cos7-only, then as long as we are using a cos7 container, we can make the host environment. At compile time, we get the cos6 ABI in the package we build. Then at run/test time, we are again in a cos7 container, and so everything links properly (because cos6-built code can link against cos7 system packages).

This is a pretty clever build setup!

isuruf commented 2 years ago

Looks correct. The only issue is static builds: you'll need a cos7 sysroot in all downstream projects if we are using static libraries.

beckermr commented 2 years ago

Is there a mutex-like package we should be attaching to static libraries to ensure this happens, or at least throws an error? Or, given the CFEPs we have on static libs, are we going to declare this to be outside our realm of support?

isuruf commented 2 years ago

For static libraries, you can add sysroot_linux-64 2.17 as a run_constrained, but this is a very rare case that we don't have to bother with.

beckermr commented 2 years ago

Ahhhhh yes. Use the sysroot as the mutex itself. We should at least document this as a way to do things even if we don't go around making it happen or actively supporting it.
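
For the record, a minimal sketch of what that run_constrained mutex could look like in the meta.yaml of a feedstock shipping static libraries (hypothetical fragment; the exact pin spelling would need to be agreed on):

requirements:
  run_constrained:
    # the sysroot acts as the mutex: anything installed next to these static
    # libraries must use a CentOS 7 (glibc >=2.17) sysroot
    - sysroot_linux-64 >=2.17   # [linux64]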

chrisburr commented 2 years ago

There seems to be a consensus so I've (hopefully) implemented "Option 3" in https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/2241/files

h-vetinari commented 1 year ago

I'm running into a problem with llvm openmp 16 that looks like it might be testing the limits of what our current setup can handle.

openmp needs a glibc newer than 2.12 for its assumptions about what's in #include <inttypes.h>, but even if I switch to the 2.17 sysroot, I then get: undefined reference to 'memcpy@GLIBC_2.14'. I think it is due to this, and it sounds to me like that change in memcpy behaviour might be deep enough that we really need to have everything compiled against 2.17?

Perhaps @isuruf has another ace up his sleeve though? Just wanted to note that openmp >= 16 currently looks unbuildable both with and without sysroot_linux-64 =2.17.

hmaarrfk commented 1 year ago

I tried to push some changes that force cos7 for openmp 16.

FWIW, opencv just moved to COS7 with the release of 4.7.0. https://github.com/conda-forge/opencv-feedstock/pull/346

I think we are well beyond the life cycle of COS6, and many packages I've seen attempt to use newer features more and more: https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux#Version_history_and_timeline

beckermr commented 1 year ago

We've been putting this off for a long time. I'd advocate we continue to do so and not switch until absolutely necessary. We should understand what the exact issue is here before we proceed.

hmaarrfk commented 1 year ago

I would like to ask for guidance on what to do about clock_gettime. It seems that it is provided by glibc itself as of 2.17; before that it lives in librt. However, since we want to support COS6, we shouldn't really "update" to COS7.

Should we add the -lrt flags? https://github.com/conda-forge/zstd-feedstock/pull/67

hmaarrfk commented 1 year ago

@beckermr zstd seems to be hitting the need to update to cos7 -- https://github.com/conda-forge/zstd-feedstock/pull/71

While we could likely patch things away, it seems like busy work.

In my present environment, the following packages depend on zstd:

$ mamba repoquery whoneeds zstd

 Name        Version   Build              Depends               Channel     
─────────────────────────────────────────────────────────────────────────────
 blosc       1.21.4    h0f2a231_0         zstd >=1.5.2,<1.6.0a0 conda-forge 
 boost-cpp   1.78.0    h6582d0a_3         zstd >=1.5.2,<1.6.0a0 conda-forge 
 c-blosc2    2.9.3     hb4ffafa_0         zstd >=1.5.2,<1.6.0a0 conda-forge 
 curl        8.1.2     h409715c_0         zstd >=1.5.2,<1.6.0a0 conda-forge 
 imagecodecs 2023.1.23 py39h9e8eca3_2     zstd >=1.5.2,<1.6.0a0 conda-forge 
 libcurl     8.1.2     h409715c_0         zstd >=1.5.2,<1.6.0a0 conda-forge 
 libllvm15   15.0.7    h5cf9203_2         zstd >=1.5.2,<1.6.0a0 conda-forge 
 libnetcdf   4.9.2     nompi_h0f3d0bb_105 zstd >=1.5.2,<1.6.0a0 conda-forge 
 libsystemd0 253       h8c4010b_1         zstd >=1.5.2,<1.6.0a0 conda-forge 
 libtiff     4.5.1     h8b53f26_0         zstd >=1.5.2,<1.6.0a0 conda-forge 
 llvm-openmp 16.0.6    h4dfa4b3_0         zstd >=1.5.2,<1.6.0a0 conda-forge 
 mysql-libs  8.0.33    hca2cd23_0         zstd >=1.5.2,<1.6.0a0 conda-forge

notably, llvm seems like it would get bumped to cos7...

Do we feel like it is finally time?

beckermr commented 1 year ago

This may be the end indeed. Let's talk it over at the next dev meeting.

isuruf commented 1 year ago

I'm all for bumping to cos7, but the zstd issue seems to be an update where the existing workaround at https://github.com/regro-cf-autotick-bot/zstd-feedstock/blob/1.5.5_hd39c66/recipe/install.sh#L7-L10 doesn't seem to work anymore. It's easy to patch by adding a target_link_libraries(target -lrt) in the cmake file.