Could you elaborate on this point a bit more, @msarahan and/or @njsmith?
I don't know that conda would necessarily want to follow the exact evolution of the RH ecosystem, but Fedora, RHEL, and CentOS are 3 closely related systems that share lots of engineering work but make different kinds of trade-offs and target different market segments -- Fedora as a fast-moving project with an emphasis on free software ideals and community-led governance (but with support from RH for infrastructure like servers and legal compliance), which also serves as a beta testing ground for RHEL; RHEL as the commercially supported ultra-stable enterprise platform; and CentOS as a free version of RHEL without the commercial support, for those who want a slow-moving enterprise-y product and to take advantage of RH's QA, but don't want to pay for it. (There's definitely a desire in the community for a "CentOS Anaconda" -- cf. the repeated complaints about the closed build recipes for core packages.) All together these form a neat ecosystem in which the different parts support each other -- e.g. CentOS might sound like a competitor to RHEL, but in fact it's RH themselves who manage it, and they see it as a kind of loss leader that gets people into their ecosystem, makes it easier for third parties to support RHEL (see e.g. the use of CentOS docker images for building Anaconda packages!), and soaks up the cheap customers so that RHEL can focus on the much more lucrative enterprise market.
...I'm actually a bit surprised that I haven't heard anything about Continuum aggressively trying to poach members of RH senior management, the business model parallels are really strong both in terms of the specific distro stuff + the emphasis on contributing upstream to community OSS projects, and RH is the best in the world at making that business model work both on the money side and the community side :-). Maybe (probably) I'm just not in those conversations...
Ok, thanks for clarifying.
> I'm actually a bit surprised that I haven't heard anything about Continuum aggressively trying to poach members of RH senior management...
If they weren't thinking it before... :)
@jakirkham This is definitely an interesting project. You're really taking it to the next level. I'm intrigued.
Quick question: how are you handling different recipes for different versions? For example, the recipe to build cmake 3.5.0 will be different than the recipe to build cmake 3.3.2.
I maintain a separate recipe for each version. Sometimes I even need separate recipes to build against numpy 1.8 vs 1.9 vs 1.10.
cc @stuarteberg @ukoethe
> Quick question: how are you handling different recipes for different versions? For example, the recipe to build cmake 3.5.0 will be different than the recipe to build cmake 3.3.2.
>
> I maintain a separate recipe for each version. Sometimes I even need separate recipes to build against numpy 1.8 vs 1.9 vs 1.10.
That's yet to be fully fleshed out. The phraseology to date has been "one recipe, one repository", but from the very beginning, in my head, this has really been more like "one package, one repository". In precisely the same way as one would manage two versions of the same software in a repo, we can manage two versions of a recipe in a repo: with branches. GDAL-feedstock was the first feedstock to make use of branches in this way, and in truth we haven't yet followed that through into the infrastructure (e.g. are the maintainers of the feedstock the union of the maintainers in the various recipes, etc.).
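To make that concrete, here is a hypothetical sketch of the branch layout (versions borrowed from the cmake example above; none of this is settled policy):

```yaml
# recipe/meta.yaml on master -- tracks the current release:
package:
  name: cmake
  version: 3.5.0

# On a hypothetical "3.3.x" maintenance branch, the same file would
# instead pin the older release (plus any build tweaks it needs):
#
#   package:
#     name: cmake
#     version: 3.3.2
```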
I opened an issue ( https://github.com/conda-forge/conda-forge.github.io/issues/50 ) to discuss the versioning point more and come up with a standard for solving this kind of problem.
If there is a (very out-of-date) recipe for a package currently in conda-recipes, do you want me to leave a note there when I add it to conda-forge?
I have tried to answer your question here, @mwcraig, because it feels like a policy/community-direction question that is closely related to the transition and where we go from here. So I didn't want it to get lost in some unrelated merged PR. Sorry it has gotten so long; it just got me thinking about how we move forward. :smile:
Here are my thoughts on it and on some things related to this transition. Other people may have thoughts on this, as well. It would be good if we can figure out the right way forward on the movement of recipes here (from conda-recipes and possibly other sources) and how to provide that information to others (particularly in terms of volume).
When I move and update a recipe from conda-recipes (or anaconda-recipes) to here, I try to follow these guidelines. As part of that, I notify people who have modified the recipe over its history, because they may be interested in the package it builds, e.g. because they use it as a dependency for something. In addition, I may try to notify a core maintainer or two of the project. This process helps to generally increase awareness of conda-forge (and in some cases conda) and what we are trying to do here. Also, it allows people to become aware of how the conda package management ecosystem is changing. Finally, it gives people an opportunity to take a larger role in how packages that are important to them get distributed: by becoming a maintainer, simply submitting patches to improve the build, filing issues about how the package can be improved, or (in the case of core maintainers) sending notifications when new releases are coming out so that we can get them in here quickly. All of these help improve the package management ecosystem here, which should in turn benefit the community.
Admittedly, the strategy above (of notifying a few potentially interested people) is good for slow but consistent growth. At this point, we have 165 packages maintained here (at least according to the conda-forge channel) and the number is continuing to grow. We now have 31 members: some from Continuum, some from the Python community with various interests, and a few who have little if anything to do with Python but have become interested as this transition has occurred. This allows us to continue to fine-tune the performance of our infrastructure (something we have been doing a fair bit of), experiment with things (e.g. alternative Python distributions, use of various compiler features, etc.), and discuss various approaches to interesting and challenging problems in our unique form of package management (e.g. compiler optimizations, runtime selection of AVX and SSE optimizations, API implementation selection, etc.). Moreover, our rate of growth hasn't forced us to make hard decisions on these without taking time to consider the options and how best they might be approached. While the right rate of growth is certainly up for discussion, IMHO we are growing at a reasonable rate.
The reason I mention the rate of growth here is that it affects how we de-dup conda-recipes; namely, different strategies for de-duping will have different effects on how quickly our community grows. That being said, we should probably figure out how de-duping is going to occur between here and conda-recipes, as maintaining two versions is of no benefit to anyone and a bit confusing too. Here are some options that have been considered, and some other ones I am now thinking of, which I have cobbled together into a rough plan that would happen over time (though feedback is definitely welcome and by no means am I saying we need to commit to this). Roughly: (1) leave a deprecation note on each recipe that has moved; (2) direct new recipe submissions here instead of to conda-recipes; (3) replace deprecated recipes in conda-recipes with the corresponding feedstocks as git submodules; (4) eliminate conda-recipes entirely. Maybe some combination of these is the right solution. There are probably more, as well.
By doing (1), the user is made aware on a per-recipe basis that we have shifted the recipe over and that further changes should be made here. While this helps, it's a bit localized and doesn't address the numerous PRs being added to conda-recipes for new packages. Combined with the existing pings for this movement, it should draw around the same number of people here, maybe a few more (those who were going to make some modification to a moved recipe).
Doing (2) means informing users that they should be adding new packages to conda-forge, not conda-recipes. This gives them a good chance of getting binaries (something they likely want), even on platforms they may be unable to build on themselves. The low barrier to entry will be particularly nice for them. However, we need to keep our eyes open for abandonment; having a large swath of unmaintained recipes is bad for everyone. This would definitely increase traffic (at least from those that read :wink:). So, we probably want to make sure things have mostly settled down here (a significant chunk of the packages have moved; guessing half, maybe a little less) before we explore that.
Doing (3) is a bit tricky (which I will explain): it means replacing deprecated recipes with feedstocks as git submodules. As feedstocks don't have recipes in the top-level directory, but one below it, they are a little difficult to use in recursive builds. If we can tweak `conda build` to correct for this issue, then (3) will be more reasonable. This may seem redundant compared to the other steps (particularly (1)), as (3) is the biggest attention grabber that suggests things have moved and immediately links the user to their new location. That, combined with the technical issues, is a reason to hold off on it until we are ready for that level of traffic.
Finally, at some point, we may want to eliminate conda-recipes (4). However, this may depend on whether (3) can be accomplished successfully and how confusing it is. We will need to have some sort of deprecation notice on the conda-recipes README. Anyone we were going to draw in will already be here, so things should be pretty stable at that point, and we may already have most of conda-recipes here.
This is all up for discussion and none of it is set in stone; it is just something that I felt like sharing. We need to deprecate conda-recipes, but we need to do it with an eye towards how well we can absorb the demand here.
Thoughts? Questions? Feedback? Is it all totally wrong? :stuck_out_tongue_winking_eye:
> Thoughts? Questions? Feedback? Is it all totally wrong? :stuck_out_tongue_winking_eye:
That was a long comment! :smile: I completely agree with the growth - you've been an invaluable ambassador for conda-forge over the last few weeks, and many of the (IMHO impressive) 31 contributors are down in no small part to you :+1:
I'd like to explore option 1 some more, as I think that is the only way we can truly maintain community recipes which are tested on the platforms they claim to work for.
While thinking about this on my own before I found time to read your comment and the guidelines, I was leaning towards a request to add the package of interest (astropy) here and make a simultaneous pull request to delete the astropy recipe in conda-recipes, which is very badly out of date (its version is 0.2.x and astropy is up to 1.1.X).
I could see adding a deprecation note to the astropy recipe instead; the broader question about transitioning is more difficult.
Once we are confident the infrastructure can scale, I think an announcement to the conda and anaconda email lists from someone at Continuum outlining the future of conda recipe hosting would be helpful, with the eventual elimination of conda-recipes as the end goal. A dashboard like the one at https://conda-forge.github.io//feedstocks.html could be used to point people to the correct repo for a particular package.
Part of the transition should include, at some point, turning off new PRs to conda-recipes, and getting the currently open PRs there either merged before migrating recipes or migrating the PRs.
In terms of the options you laid out I'm advocating for (1) short term, followed by (2) once we know what scales here.
Once a recipe works here I'd be inclined to delete it in conda-recipes, or replace the recipe there with a meta.yaml that just contains a link to the feedstock. A submodule would work too -- I don't know how widely conda-recipes is used for building large sets of packages.
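For illustration, the "meta.yaml that just contains a link" could be a deliberately non-buildable stub along these lines (the feedstock URL is a made-up placeholder):

```yaml
# conda-recipes/astropy/meta.yaml -- deprecation stub, not a real recipe.
# This recipe has moved; please send changes to the feedstock instead:
#   https://github.com/conda-forge/astropy-feedstock
about:
  summary: "DEPRECATED: this recipe is now maintained on conda-forge"
```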
Eventually (4) is necessary, I think. Given enough lead time (6 months or a year?) it shouldn't cause much disruption.
I don't think killing conda-recipes is the right way. Its contents should definitely live elsewhere, but conda-recipes itself is an important aggregation, and contains more than just Python packages (which are the primary focus of conda-forge at the moment). conda-recipes could also serve to collect recipes (submodules) from sources other than conda-forge, if any project wants to maintain their recipe themselves, outside of conda-forge. I'm in favor of 1 and 2 now, with 3 (with conda-build fixes) down the road a bit.
Given your thoughts on channel de-duping, I was curious if you had any thoughts on this, @mcg1969?
De-duping recipes in repos unfortunately does not solve the problem of binary de-duping when conda-forge includes a package from the default channel (e.g. matplotlib), or in the future when the default channel gets fed builds (or better recipes) from conda-forge. IMO this is a problem because, if it is not coordinated, it will at least lead to hard-to-track bug reports when it's not clear which version of a package is installed (the mpl packages currently have, AFAIK, different dependencies in default and conda-forge). In the worst case, it will lead to incompatible packages.
I think the default should be to install from default, so unless something drastic comes up (= a bug in default), packages with the same upstream version should get installed from default, but higher upstream versions should be preferred from wherever they come from.
Therefore I would like to propose this scheme (essentially the Debian backports scheme):

Append `cf` to the build string of all packages by default (= manual work :-( ). If there is a reason to prefer the package from the conda-forge channel, then the build string should be changed to `1cf` (or `1.cf`?), if `1` would be the next build string in the default channel. conda-forge-internal builds increment by appending a number: `cf1`.
This would result in the behaviour shown in the examples below.
For this to work, all packages present both in default and in conda-forge need to have the default channel's recipe available; otherwise the same problem as with the current mpl situation arises...
So another "policy" would need to be:
One of the recipe versions is "upstream": the "taker" should only modify the build-string and add patches to fix bugs in the package but not change the "spirit" of the recipe (e.g. remove/add dependencies to change functionality). Bigger changes should be done in the "upstream" repository.
Some examples:
package | default | conda-forge | default (after) | explanation |
---|---|---|---|---|
matplotlib | 1.5.0 | 1.5.1.cf | -- | just the conda-forge copy of matplotlib, to get new upstream versions earlier -> when default catches up, the default channel is preferred |
matplotlib | 1.5.1 | 1.5.1.cf1 | -- | default catches up, but has a different recipe -> conda-forge needs to release a new package to catch up -> `1` after `cf` |
matplotlib | 1.5.1 | 1.5.1.1cf | -- | a fix for the package in default (`1` in front of `cf`); conda-forge is preferred until default has a new version (either upstream or with a higher build string) |
whatever | -- | 1.1.cf | 1.1 | default gets the recipe from conda-forge and removes the `cf` build string; the package is sorted higher and is now installed from default |
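As a sketch of what the second row would mean in recipe terms (untested; the exact spelling of the string is precisely what is being proposed here), the conda-forge copy could set its build string explicitly in `meta.yaml`:

```yaml
package:
  name: matplotlib
  version: 1.5.1

build:
  number: 1
  string: cf1   # conda-forge-internal rebuild: "cf" marker plus counter
```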
After reading all the various suggestions which involve name mangling and custom version numbers, I'm thinking that @janschulz's original suggestion of having a separate channel for conda-forge packages which are also present in default channel seems like a great solution. If we moved all the duplicated packages into a new channel, say conda-forge-core, then users would need to explicitly add that channel or specify it in a conda install command.
> If we moved all the duplicated packages into a new channel, say conda-forge-core, then users would need to explicitly add that channel or specify it in a conda install command.
I don't like the idea of more channels. IMO our goal should be quite the opposite: improve the communication to get fixes/updates/patches from conda-forge into the default channel. We don't have a concrete example of that happening right now, but @msarahan and others are present here and monitoring the activity. I see that as a win.
We do have a different problem regarding the same package and version/build number. I think that must be fixed in `conda`. All we can do for now is to bump our build number to a higher value than the default channel's to avoid conflicts.
I am closing this issue as I believe we already know what to do when submitting a package that is already in the default channel: just write in the PR the reason why you are submitting the package to conda-forge (e.g. a new patch to solve X, missing dependencies, latest version, etc.).
@ocefpaf This solution is not enough when Continuum starts to import packages from conda-forge into default.
Please see this PR ( https://github.com/conda/conda/pull/2323 ), which is trying to better address channel conflicts.
> @ocefpaf This solution is not enough when Continuum starts to import packages from conda-forge into default.
Why not? If they keep up the pace, we can just drop our version. If not, we can keep on releasing and hope that conda/conda#2323 will allow the two to live happily together.
Because packages will end up with different things in them but exactly the same version numbers (as long as Continuum does not start importing the binary packages, which IMO is not a good idea, as they would have to trust every member of an org that already has 41 people in it). This will lead to things like one version having a fixed openssl included and the other version not, simply because of when the versions were built. It might only happen a few times per package, but when conda-forge has ~1000 packages, this adds up to a maintenance burden of hard-to-debug situations.
If this is made worse by having two different packages in these two channels (as is currently--or at least can be--the case with the mpl package), the result is an even greater nightmare...
I'm not saying the above enhancement is bad: it's actually great. But I think it's more addressing the problem of having a user channel and overwriting packages in the default channel with other ones, and not the problem of two versions having (almost) the same metadata.
A completely technical solution to the above problem would be if the build string could be split up into three parts: `old_build_number + setting-from-environment + new_one`. A repackager/taker can only touch the `new_one` (apart from new upstreams or bugfixes), the original recipe only touches the `old_build_number`, and the conda-forge scripts set an environment variable which fills the middle part with `cf`, while Continuum does not set it at all. On build, these get mangled into the normal build string, which then implements the scheme above. This would ensure that if a user has both channels included, they would get the "right" package (= whoever has the higher upstream version; on the same upstream version, the default channel wins). This happens without user intervention via pinning.
And you can see at first glance which packages came from the conda-forge channel, even if the user downloaded and installed the package manually.
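A rough sketch of how the middle part could be injected, assuming conda-build's jinja2 templating and a made-up `CF_MARKER` environment variable that only the conda-forge scripts would export:

```yaml
build:
  number: 0
  # old_build_number + marker-from-environment; the "taker" appends its
  # own counter inside the marker value (e.g. CF_MARKER=cf or cf1).
  # Without CF_MARKER the string is a plain "0", as on the default channel.
  string: "0{{ environ.get('CF_MARKER', '') }}"
```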
> Because packages will end up with different things in them but exactly the same version numbers
I understand that problem, and I don't think it is any different from the Linux distro repositories problem. And this is how they solved it: a big warning to any user that is adding any third-party repository. (I think that Continuum is really far behind in doing that, btw :wink:)
Together with the warning, they provide ways to set repo preference order, pin a package to a repo, or freeze a package from any updates.
> but I think it's more addressing the problem of having a user channel and overwriting packages in the default channel with other ones, and not the problem of two versions having (almost) the same metadata.
I did not take a close enough look at conda/conda#2323 to comment on it. However, I disagree that the packages have the same metadata: the origin is different, and that is part of the metadata (the most important part, IMO). I think that build strings are redundant, and the technical use you recommend will create unnecessary complexity.
fair enough :-) If it becomes a problem in the future, it can be solved then...
> Because packages will end up with different things in them but exactly the same version numbers
This is basically a problem you will get as long as there is more than one source for the same package.
But if Continuum pays attention to what we are doing (and they do seem to be), then they can increment the build number, and we're good to go.
Not a technical solution, but what can you do?
Also, if the default channel continues to be prioritized, then even if there are duplicate build numbers, users will get the "official" version by default, which is probably good, and at least predictable.
-CHB
Yes, please do offer feedback on https://github.com/conda/conda/pull/2323 . It is subject to improvement---both before we merge it and after. But conda-forge is definitely one of the reasons that PR was built.
My view of build strings is that they serve exactly one purpose: to prevent duplicate filenames---and, in doing so, to allow users to specify a specific build of a package when they need to. I don't think it is a good idea to endow them with any semantic content that the underlying solver must depend upon.
We should be relying on channels (now that 2323 is in the pipeline), dependency differences, and features to achieve differentiation. And if those are insufficient, we should come up with new metadata approaches. But the filename itself should be irrelevant to the solver.
That's not to say that the build string and filename can't be built from the metadata, however.
> Because packages will end up with different things in them but exactly the same version numbers
I didn't follow the discussion in detail, but would like to point out that it is possible to add version tags to version numbers in order to disambiguate variants, like `foo-1.2.3.tag1.propA` vs. `foo-1.2.3.tag2.propB`. Other packages can use these to refine requirements:

```yaml
requirements:
  run:
    - foo *.tag1*   # won't pick up *.tag2*
```
I don't claim that this is necessarily a good solution, but it might be another useful trick to address the ambiguity problem. The good thing about these tags is that one can take advantage of conda's powerful version comparison and resolution algorithms.
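And the producing side of the trick would just bake the tag into the version (names invented):

```yaml
package:
  name: foo
  version: 1.2.3.tag1.propA   # the tag distinguishes this variant from tag2
```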
I think that channels and subchannels in particular will become very powerful once something like 2323 is implemented. I think that may be the proper way to host multiple variants of the same package.
> I think that channels and subchannels in particular will become very powerful once something like 2323 is implemented. I think that may be the proper way to host multiple variants of the same package.
:+1:
Hmm...not sure I see how subchannels work or how that will fit into our infrastructure yet.
@gpilab, I came across your channel recently and noticed that we have a lot of overlap in terms of the packages we provide. Maybe you would be interested in getting packages from conda-forge. Also, as those packages are some of your dependencies, maybe being added as a maintainer of those feedstocks would be useful to you. I would be really interested in helping you find your way around conda-forge. Feel free to give me a ping. :smile:
@NLeSC @remenska, I noticed that you have a variety of interesting packages, some present here and some not yet (though we are eager to add them). Given this is quickly becoming the place to get packages that may not yet be packaged by Continuum, and we do the builds in automated VMs in very clean environments, I think you might benefit from adding some of your packages here. Also, feel free to sign up for packages that are valuable to your effort. If you need any help figuring out what is going on, please feel free to ping me and I will be happy to get you started. :smile:
I am not caught up on all the discussion around conda-forge, so I am not sure if this is the best place for this; sorry if not. I am very excited to see progress on this, great work!
With the new conda constructor it's really easy to make a custom conda distribution with custom packages from a conda channel. I just tested it with a file like this:
```yaml
name: centonda
version: 1.0.0

channels:
  - http://repo.continuum.io/pkgs/free/
  - https://conda.anaconda.org/conda-forge

specs:
  - python
  - conda
  - anyjson
```
At the moment you still need `http://repo.continuum.io/pkgs/free/` in the channels list to have `python` and `conda`, but you can see the idea: if these packages are on the `conda-forge` channel, it would be possible to create a distribution with community-created packages.
It would also be possible to make that custom distribution point to the `conda-forge` channel by default. Not as straightforward, but possible; see https://github.com/conda/constructor/issues/16.
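A guess at how that could look: the installed distribution could ship a `.condarc` in its install prefix so that conda-forge is consulted by default (whether constructor can place this file for you is essentially what the linked issue is about):

```yaml
# .condarc shipped with the hypothetical installer:
channels:
  - conda-forge
  - defaults
```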
Just wanted to mention this as a possibility because I haven't seen anybody discuss this option.
Neat idea, @danielfrg. I have scripts which already make the self-extracting tarballs (such as Miniconda is for Linux and OS X), but not a Windows installer. 👍
I think this probably deserves its own issue in this repo though. Happy to open it?
> if these packages are on the `conda-forge` channel, it would be possible to create a distribution with community-created packages.
Why wouldn't they be - this is a community packaging project 😉 😄
I opened a new issue in https://github.com/conda-forge/conda-forge.github.io/issues/90 for tracking.
> Why wouldn't they be - this is a community packaging project 😉 😄
Definitely! That's what I meant: a distribution with only community-created packages, all open :)
> all open :)
To be fair, this repo does now contain all of the anaconda recipes which are in the conda-build form: https://github.com/ContinuumIO/anaconda-recipes
But I still don't know if that is the canonical repository...
@pkgw, I noticed that you have a variety of interesting packages, some present here and some not yet (though we are eager to add them). Given this is quickly becoming the place to get packages that may not yet be packaged by Continuum, and we do the builds in automated CI VMs in very clean environments, I think you might benefit from adding some of your packages here. Also, feel free to sign up for packages that are valuable to your effort. If you need any help figuring out what is going on, please feel free to ping me and I will be happy to get you started. :smile:
(@JanSchulz brought this up in https://github.com/conda-forge/conda-forge.github.io/issues/16#issuecomment-182430891)
I agree with @JanSchulz that we should avoid, as much as possible, adding packages to conda-forge that are already available in the default channel.
However, we already have a few redundant packages (`pyproj`, `shapely`, `geos`, and more to come soon). The reason for the redundancy is that those packages are partially broken in the default channel. (And we could not find a proper channel of communication to send the recipe patches back to them.) Maybe, when fixing a default channel package, we should allow the package addition here as long as there is a plan to send that fix back to the default channel, and to remove the package from conda-forge once that happens.