@kyleabeauchamp, @jchodera, @jakirkham, @msarahan, @johanneskoester, @daler, @chapmanb, @jxtx, @jmchilton: please feel free to ping others and invite them :)
Definitely interested in learning more! For now, pinging @rmcgibbo, @mpharrigan, @cxhernandez, @marscher, @franknoe, @pgrinaway, @bas-rustenburg.
@bgruening thanks for the message! In fact we just discussed that yesterday!!
Conda-forge was born from two communities similar to bioconda and omnia (the SciTools and IOOS channels), with the goal of reducing redundancy and joining forces to produce high-quality recipes and binaries. I would love to see more communities join us here. We are not the dark side but we do have cookies :wink: (Well... a cookie cutter... sorry for the false advertisement.)
I am trying to put a blog post online next week with more info. We are also planning public (Google?) hangouts so we can have some online face-time and Q&A sessions.
Meanwhile, feel free to ask anything here, or in new issues if you have a very specific question.
Here is the gist of conda-forge:
- each recipe lists its maintainers in the `extra/maintainers` field;
- `conda-smithy`: we use that to lint the recipes, update the feedstocks, and it provides some convenience tools to work with the many-repos model.

There are many details I am leaving out and much more to talk about, but I will stop here for now.
The number one question we get is: why multiple repositories instead of one with all the recipes? We had (and still have) many discussions like this. However, all I have to say is: we tried the single-repo model and now we are trying the multiple-repos model. So far, the multiple-repos model has scaled much better, and none of the major worries we had came true.
This sounds great. @rmcgibbo is much more qualified to comment than I am here---he pioneered most of the omnia conda framework---but we ended up converging on our own build system (modeled loosely on conda/conda-recipes) simply because we weren't aware of any other way to tackle this.
Where should we look for all the gory technical details about the build systems and automation? This was the hardest part for us, since we needed (1) broad platform support (hence the use of a phusion/holy-build-box-64 build system for `linux`), (2) CUDA and OpenCL support (via the AMD APP SDK), and (3) automated builds in reproducible environments for `win`, `linux`, and `osx`. We're also trying to keep old versions of packages live for scientific reproducibility---we frequently publish code with our papers and provide `environment.yml` files to ensure reproducibility with identical versions. Our approach started with actual local hardware and evolved to use cloud services (currently travis-ci and AppVeyor).
I'd love to understand more about how the conda-forge build system differs from what we currently use in omnia's build system.
We are not the dark side but we do have cookies :wink:
Which ones? For humans or browsers? :laughing: Ok, it was terrible, but I had no self-control.
Yes, welcome all. :smile:
Please feel free to peruse what is going on at conda-forge and ask questions. The best place to get acquainted with or propose general discussion topics is probably the website repo (in particular the issue tracker). There are many issues there that are likely of interest and welcome healthy discussion of thoughts and personal experiences. Also, there may be a few closed issues there worth reading up on just to get a little bit of history (we are still quite young :wink:).
If you would like, feel free to submit a simple recipe or a few to get a feel for how everything works here. Also, feel free to check out our gitter channel for any generic questions you may have.
Once everyone has had a chance to get a feel for how everything works and what seems personally relevant, we can figure out meeting discussion topics in someplace TBD.
Again welcome.
Welcome @jchodera.
Where should we look for all the gory technical details about the build systems and automation?
This varies depending on the question. Let's try and direct you based on the points raised.
(1) broad platform support (hence the use of a phusion/holy-build-box-64 build system for `linux`)
This issue has basically moved in the direction of various proposals for how to move the Linux build system forward. Though there is a current strategy in place as well.
(2) CUDA and OpenCL support (via the AMD APP SDK)...
This is under active discussion. The reason is that this is tied to several issues, including build-system constraints, how features work, and how and which of these libraries get distributed. See this issue. There is a proposed example there of how we might get this to work. However, we haven't settled on anything yet.
(3) automated builds in reproducible environments for `win`, `linux`, and `osx`.
This is all over the map. :smile: In general, we use AppVeyor (Windows), Travis CI (Mac), and Circle CI (Dockerized Linux).
If you just want to read code, we can point you there. Proper documentation isn't quite there yet. Also, there isn't one singular issue for this, but it is discussed at various points in various issues. What sort of things would you like to know?
Hi all, checking in from bioconda. I've been poking around the conda-forge code and can't pin down where the magic is happening. Could you point to some code or to a description of what's happening to aggregate the one-recipe-per-repo repositories?
To further the discussion, here's a description of the bioconda build system and where you can find the code.
`.travis.yml` calls `scripts/travis-setup.sh` on OSX and Linux, which starts a Docker container on Linux or does the OSX setup otherwise. Then `scripts/build-packages.py` is run, which does most of the work.
The workflow is just like most anything else on github: submit a PR and wait for it to be tested. Once it passes, someone on the team merges it into master. Upon merging, travis-ci then runs again but on the master branch and this time upon completing, the built packages are uploaded to anaconda.
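For readers following along, here is a minimal sketch of that test-vs-deploy split. It is not bioconda's actual `build-packages.py`; the recipe path is hypothetical, and it assumes only the standard Travis environment variables plus the `conda build` and `anaconda upload` command-line tools:

```python
# Minimal sketch (not bioconda's real code) of "test on PRs, upload on master".
import os
import subprocess


def build_recipe(recipe_dir):
    """Build a recipe with conda-build and return the built package's path."""
    subprocess.check_call(["conda", "build", recipe_dir])
    out = subprocess.check_output(["conda", "build", "--output", recipe_dir])
    return out.decode().strip()


def maybe_upload(package_path):
    """Upload only on a push to master, never for pull requests."""
    on_master = (
        os.environ.get("TRAVIS_BRANCH") == "master"
        and os.environ.get("TRAVIS_PULL_REQUEST") == "false"
    )
    if on_master:
        subprocess.check_call(["anaconda", "upload", package_path])


if __name__ == "__main__":
    pkg = build_recipe("recipes/my-tool")  # hypothetical recipe directory
    maybe_upload(pkg)
```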
Aside from differences in the moving parts of the build systems, it sounds like we're all dealing with similar issues with respect to CUDA and gcc, etc. Would be nice to work out some best-practices that we could all use.
Welcome @daler.
Could you point to some code or to a description of what's happening to aggregate the one-recipe-per-repo repositories?
Sorry I'm not following this question. Could you please clarify what you are meaning by aggregate? It is a little unclear and I am a bit worried that there may be some misunderstanding of what is going on here. I'll try to clarify the big picture below.
To further the discussion, here's a description of the bioconda build system and where you can find the code....
Yes, SciTools and IOOS behave in a similar manner. However, those recipes along with many from conda-recipes are being ported over here as people from those groups seem to like this model.
Just to clarify, the model for building is very different here than with many recipes in a single repo. The reasons are varied, but I think the biggest difference is that it allows people to take ownership of the recipes/packages that are important to them and of the tools (CIs) used to test, build, and deploy. This includes making bug fixes, releases, feature support, etc. Similarly, it allows relevant discussion to break along those lines. In practice, this appears to be a huge asset. However, there are plenty of other reasons for one to consider this model.
How this works:
While understanding this infrastructure may at first seem daunting, it is actually not so bad and is not really necessary. However, if you are curious, we are more than happy to explain the details.
Maybe if you could please rephrase your question in terms of these steps, we can do a better job of answering your questions and providing you places to look for more information.
Aside from differences in the moving parts of the build systems, it sounds like we're all dealing with similar issues with respect to CUDA and gcc, etc. Would be nice to work out some best-practices that we could all use.
Absolutely, we would be happy to point you to relevant issues where these are being discussed. Just please let me know which of these you would like to know more about.
@daler, aggregation is done at the https://github.com/conda-forge/feedstocks/tree/master/feedstocks repo. This is created with conda-smithy, particularly this module: https://github.com/conda-forge/conda-smithy/blob/master/conda_smithy/feedstocks.py
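For anyone who wants the shape of that aggregation without reading `feedstocks.py` itself, here is a hedged sketch (not conda-smithy's actual code): walk the conda-forge org for `*-feedstock` repos via the GitHub API and register each one as a git submodule. Pagination is handled naively, and authentication and rate limits are ignored:

```python
# Illustrative only: collect every conda-forge "*-feedstock" repo and
# add it as a submodule under feedstocks/.
import subprocess
import requests


def iter_feedstock_repos(org="conda-forge"):
    page = 1
    while True:
        resp = requests.get(
            "https://api.github.com/orgs/{}/repos".format(org),
            params={"page": page, "per_page": 100},
        )
        resp.raise_for_status()
        repos = resp.json()
        if not repos:
            return
        for repo in repos:
            if repo["name"].endswith("-feedstock"):
                yield repo["name"], repo["clone_url"]
        page += 1


for name, url in iter_feedstock_repos():
    subprocess.check_call(["git", "submodule", "add", url, "feedstocks/" + name])
```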
Continuum is very interested in this particular aspect (I am Continuum's representative here, though others are also involved in contributing recipes and discussing build tools). The one-repo-per-recipe model is necessary, I think, for two reasons:
- keep the load on the CI services small, and avoid their log size and build time limits
- divide responsibilities and authority for each recipe with much finer granularity
The latter is the bigger issue here, since you all have had reasonable success with CI.
Continuum has started a community channel (https://anaconda.org/pycommunity), with the long-term plan to have that as a package aggregation center. In my mind, the most important facet of this effort is to unite the recipes and have a single canonical source for each recipe. I don't care whether it's on some project's page (e.g. matplotlib), or on conda-forge, or wherever - so long as one place is the official source, and finding that source and contributing to it is straightforward. conda-forge is a great place to host recipes because it provides the CI of those recipes, and I like the distributed maintainer model, but I also think that hosting recipes directly at projects, and having conda-forge build from indirectly-hosted sources, would be the ideal - that way the recipe would be holistically managed by the package originators.
For the pycommunity channel, we'll mirror or link packages from other channels. In the case of multiple package sources, we haven't quite figured out how to prioritize them (activity level? origin of package?). The hope is that rather than many organizations having to say "add our channel!", we'd instead have just one, and that one may be enabled by default for some "community edition" of miniconda/anaconda - or otherwise could be enabled with `conda install pycommunity`.
@jakirkham and @msarahan thanks for your pointers. One missing piece for me was that submitting a PR to `staged-recipes` triggers the CI (only travis, right?) to call `.CI/create_feedstocks`, which sets up the infrastructure, tokens etc. via `conda-smithy` and transforms the repo into something similar to what's in the feedstocks repo of submodules. Is that correct?
@msarahan -- Wholeheartedly agree that a single canonical source for each recipe is critical, and that finding that source and contributing needs to be straightforward. conda-forge/conda-smithy and pycommunity look like great tools to make that happen.
@jakirkham and @msarahan thanks for your pointers.
Glad to help, @daler. Hope it wasn't too much. Just wanted to make sure we had common context for our discussion. :smile:
One missing piece for me was that submitting a PR to `staged-recipes` triggers the CI (only travis, right?)...
When a PR is submitted all CIs (Travis/Mac, Circle CI/Linux, AppVeyor/Windows) are run and used to attempt to build the recipe, but do not release it.
...to call `.CI/create_feedstocks`, which sets up the infrastructure, tokens etc. via `conda-smithy` and transforms the repo into something similar to what's in the feedstocks repo of submodules. Is that correct?
Once the PR is merged, a Linux job in the Travis CI build matrix does the setup for the feedstock. It goes something like this for each recipe unless otherwise specified (steps 7, 8, and 9), including adding a `conda-forge.yml` and a `.gitignore` to the new feedstock.
As you have mentioned, this all basically happens through `conda-smithy`. However, there is some code that lives here for that purpose too. Take a look at this log for `configparser` and `entrypoints` to get a better idea.
After generating a feedstock, a global feedstock update is run. It is pretty simple. It updates the feedstocks with the latest commit of each feedstock on `master` at conda-forge. It also updates the listing. However, changes may not be reflected in the listing immediately, even if the changes have been made to the HTML source code, due to how GitHub caches GitHub Pages.
Perfect, these were just the kinds of details I was looking for. Thanks. Hopefully it can help get others up to speed as they join the discussion as well.
Hi guys, thanks for initiating this. It is very interesting to exchange ideas of how to build. I have two questions:
Have you ever considered using the anaconda build service? I recently had a look at it, and it seems to me to be centered on packages instead of repositories/organizations, which is kind of unfortunate, because it needs to be set up for each package, right?
Yes, especially for Windows builds. Mapping conda-forge's model to Anaconda.org should be OK - the organization would be conda-forge, and each package would be a different build. Maybe I'm missing how this is different from the other CI services? Anyway, the hangup has been that anaconda.org has some kinks that need to be worked out.
With your conda-forge model, how do you deal with dependencies between recipes?
ATM, I think the answer is "we don't." There has been discussion about coming up with networkx-driven guidance of what recipes to work on next, but that has been for human consumption more than automated buildout of dependency trees. Before getting involved in conda-forge, Continuum developed a build script that also uses networkx, and builds out these trees. That code assumes a single folder of packages, which can be created from conda-forge using conda-smithy. The dependency building code is part of ProtoCI: https://github.com/ContinuumIO/ProtoCI/blob/master/protoci/build2.py
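To make the networkx idea concrete, here is a rough sketch of dependency-ordered building over a single folder of recipes. It is not ProtoCI's code, and it assumes plain-YAML `meta.yaml` files (recipes using jinja templating would need to be rendered first):

```python
# Sketch: topologically sort recipes so dependencies build first.
import os
import yaml
import networkx as nx


def load_recipe(recipe_dir):
    """Return (package name, set of dependency names) for one recipe."""
    with open(os.path.join(recipe_dir, "meta.yaml")) as f:
        meta = yaml.safe_load(f)
    reqs = meta.get("requirements") or {}
    deps = (reqs.get("build") or []) + (reqs.get("run") or [])
    # Strip version pins like "numpy >=1.10" down to the bare package name.
    return meta["package"]["name"], {d.split()[0] for d in deps}


def build_order(recipes_root):
    graph = nx.DiGraph()
    local_deps = {}
    for entry in os.listdir(recipes_root):
        recipe_dir = os.path.join(recipes_root, entry)
        if os.path.exists(os.path.join(recipe_dir, "meta.yaml")):
            name, deps = load_recipe(recipe_dir)
            local_deps[name] = deps
            graph.add_node(name)
    for name, deps in local_deps.items():
        for dep in deps & set(local_deps):  # only edges between local recipes
            graph.add_edge(dep, name)
    return list(nx.topological_sort(graph))


print(build_order("recipes"))
```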
Thanks for the clarification. My point is the following: if the anaconda build service could be set up per repository and not per package, CI job limits are no longer a reason to have separate repositories per recipe, right?
I think separate repos per recipe are still a good thing, because it gives you complete control over who has permission to accept changes to a recipe. I don't know how we'd do that with many recipes under one umbrella.
Before getting involved in conda-forge, Continuum developed a build script that also uses networkx, and builds out these trees. That code assumes a single folder of packages, which can be created from conda-forge using conda-smithy. The dependency building code is part of ProtoCI: https://github.com/ContinuumIO/ProtoCI/blob/master/protoci/build2.py
Would this work on the `feedstocks` repo, possibly with some tweaks? This might be a good way to get things going, and it would also avoid having several scripts created here that kind of do something like this. Thoughts?
Sure, I think so. It would need to be adapted to look into the nested recipes folder, but otherwise I think it would work fine. It may also have trouble with jinja vs. static version numbers - but again, that's tractable.
@msarahan I agree, this is in general a nice advantage. I asked because the situation is different for bioconda. There, we have a rather controlled, collaborative community, and it is much more convenient to have all recipes in one repository (e.g. for toposorting builds).
Yeah, the one thing we don't have figured out well yet is how to edit multiple recipes at once. For aggregating them and building them as a set, I think conda-smithy + ProtoCI abstract away the difficulties with one repo per recipe.
But if you build them as a set, you have the problem with job limits in the CI again, don't you?
Yeah, I figure the nested directory structure needs to be addressed. Otherwise, adding jinja template handling is probably valuable no matter where it is used, no?
adding jinja template handling is probably valuable no matter where it is used, no?
Absolutely. In case you missed it, @pelson has a nice snippet at https://github.com/conda-forge/shapely-feedstock/issues/5#issuecomment-208377012
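For context, "rendering the jinja before the YAML parse" looks roughly like this (a sketch in the same spirit as the linked snippet, not a copy of it; it only copes with simple `{% set ... %}`-style templating):

```python
# Render a recipe's jinja templating, then parse the result as YAML.
import yaml
import jinja2


def render_meta(path):
    with open(path) as f:
        template = jinja2.Template(f.read())
    # {% set ... %} assignments are evaluated; plain unknown variables
    # render as empty strings with jinja2's default Undefined.
    return yaml.safe_load(template.render())


meta = render_meta("recipe/meta.yaml")
print(meta["package"]["name"], meta["package"]["version"])
```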
But if you build them as a set, you have the problem with job limits in the CI again, don't you?
Well, one could consider some sort of debouncing to handle this. Namely, even though one makes the changes together and ultimately submits them all, we would manage the submissions/builds somehow so that they are staggered. This will likely require some thought, but it is useful for some workflows with the recipes.
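To illustrate what I mean by staggering (purely illustrative; `submit` here is a stand-in for whatever pushes a change and triggers CI):

```python
# Debouncing sketch: space out a batch of related submissions so the
# CI services are not hit with everything at once.
import time


def submit_staggered(feedstocks, submit, delay_seconds=300):
    """Call `submit` for each feedstock, pausing between submissions."""
    for i, feedstock in enumerate(feedstocks):
        if i:
            time.sleep(delay_seconds)
        submit(feedstock)


# Example usage: "submit" is just print here.
submit_staggered(["pkg-a-feedstock", "pkg-b-feedstock"], submit=print, delay_seconds=1)
```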
But if you build them as a set, you have the problem with job limits in the CI again, don't you?
With anaconda.org, we don't have artificial limits. There are still strange practical limits - like logs that get too large end up making web servers time out. These are tractable problems.
Interesting, thanks for the link. I'll take a closer look.
@msarahan, I know, you don't have these limits, but my understanding was that anaconda.org cannot out of the box build recipes as a set, right? You have to register an individual trigger for each of them? And then, their order of execution is no longer determined, and they can't depend on each other. Or am I missing something here?
@johanneskoester there would need to be some intermediate representation as a collection of recipes. Then that ProtoCI tool would be able to build things that changed. It is written to build packages based on which packages are affected by a git commit. Here, obviously only one recipe could trigger, rather than many changing at once. That does not affect its ability to build required dependencies, though - and they'll be built in topological order.
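A hedged sketch of that trigger side, assuming a dependency graph like the one sketched further up (with nodes named consistently with the recipe directories) and recipes living under a `recipes/` folder:

```python
# Sketch: find recipes touched by a commit, then everything downstream
# of them, in topological (dependency-first) order.
import subprocess
import networkx as nx


def changed_recipes(git_range="HEAD~1..HEAD", recipes_root="recipes"):
    files = subprocess.check_output(
        ["git", "diff", "--name-only", git_range]
    ).decode().splitlines()
    prefix = recipes_root + "/"
    return {f.split("/")[1] for f in files if f.startswith(prefix)}


def rebuild_set(graph, changed):
    affected = set(changed)
    for name in changed:
        affected |= nx.descendants(graph, name)  # everything depending on it
    return [n for n in nx.topological_sort(graph) if n in affected]
```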
@msarahan, maybe this is not the right thread (I don't want to bother the rest with my detail questions here, so feel free to stop me if you feel this becomes off-topic). Ok, but protoCI has to run e.g. on travis, right? That means, even if protoCI triggers builds on anaconda.org in the right order, travis would still need to wait on the results, in order to be able to report back to github? Which would result in the same timeout issues? Sorry if I misunderstand something here.
ProtoCI was designed to run on anaconda.org. If anyone else wants to get it to run on Travis, that's cool, but that wasn't what it was written for. It would not be triggered by Travis or any other build - rather, it would be a new CI service in addition to or instead of the existing CI services.
Great! Now it makes sense, sorry I did not know that. So, is there any documentation on how we could set up ProtoCI for bioconda? We already have a repository with multiple recipes in place. Or is that not possible yet?
I'm not completely sure how possible it is. I have been involved, but not the one doing most of the real work. It is live on conda-recipes, and you should start with the `.binstar.yml` there as an example, but you'll have to tweak it for your build queue:
https://github.com/conda/conda-recipes/blob/master/.binstar.yml
In short, protoci should be installed on the build workers. It is on ours, but I can't speak for how you have your queue set up. Your build script should just call the `protoci-difference-build` entry point.
Thanks Mike, that's good news!
A question regarding conda-forge: On which linux system/libc version do you build?
That's sort of in flux. See https://github.com/conda-forge/conda-forge.github.io/issues/29
I think the current one is this: https://github.com/pelson/Obvious-CI/blob/master/obvious-ci.docker/linux64_obvci/Dockerfile
Thanks for this healthy discussion, exactly what I wanted to trigger :)
One additional question from me. Does conda-forge have any other channels (`default`, I assume) activated during build time?
Does conda-forge have any other channels (`default`, I assume) activated during build time?
Nope. Only the `default` channel.
Just to say that I'm in favour of the single repos (one repo per recipe) as it currently is in conda-forge/feedstocks. Although I didn't go as ambitious as the bioconda/ioos/scitools/omnia crowd, I've also been maintaining a set of recipes that we needed for our project, menpo. Most importantly, I've been really trying to drive Windows support, because so many people in Computer Vision still use Windows for historical (Matlab) reasons. So I'm usually keen to try and help with upstream support for Windows (as well as @msarahan, who has been the real Windows champion).
I'm very interested in the CUDA/OpenCL builds that you guys seem to have. I wonder if we could become the go-to place to pick up projects like Theano for deep learning?
Just to say that I'm in favour of the single repos (one repo per recipe) as it currently is in conda-forge/feedstocks
Thanks for the honesty. Technically conda-forge/staged-recipes is a many-recipe repository too; it is just that we've automated it so that recipes immediately get deleted and added to their own repository on merge. 😉
With that in mind, you may be aware of conda-build-all, which is the tool we use in this repo to build all recipes in a single repo. It (and its predecessor ObviousCI) was the tool we used in IOOS and SciTools (amongst others) to build and upload to our respective channels. Because of staged-recipes' dependence upon it, we are going to need to continue to maintain that capability, so if you're looking for shared tooling for the single-repo, many-recipe use case, you might want to take a look.
As you've highlighted, even if you don't favour the approach we have taken at conda-forge, there is still huge potential for us to collaborate so that we can collectively package in a consistent and coherent way. Your input so far has been exceptionally valuable, and long may it continue! 👍
As you've highlighted, even if you don't favour the approach we have taken at conda-forge, there is still huge potential for us to collaborate so that we can collectively package in a consistent and coherent way. Your input so far has been exceptionally valuable, and long may it continue! :+1:
I would like to highlight this and raise 3 points to start closer cooperation.
@bgruening this sounds very reasonable. General purpose libraries can go into conda-forge, but I can also imagine just passing them over to Continuum. I am also not quite sure whether to prefer conda-recipes or conda-forge for that... Can I expect that conda-forge PRs are handled faster than on conda-recipes?
Can I expect that conda-forge PRs are handled faster than on conda-recipes?
This was my expectation. One huge benefit of the bioconda model is that if you need a recipe, you can get it within 30 min. conda-forge is hopefully way faster than conda-recipes :)
Can I expect that conda-forge PRs are handled faster than on conda-recipes?
This was my expectation. One huge benefit of the bioconda model is that if you need a recipe, you can get it within 30 min. conda-forge is hopefully way faster than conda-recipes :)
If you are proposing a new recipe, then you have to wait on one of the conda-forge "staged-recipes" maintainers.
If you are proposing an update to an existing recipe, then you have to wait on one of the feedstock maintainers (as listed in `recipe/meta.yaml`).
If you are proposing an update to an existing recipe for which you are a maintainer, you are waiting on yourself (and maybe the CI to finish).
In general I'm keen to be very open about membership of the "staged-recipes" group. The important qualities of a maintainer in that team are an eye for detail, a feeling for what is "maintainable", and shed loads of experience of reading and writing conda recipes. I suspect most people in this thread meet those criteria, and after someone proposed just 3 or 4 PRs which merge smoothly, I'd be happy to say they were a good candidate for membership (though with all the noise on conda-forge, it is probably necessary to ask, rather than have it suggested).
Hey guys, I have ported a couple of my own recipes from bob.conda to conda-forge, and I used this script to port things over. It is not well written and I am sure you can write better scripts, but as @jakirkham mentioned, maybe sharing it with you guys could help you automate your porting process.
Thanks for sharing this, @183amir. :smile:
I guess the most important part is that I load recipes with ruamel.yaml and take the `example` recipe as a base, updating it with my own recipe.
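For the curious, the core of that approach looks roughly like this (a sketch, not the actual script; the paths are hypothetical, and plain-YAML recipes are assumed, so any jinja lines would need extra handling):

```python
# Overlay an existing recipe onto conda-forge's example recipe, using
# ruamel.yaml's round-trip mode so comments and ordering are preserved.
import ruamel.yaml

yaml = ruamel.yaml.YAML()  # round-trip load/dump

with open("staged-recipes/recipes/example/meta.yaml") as f:
    base = yaml.load(f)
with open("old-channel/my-package/meta.yaml") as f:  # hypothetical source
    ours = yaml.load(f)

# Overwrite the example's sections with ours, keeping any conda-forge
# boilerplate our old recipe lacked.
for section in ("package", "source", "build", "requirements", "test", "about"):
    if section in ours:
        base[section] = ours[section]

with open("meta.yaml", "w") as f:
    yaml.dump(base, f)
```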
At bioconda, we are evaluating whether it makes sense to specify the compiler in the recipe. Our feeling is that this provides some advantages, e.g.
We are unsure though about osx (since clang seems to be used there by e.g. the default channel). What are your thoughts on this? It looks like no conda-forge recipe currently depends on gcc.
Hi @johanneskoester, sorry, your comment got buried in notifications I am afraid; I am only now discovering it while going back through some things.
Generally, our feeling here is we want to move away from using a `gcc` package. We have been making steps in that direction. In particular, we now use CentOS 6 with devtoolset-2 for nearly all of our building on Linux. On Mac, we require 10.7, where the system compiler (`clang`) has C++11 support. This largely meets our needs at present. There are a few exceptions when dealing with OpenMP and/or Fortran. Though we may re-evaluate our strategy here in the future. See this issue ( https://github.com/conda-forge/conda-forge.github.io/issues/29 ) for more details on various proposals.
While it is a nice idea in theory to use a compiler package, in practice this doesn't fare so well. One reason is we can't make any guarantees about GLIBC compatibility on Linux, as the compiler could be used to build anywhere. So, we have opted to work on proper docker containers that include a compiler in them. Another reason this doesn't work well is that if we run into an issue with the compiler package (as we did recently), we are largely incapable of fixing it due to its long build time, which exceeds CI limits. As a result, we can at best use kludgy hacks to try and solve the problem. In the worst case, we find ourselves crippled.
Maybe a better long term strategy that provides the same guarantees without the same issues would be to create a pseudo compiler package. This package could be used to verify that a compatible compiler can be found. Additionally, this package could be used to perform some sort of configuration to ensure the compiler it found is used. This would allow proper constraints in an explicit manner, but avoid the pains associated with the packaged compiler.
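To make the pseudo-package idea concrete, here is a sketch of the kind of check such a package might run at build time; nothing like this exists yet, and the version floor below is purely illustrative:

```python
# Verify that a compatible compiler is already present, rather than
# shipping one. Hypothetical sketch; 4.8 mirrors devtoolset-2's gcc.
import re
import subprocess
import sys

MIN_GCC = (4, 8)


def find_gcc_version():
    try:
        out = subprocess.check_output(["gcc", "--version"]).decode()
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.search(r"(\d+)\.(\d+)\.\d+", out)
    return tuple(map(int, match.groups())) if match else None


version = find_gcc_version()
if version is None or version < MIN_GCC:
    sys.exit("no gcc >= %d.%d found; please install one" % MIN_GCC)
print("compatible compiler found: gcc %d.%d" % version)
```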
Thanks for the answer! We also used devtoolset-2 before, but unfortunately we went in exactly the opposite direction, requiring the gcc package for all recipes that compile something. Regarding the gcc issues: they appear to be in the conda-forge gcc package, right? Why do you shadow the gcc package from the default channel at all? That one also should not have the CI issues, because Continuum builds on anaconda.org, right?
Thanks for the answer! We also used devtoolset-2 before, but unfortunately we went in exactly the opposite direction, requiring the gcc package for all recipes that compile something.
I see. Well, we are partially following @msarahan's lead here. Though maybe a bit slower than he would like. He has made a strong case for not using the packaged gcc.
Regarding the gcc issues: they appear to be in the conda-forge gcc package, right?
Nope. We tried to package it because of the problems we had with it, but we couldn't.
Why do you shadow the gcc package from the default channel at all?
As stated before, we don't.
That one also should not have the CI issues, because Continuum builds on anaconda.org, right?
Not sure I follow this question. Could you please clarify what you mean here?
We all love conda, and there are many communities that build awesome packages that are easy to use. I would like to see more exchange between these communities: to share more build scripts, to develop one best-practice guide, and finally to have channels that can be used together without breaking recipes - a list of trusted channels with similar guidelines.
For example, the bioconda community, specialised in bioinformatics software: they have some very nice guides on how to develop packages, and they review and bulk-patch recipes when there are new features in conda to make the overall experience even better. ping @johanneskoester, @daler and @chapmanb of bioconda fame
Omnia has a lot of cheminformatics software and a nice build box based on phusion/holy-build-box-64 + CUDA and the AMD APP SDK. ping @kyleabeauchamp, @jchodera
With conda-forge there is now a new one, and it would be great to get all interested people together to join forces here and not replicate our recipes or copy them from one channel to another just to make them compatible.
Another point is that we probably want to move recipes to `default` at some point and deliver our work back to Continuum - so that we can benefit from each other. I can imagine that we all form a group of trusted communities and channels and activate them by default in our unified build box - or we have one giant community channel. All this I would like to discuss with everyone who is interested, and come up with a plan for how to make this happen :)
What do you all think about this? As a next step I would like to create a doodle to find a meeting date where at least one representative from each community can participate.
Many thanks to Continuum Analytics for their continued support and the awesome development behind scientific Python and this package manager. ping @jakirkham @msarahan