A new YAML based format for "conda-build" files

wolfv commented 1 year ago

✏️ Rendered version here: https://github.com/wolfv/ceps/blob/cep_20.1/cep-20.1.md

marcelotrevisani commented 1 year ago

I think that would be very useful, it will reduce the complexity to parse those yaml comments that are a pain. But I also think that if we could move to use toml files instead of yaml mixed with jinja2 would be way better but that is another history 😬

wolfv commented 1 year ago

How would TOML files work? Wouldn't we still need some Jinja here and there?

The good thing about the pure YAML format is that we can translate it to either the next evolution of the YAML format or a TOML based format quite easily.

In fact, someone has already added TOML support (more specifically recipe embedding inside a pyproject.toml file in boa - precisely because it is possible).

marcelotrevisani commented 1 year ago

How would TOML files work? Wouldn't we still need some Jinja here and there?

The good thing about the pure YAML format is that we can translate it to either the next evolution of the YAML format or a TOML based format quite easily.

In fact, someone has already added TOML support (more specifically recipe embedding inside a pyproject.toml file in boa - precisely because it is possible).

regarding toml files I was thinking preprocess it and postprocess it if necessary something like poetry but on steroids, 😆 like

[requirements.host]
pytest = {version=">=5.0.0,<6.0.0", platform=["linux", "osx"], preprocess="func_to_preprocess", postprocess="func_to_postprocess"}

if it needs preprocessing it could call some predefined functions to do some magic that we might need

if we need variables to be used inside it can use

[env.variables]
my_var="abc"

[package]
name="${my_var}"

and it can inject it and expand when processing the recipe, if I am not mistaken there is a crate that does these kinda of env var expansion.

but as I mentioned, that is another whole history 😄

What you have proposed I think would improve and simplify a lot of parsing and when we need to pragmatically create the recipes. But I wish to get ride of all the jinja, because they are also a pain sometimes 😆 , however, I also understand that would be very difficult to remove them

travishathaway commented 1 year ago

One thing that also came to mind as I was reading the CEP was what the "migration strategy" would be for existing recipes. You mention this a little in the "short comings" section, but maybe it's worth it to have its own section.

Unfortunately, migration will be a PITA because it will most likely involve manually updating recipes (will it?), but maybe there's a tool we could write that could handle translating 90% of the of existing recipes (I'm thinking like a Python 2 to 3 migration tool).

Or, maybe migration isn't even a concern 🤷‍♂️. It should probably be mentioned somewhere in the document.

wolfv commented 1 year ago

Thanks for the comments @travishathaway. Indeed, migration is a concern. In boa we have a small & hacky tool that can convert simple recipes (boa convert meta.yaml > recipe.yaml).

It works by analyzing a line, and if it finds a comment it moves the comment right after the - and wraps it with a sel(...). It also parses the {% set ... expressions and sticks them in a context dictionary.

But obviously this works only for relatively straightforward cases.

JeanChristopheMorinPerso commented 1 year ago

I'm curious, the title says "A new recipe format – part 1". Is there a part 2 somewhere or some written notes about a possible future part 2?

wolfv commented 1 year ago

Hi @JeanChristopheMorinPerso indeed, there is the "full" proposal (of which this is a derivative) here: https://github.com/conda-incubator/ceps/pull/53

Also, all this work is based on 2+ years of work on the boa "build tool" as well as the rattler-build build tool. Both are many times faster than the current conda-build and implement this recipe format :)

wolfv commented 1 year ago

The current state of affairs is that we have 2 mini-languages + YAML (comments & Jinja) for the recipes. That makes it really hard to write a parser for it to statically extract information from the recipe or transform it in any way.

The way Jinja support is implemented in conda-build is also quite tricky: it does recursive evaluations of the recipe to find all the necessary values for the used Jinja variables. This contributes greatly to conda-build being extraordinary slow for larger recipes (especially with multiple outputs).

I think the way Jinja is used in conda-build currently is not very deliberate. It's usage is completely unrestricted, but most recipes only use a tiny subset (string interpolation & one or two "set" statements). However, the possibilities of Jinja blur the line between: is this YAML file a script or is this YAML file a static description of the package? Right now it's both, somehow, and not good at either.

If we have decide on a somewhat more limited YAML format as input, it will be very straight-forward to generate the recipes using Python or any other language (we are actually doing this for ROS recipes in the RoboStack project where we've been using boa for 2+ years now). So that can be an "escape-hatch" for some recipes that are more annoying to write without full Jinja support. At the same time, the old recipe format will be supported indefinitely so it's also no problem to stay with what is there.

Another nice property of the new recipe format will be the effortless evolution of it. Since it's static YAML we can come up with version 2 and make a migrator that rewrites the recipe to the new spec - something that's super hard to do with the current format.

So to summarize the motivations:

we get rid of a hybrid script/static thing
we remove many complications in parsing & evaluating recipe files
we only loose functionality for very few recipes for which we can figure out better ways
generating and migrating recipes will be super straightforward

wolfv commented 1 year ago

Another motivating example: we have a version autotick bot in the boa-forge collection of recipes and it requires very little code (~600 lines of Python). The part about reading & writing the recipe is trivial: https://github.com/mamba-org/boa-forge/blob/de9a92f755f21f736e380b39bba925c755b1e35f/determine_versions.py#L607-L648

It produces PRs such as: https://github.com/mamba-org/boa-forge/pull/131/files

baszalmstra commented 1 year ago

Overall I think this is a great step. Having a more easily machine-readable recipe would open the doors for more tools to generate and parse recipes.

I do also prefer the aforementioned if syntax over the sel(..) syntax because:

No need for the weird key formatting. sel(..) as a key looks strange in yaml land.
It's the same approach used by Github CI.
I think it's a bit easier to read.
We still use minijinja (or whatever) to evaluate the selectors which is still a second language. If we use the if syntax we could later also extend that to make it more machine-readable as well!
```
if:
    or:
        - platform: win32 # not a minijinja expression
        - platform: osx
```

This is what it could look like (snippet is actually created by @wolfv , I only changed it slightly). Note that the value for if is also just a minijinja expression just without the {{ }} (this is also similar to Github CI).

context:
  name: xtensor
  version: 0.24.6
  sha256: f87259b51aabafdd1183947747edfff4cff75d55375334f2e81cee6dc68ef655

package:
  name: "{{ name|lower }}"
  version: "{{ version }}"

source:
  fn: "{{ name }}-{{ version }}.tar.gz"
  url: https://github.com/xtensor-stack/xtensor/archive/{{ version }}.tar.gz
  sha256: "{{ sha256 }}"

build:
  number: 0
  skip:
    - if: osx or win # note that the value is a minijinja expression
      value: true
    - if: win
      # Defaults to: (because we can easily derive this from the context) 
      # value: true

requirements:
  build:
    - "{{ compiler('cxx') }}"
    - cmake
    - if: unix
      value: make
  host:
    - xtl >=0.7,<0.8
  run:
    - xtl >=0.7,<0.8
  run_constrained:
    - xsimd >=8.0.3,<10

test:
  commands:
    - if: unix
      value:
        - test -d ${PREFIX}/include/xtensor
        - test -f ${PREFIX}/include/xtensor/xarray.hpp
        - test -f ${PREFIX}/share/cmake/xtensor/xtensorConfig.cmake
        - test -f ${PREFIX}/share/cmake/xtensor/xtensorConfigVersion.cmake
    - if: win
      value:
        - if not exist %LIBRARY_PREFIX%\include\xtensor\xarray.hpp (exit 1)
        - if not exist %LIBRARY_PREFIX%\share\cmake\xtensor\xtensorConfig.cmake (exit 1)
        - if not exist %LIBRARY_PREFIX%\share\cmake\xtensor\xtensorConfigVersion.cmake (exit 1)

wolfv commented 1 year ago

I honestly don't care so much for either approach. I think the sel(...) syntax is more terse which might be nice.

There is one feature that is hard to express with the if syntax which is a dictionary with multiple keys, e.g.:

build:
  number: 0
  script:
    sel(unix): |
      cmake ${CMAKE_ARGS} -DBUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX=$PREFIX $SRC_DIR -DCMAKE_INSTALL_LIBDIR=lib
      make install
    sel(win): |
      cmake -G "NMake Makefiles" -D BUILD_TESTS=OFF -D CMAKE_INSTALL_PREFIX=%LIBRARY_PREFIX% %SRC_DIR%
      nmake
      nmake install

However, there was also a bit of internal debate about this because we would need to define the behavior if more than one sel(...) evaluates to true. In that case, in the current implementations, only the first one wins.

baszalmstra commented 1 year ago

I don't see why that would be hard. Instead of a dictionary, just use a list:

build:
  number: 0
  script:
    - if: unix
      value: |
        cmake ${CMAKE_ARGS} -DBUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX=$PREFIX $SRC_DIR -DCMAKE_INSTALL_LIBDIR=lib
        make install
    - if: win
      value: |
        cmake -G "NMake Makefiles" -D BUILD_TESTS=OFF -D CMAKE_INSTALL_PREFIX=%LIBRARY_PREFIX% %SRC_DIR%
        nmake
        nmake install

However, I think we could do even better if the selector evaluation part is part of the model. The current implementation in Rust does two passes over the yaml. First, it evaluates the selectors which generates a new yaml file and then it parses it into a specific model format. While flattening the structure it can become unclear whether a list, a string, or an object was expected.

I think it would be better to make the selectors part of an explicit model, evaluate that model with the selectors, lowering it into another (evaluated) model. This will most likely also give the user much better error messages if two selectors accidentally result in two valid values. It also allows us to generate a JSON schema that can be used in IDE's to get autocompletion support.

wolfv commented 1 year ago

The biggest fear of the if: ... based syntax is that we now think that it's more "correct" or something like that, but then it might present a hurdle to adoption because in effect it's longer / more annoying to type :)

wolfv commented 1 year ago

We had some more discussions today and came up with some changes:

instead of forcing the user to quote "{{ jinja }}" variables (which always coerces them to be strings in YAML) we would rather go the Github route and use ${{ jinja }} (which plays well with YAML), or use another bracket, like ((. However we're leaning towards the Github syntax.
this also enables us to use the ternary syntax in places where the YAML if would be complicated to use. Moreover, the ternary if is infinitely nestable. E.g. this is possible:
```
build:
number: ${{ 0 if win else 100 }}
```
and has the correct integer type etc.

We also looked at the challenge of the following example: suppose win needs an MD5 hash, and unix needs a SHA256 for a given source file (a bit contrieved, admittedly). With the if or sel syntax, the cleanest would be to repeat the block like:

source:
  - if: win
    url: https://github.com/killercup/cargo-edit/archive/refs/tags/v{{ version }}.tar.gz
    sha256: 46670295e2323fc2f826750cdcfb2692fbdbea87122fe530a07c50c8dba1d3d7
  - if: unix
    url: https://github.com/killercup/cargo-edit/archive/refs/tags/v{{ version }}.tar.gz
    md5: 0800fc577294c34e0b28ad2839435945

Although in this case it would be possible to solve with a Jinja if as well.

For lists, the syntax would look like:

    - if: win
      then:
      - cmake
      - ninja
      else:  # optional
      - ...

The missing then makes the syntax nicer for dictionaries, but also make it less homogenous.

jakirkham commented 1 year ago

Thanks Wolf! 🙏

Really appreciate you taking the time to write this up with nice examples. Think there are a lot of good improvements here. The biggest simply being in one markup language

Asked a few questions and made a few minor suggestions above. Will probably need to mull on this more and bring a few more examples into the discussion so we can think about what it would mean to move recipes into a format like the one proposed here

Another question that comes up for me frequently, which may be off-topic for this, but I'd like to at least put it out there. The multi-output format that we have today was a bolt on to an existing single output format. Though I wonder if we could do better by moving to some kind of model where the multi-output format is a first-class citizen. In any event this may be CEP in its own right, but I think this would be a valuable discussion to have and would be very beneficial for recipe maintainers. Something to think about 🙂

wolfv commented 1 year ago

Hi @jakirkham thanks for your reply! Indeed, multi-outputs are one of the things I desperately want to fix and it's going to be the topic of another CEP. In that CEP, the idea will be to make the outputs completely independent (no weird inheritance of dependencies etc.) and have no-package outputs (that other outputs can depend on as a build cache).

This one here focuses on the YAML format for future recipes.

We did go back to the drawing board a little and came up with the following:

Allow the conditional if syntax only in lists, such as dependency, script or or run exports. In all other places, users can use the "ternary" operator of MiniJinja. This will simplify the implementation quite a lot.

This is what an example recipe with proposed syntax would look like:

context:
  name: xtensor
  version: 0.24.6
  sha256: f87259b51aabafdd1183947747edfff4cff75d55375334f2e81cee6dc68ef655

package:
  name: ${{ name|lower }}
  version: ${{ version }}

source:
  fn: ${{ name }}-${{ version }}.tar.gz
  url: https://github.com/xtensor-stack/xtensor/archive/${{ version }}.tar.gz
  sha256: ${{ sha256 }}

build:
  number: 0
  skip:
    # special case if -- raw "statements"
    # note that the value is a minijinja expression
    - osx or win 
    - win
  # alternative syntax
  # script: ${{ "build_cmd.sh" if unix else "build_cmd.bat" }}
  script:
    - if: unix
      then: |
        cmake ${CMAKE_ARGS} -DBUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX=$PREFIX $SRC_DIR -DCMAKE_INSTALL_LIBDIR=lib
        make install
    - if: win
      then: |
        cmake -G "NMake Makefiles" -D BUILD_TESTS=OFF -D CMAKE_INSTALL_PREFIX=%LIBRARY_PREFIX% %SRC_DIR%
        nmake
        nmake install

requirements:
  build:
    - ${{ compiler('cxx') }}
    - cmake
    - if: unix
      then: make
  host:
    - xtl >=0.7,<0.8
  run:
    - xtl >=0.7,<0.8
  run_constrained:
    - xsimd >=8.0.3,<10

test:
  commands:
    - if: unix
      then:
        - test -d ${PREFIX}/include/xtensor
        - test -f ${PREFIX}/include/xtensor/xarray.hpp
        - test -f ${PREFIX}/share/cmake/xtensor/xtensorConfig.cmake
        - test -f ${PREFIX}/share/cmake/xtensor/xtensorConfigVersion.cmake
    - if: win
      then:
        - if not exist %LIBRARY_PREFIX%\include\xtensor\xarray.hpp (exit 1)
        - if not exist %LIBRARY_PREFIX%\share\cmake\xtensor\xtensorConfig.cmake (exit 1)
        - if not exist %LIBRARY_PREFIX%\share\cmake\xtensor\xtensorConfigVersion.cmake (exit 1)

vyasr commented 1 year ago

Hi all, I'm a RAPIDS developer. I am still evaluating the impact of this proposal more generally and can post more later depending on how we see this impacting our packaging overall, but I see a few comments regarding the potential benefits of using "pure YAML" and I can add my two cents here and give a specific example as to why that particular change could be of use (also a :+1: to @wolfv's example of a generated meta.yaml from RoboStack). We've been doing something similar but focusing primarily on dependency management for now in order to single source our dependency lists between multiple Python packages and C++ libraries contained in a monorepo, each of which must product some combination of environment.yaml/meta.yaml/pyproject.toml files in addition to some other more bespoke use cases. We have support for many of the use cases we care about, but we haven't supported conda recipes due to the complexities of parsing a file that is not pure YAML in a context where we may not have all the information needed to evaluate the full Jinja context (i.e. if there are variables available during a conda build that we don't know outside that environment). The best approach we've considered so far is to simply generate many small YAML files containing dependency lists corresponding to each section and then loading them with a Jinja for loop, but that's a pretty rough solution. Having a pure YAML file would make producing the dependency lists dynamically much simpler as stated in the CEP:

However, since the new recipe format is "pure YAML" it is very easy to create and pre-process these files using a script, or even generating them with Python or any other scripting language. That means, many of the features that are currently done with Jinja could be done with a simple pre-processing step in the future.

Note that pulling dependencies from pyproject.toml into meta.yaml is generally insufficient because the two sets of dependencies generally will not match exactly due to various factors (pyproject's build-requires typically splits into conda's build + host, non-Python library dependencies are typically not available on pip, differing package names or metapackage structures between conda and PyPI, etc). You may reduce some apparent redundancy, but in practice you still have to think about both files, and you have now coupled them in a nontrivial way since logic in one file now depends on the contents of the other.

Unrelated, strong :+1: to improving multi-output support, but agreed that it's out of scope here and don't want to divert the discussion. Thanks for putting this together @wolfv!

isuruf commented 1 year ago

@wolfv, can you link the previous discussions? I remember I commented some of my thoughts on an issue, but can't seem to find it.

wolfv commented 1 year ago

Hmm, @isuruf this was the original "discussion document": https://hackmd.io/axI1tQdwQB2pTJKt5XdY5w

Maybe we had some more discussion in one of the conda-forge meetings or on the conda-build issue tracker?

I found some discussion here: https://github.com/conda-forge/conda-forge.github.io/issues/1612

And here are all the boa mentions in conda-forge meetings.

wolfv commented 1 year ago

I've update the CEP with the new syntax as discussed. We'll be implementing the ListWithConditionals in rattler-build today and give it a try :)

sterlinm commented 1 year ago

This sounds very promising!

One thing I'm not entirely clear on is on is whether the Jinja2 functions for loading data from other files described here will continue to work. I'm guessing yes?

wolfv commented 1 year ago

@sterlinm I think in most implementations they would not be supported initially. Especially load_setup_py will definitely not be support in the Rust version (rattler-build).

However, with the new format it is easier than ever to write a small script to create recipes from a script.

Also note that the proposal doesn't allow for something like {% set contents = load_pyproject_toml(...) %} because that syntax doesn't parse as YAML.

We could support some form of this inside the context, but I am not sure on the specifics yet.

sterlinm commented 1 year ago

@wolfv Thanks, that all makes sense!

For scenarios where you want to build a conda recipe and pypi project from the same list of packages, I agree that being able to generate both from simple scripts sounds better than having lots of magic functions in the recipe at the expense of the other goals you've laid out.

wolfv commented 1 year ago

Hi Marcel, big sorry for taking so long to get back on the points!

Yes, maybe it's better to say "Jinja syntax"
The plan is to evaluate YAML first, and then Jinja. So keys can't contain Jinja. I should probably add that to the proposal if it wasn't clear.
There is indeed an evaluation order in the context (it's parsed as an OrderedDict and evaluated top-to-bottom). I thought I had specified that somewhere, but I didn't mention anything about it having to be acyclical. I also didn't really think about it that way as I thought that it would be obvious that it's evaluated top-to-bottom (nothing fancy).
Selectors in the new spec are only allowed in lists - the idea is that we would have a special "ConditionalList" class in the implementation that looks for the if / else and takes care of them. In the initial version even nested if's would probably be forbidden. The goal was that we could make it more type-safe this way.
Regarding types in expressions ... I've looked around in MiniJinja and it seems fairly hard to make it type-safe or "typed". Maybe we could do it. I am not sure I want it as a requirement at this point as it might hold up the implementation for too long. Maybe in the future we should actually think about a proper language (like Nickel or so).
How about we move from cmp to a cfg function? That might be similar to how Rust / Cargo handles additional target dependent configuration like cfg(python = ">3.10") could be the syntax that we use. In Rust you can use additional operators like cfg(all(python="3.10", numpy=">1.5")). Although I cannot promise that we can stick that into Jinja (minijinja)... The other way around (all(cfg(python="3.10"), cfg(numpy=">1.5"))) might be easier to implement.

I'll continue later!

wolfv commented 1 year ago

OK, continuation:

The point you raise about YAML and YAML being complex: totally agree. I don't think anybody uses (and I also don't understand) anchors and multi-documents so I am happy to explicitly forbid them. However, I do think there are places where integers are required (build number). Floating point numbers might indeed not appear in a recipe. But I do think that we can (should) enforce versions to be quoted as strings. With Rust, it's quite easy to enforce with serde_yaml, and our pydantic types also do it (https://github.com/prefix-dev/recipe-format it became even better with pydantic 2!).

As you alluded to, I am open to defer the final decision on the cmp function (or cfg or something else entirely). So I am considering to remove it from the spec for now in order to move past the initial spec.

One last point that could be worthwhile to discuss is wether we want to take more inspiration from wolfi-os / melange, where packages look a bit more like Github Actions: https://github.com/wolfi-dev/os/blob/main/libidn2.yaml I really like it, but at the same time the hurdle to port existing recipes will be higher and the implementation is a bit different ... but in the future we could have a pipeline.yaml that does it that way :)

dholth commented 1 year ago

I'm excited about the idea of a more parseable recipe format.

My first thought is that I'm not sure conda-build's special jinja2 functions are well-documented?

Too bad the existing format isn't versioned.

I'm more interested in parsing info/recipe/meta.yaml inside a .conda. The original yaml+jinja2 recipe becomes info/recipe/meta.yaml.template.

How is our support for a separation between the recipe renderer and the rendered-recipe builder?

wolfv commented 11 months ago

Based on feedback received, I have just removed mentions to the cmp function. We should have a seperate discussion about it.

I am ready to call for a vote on this CEP! :)

wolfv commented 11 months ago

Not sure what to do but I just updated the dates and "changed" the status to proposed which means that the vote is open!

I mentioned it in the internal Gitter, but our implementation of the new format is pretty complete and works nicely in rattler-build. We are also in the process of updating its documentation to reflect the format changes we discussed here and that are now implemented.

jezdez commented 11 months ago

@conda-incubator/steering

This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy, please vote and/or comment on this proposal at your earliest convenience.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please leave yes, no or abstain as comments below.

If you have questions concerning the proposal, you may also leave a comment or code review.

This vote will end on 2023-10-27, End of Day, Anywhere on Earth (AoE).

baszalmstra commented 11 months ago

yes

jaimergp commented 11 months ago

Yes.

jezdez commented 11 months ago

yes

ocefpaf commented 11 months ago

yes

xhochy commented 11 months ago

yes

wolfv commented 11 months ago

yes

beckermr commented 11 months ago

my vote doesn't count, but I am 100% supportive of this and very excited to see it moving ahead!

marcelotrevisani commented 11 months ago

Yes! 🚀

msarahan commented 11 months ago

Great work on this. Yes.

kkraus14 commented 11 months ago

Yes

mariusvniekerk commented 11 months ago

yes

goanpeca commented 11 months ago

Yes

wolfv commented 11 months ago

Wow, I am counting 11 yes votes (out of 15 council members) which appears to me as if the proposal is accepted!

jezdez commented 11 months ago

@wolfv We should wait until the end of the voting period to tally all votes, FWIW

mbargull commented 11 months ago

Yes.

.. But ideally with the main points from https://github.com/conda-incubator/ceps/pull/54#pullrequestreview-1515301216 addressed. I'll try to compose a changeset and issue a pull request against this one (if not done by tomorrow, it will have to wait for a subsequent CEP).

wolfv commented 11 months ago

I guess we can now celebrate that this has been accepted?! :) :tada: I'll update and merge the CEP tomorrow with the appropriate number and status.

marcelotrevisani commented 11 months ago

I guess we can now celebrate that this has been accepted?! :) 🎉 I'll update and merge the CEP tomorrow with the appropriate number and status.

Can I start to modify souschef and grayskull already? 😄

wolfv commented 11 months ago

Voting Results

This was a standard, non-timed-out vote.

Among Steering Council members there are 12 "yes", no "no", and no abstentions. This vote has reached quorum (at least 60% of 15).

It has passed since it recorded 12 "yes" votes and 0 "no" (80% yes out of 15 eligible voters).

dholth commented 11 months ago

Does this diverge from the rattler book - sel(osx): somepkg instead of - somepkg # [osx] , Jinja string interpolation needs to be quoted at the beginning of a string, e.g. - "{{ version }}"

baszalmstra commented 11 months ago

Does this diverge from the rattler book - sel(osx): somepkg instead of - somepkg # [osx] , Jinja string interpolation needs to be quoted at the beginning of a string, e.g. - "{{ version }}"

@dholth Yes it does, that is not correct. That should have been updated in the book, can you point me to where it says that?

dholth commented 11 months ago

https://prefix-dev.github.io/rattler-build/recipe_file.html

conda / ceps

A new YAML based format for "conda-build" files #54

Voting Results