INTERSECT-training / packaging

Module introducing Python packaging and distribution
https://intersect-training.org/packaging/
Other
3 stars 3 forks source link

check and unify use of extras #67

Open hollandjg opened 2 weeks ago

hollandjg commented 2 weeks ago

we've a mixture of test, docs, dev extras in the episodes.

We need to be able to deal with:

Option 1: Modular

Extras:

Pros: each piecce is compartmentalized Cons: Users need to understand a whole zoo of extras.

Option 2: all in one

Extras:

Pros: simple Cons: test runners must install everything.

henryiii commented 2 weeks ago

FYI, there's discussion on "docs" vs. "doc" happening on a SPEC - it turn out, "doc" is the standard for this, while "docs" is ~3x more popular. "test" is the standard for testing deps, and also the most popular (vs. "test"). https://packaging.python.org/en/latest/specifications/core-metadata/#provides-extra-multiple-use

Anyway, I strongly disagree with option 2, as it's incorrect. It makes the environment hard to solve, sometimes impossible to solve. Solvers are very sensitive to the number of dependencies in an environment. There was a time when you couldn't install black and tensorflow in the same environment due to dependency conflicts. I've seen solvers choke for minutes[^1] trying to solve updates on massive "all-in-one" environments. And you never need to install your code formatter, or pre-commit, etc. in the same environment as your package, they are stand-alone tools!

Personally, removing [dev] and never even hinting that installing tools into a single environment would be fine by me.

uv is currently working on a solution for project-level "pipx"-like installs.

[^1]: On average. The record I've seen was 25 hours.

henryiii commented 2 weeks ago

Also, this is a reason to use task runners. They handle all of this and keep a user from trying to make a "all-in-one" environment.

hollandjg commented 2 weeks ago

FYI, there's discussion on "docs" vs. "doc" happening on a SPEC - it turn out, "doc" is the standard for this, while "docs" is ~3x more popular. "test" is the standard for testing deps, and also the most popular (vs. "test"). https://packaging.python.org/en/latest/specifications/core-metadata/#provides-extra-multiple-use

Was not aware of that, thanks for the heads up!

hollandjg commented 2 weeks ago

... you never need to install your code formatter, or pre-commit, etc. in the same environment as your package, they are stand-alone tools! Also, this is a reason to use task runners. They handle all of this and keep a user from trying to make a "all-in-one" environment.

I think this is the best argument for task runners, and I think I'm starting to understand some more of the decisions you made when writing this lesson.

It seems to me that one of the functions of the pyproject.toml file is as a single source of truth for project dependencies – you have the build dependencies in there too (though in a separate section) – and many tool configurations – pytest, ruff etc. So implicitly, the dependencies are in there, but if we only declare them as dependencies (and install them) in the task runner script, that feels surprising.

It seems like the pyproject.toml needs a standard way of declaring separate environments distinct from the package extras, so that you don't have to rewrite the configuration each time you decide to change your toolset. (Discussion here https://discuss.python.org/t/adding-a-non-metadata-installer-only-dev-dependencies-table-to-pyproject-toml/20106)

BTW, I'm not planning on attempting to merge any of the changes around the task runners and environment set up until we've reached consensus here! Thanks for your patience and all your comments!

henryiii commented 2 weeks ago

There are task runners that use pyproject.toml (specifically, hatch, which is becoming very good). But having a Python based runner is really great for two reasons: first, it's just running commands from Python, so very little DSL to learn and clearly displays what it is doing, and second, if you need to put any logic in, writing it in Python vs. a hobbled "shell"-like configuration syntax is far better.

If you want to move to hatch, I'd not be against that, I've moved repo-review to hatch, and am currently using it for the Princeton NVIDIA hackathon, so getting a feel for how people who've never used any tool like it like it. It's not as flexible as nox for complex tasks, and its built-in defaults are interesting. It really does its own thing with hatch fmt unless you tell it not to, and hatch test adds a bunch of pytest plugins but makes running with them really easy. The next minor release will support auto-installing the [test] extra too. Some sort of doc or docs command is coming soon, and it will either be based on mkdocs or be selectable.

It seems like the pyproject.toml needs a standard way of declaring separate environments

I think you might be thinking of PEP 735.

Though I think the current situation isn't bad: you add test and doc(s) extras. In GitHub Actions and Nox and readthedocs, you install with the extras. Then you just need pre-commit, which does environments for you for most of your tools, like ruff, etc. Scripts can use PEP 723. That's 90% of the way there, and there are just a few advanced cases that need manual setup in the noxfile.py that aren't in pyproject.toml, like pylint if you want to use that.

uv is also activity working on solutions to this, based on what they'v learned with rye and from the community.

we only declare them as dependencies

Do you have an example of something only declared in a noxfile in the lesson?

henryiii commented 2 weeks ago

We also should probably have a little section on conda pointing people at pixi.

hollandjg commented 2 weeks ago

Do you have an example of something only declared in a noxfile in the lesson?

There's only a small mismatch:

hollandjg commented 2 weeks ago

I think you might be thinking of PEP 735.

If I had known about PEP 735, I might have thought of it. πŸ’―

henryiii commented 2 weeks ago

I think the mismatch could be improved. I think the idea was the first time it's shown, a minimal amount was being introduced; optional-dependencies haven't been shown yet. But maybe a callout with the fix below that would help?

hollandjg commented 2 weeks ago

Yes, maybe that's the cleanest – we can put in a note in the metadata episode once the optional dependencies are introduced saying "now modify your task runner config to use the optional dependency."

Another option might be to move the section on task runners until after the metadata have been introduced – so we cover the fundamental basics of the pyproject.toml file including optional-dependencies and only then add the task runner. What would you say to that?

I agree that from a pedagogical point of view it might be too much to go straight ahead and introduce the optional dependencies the first time we have to install something for testing.

My initial impulse is to say:

... and then we start introducing:

Ideally, RSEs would start by creating an environment and then starting work, but I think that realistically a lot of early career RSEs will start with the code and then tack on the other stuff afterwards.

On a related note, I think part of the tension with #12 is that I feel that people might be starting with the code and not know anything about environments yet, so I want to meet them there before we get into the motivations around the environment, but perhaps people do already know about environments but might not have a strong understanding yet, in which case it's better to start with the environments and have the whole lesson tell the story of "the clean way" to do a python project, starting with an environment and building from there.

henryiii commented 2 weeks ago

But then you have to introduce creating a virtual environment and entering a virtual environment. If you start with task runners, then you have a "test" task, even if you are going to modify it later, but you haven't had to "enter" an environment. And I expect a lot of people want to start with the code, and what they will actually do is start installing things into the system python, which I want to steer people away from. Once you start messing with your system python, it's hard to clean it back up and not break anything. People really like to stick with however they started; if the lesson starts with running things via tasks, they are more likely to think that way then if it start with the "easy" way of doing things.

The thing we settled on before was:

A different pattern, starting with virtual environments, then packages, then task runners, isn't necessarily bad either (and it's how https://learn.scientific-python.org/development/ is set up), so if you are more comfortable that way, go for it. It's closer to how the tools were originally developed.

Ideally, RSEs would start by ..., but I think that realistically ...

Isn't the point of teaching someone something to change what they do? If we only do things the "easy" way and the way people are most likely to be familiar with already, are we really teaching anything?

It's like saying "we should teach pip install since everyone is using it already, and not teach pipx install". Which, to be fair, is exactly what some people do say.

henryiii commented 2 weeks ago

Also, I was thinking an admonition box below the place where the "wrong" way to do it is introduced showing the more advanced way with a link to the future lesson could be useful.

hollandjg commented 1 week ago