check and unify use of extras

hollandjg commented 2 weeks ago

we've a mixture of test, docs, dev extras in the episodes.

We need to be able to deal with:

Just installing test frameworks for running on GitHub actions as quickly as possible
Installing pre-commit, documentation tools and testing tools in one go for developers.

Option 1: Modular

Extras:

test with pytest
docs with mkdocs, mkdocstrings or sphinx etc.
dev with package[test, docs] and pre-commit

Pros: each piecce is compartmentalized Cons: Users need to understand a whole zoo of extras.

Option 2: all in one

Extras:

dev with pytest, mkdocs, pre-commit ... everything

Pros: simple Cons: test runners must install everything.

henryiii commented 2 weeks ago

FYI, there's discussion on "docs" vs. "doc" happening on a SPEC - it turn out, "doc" is the standard for this, while "docs" is ~3x more popular. "test" is the standard for testing deps, and also the most popular (vs. "test"). https://packaging.python.org/en/latest/specifications/core-metadata/#provides-extra-multiple-use

Anyway, I strongly disagree with option 2, as it's incorrect. It makes the environment hard to solve, sometimes impossible to solve. Solvers are very sensitive to the number of dependencies in an environment. There was a time when you couldn't install black and tensorflow in the same environment due to dependency conflicts. I've seen solvers choke for minutes[^1] trying to solve updates on massive "all-in-one" environments. And you never need to install your code formatter, or pre-commit, etc. in the same environment as your package, they are stand-alone tools!

Personally, removing [dev] and never even hinting that installing tools into a single environment would be fine by me.

uv is currently working on a solution for project-level "pipx"-like installs.

[^1]: On average. The record I've seen was 25 hours.

henryiii commented 2 weeks ago

Also, this is a reason to use task runners. They handle all of this and keep a user from trying to make a "all-in-one" environment.

hollandjg commented 2 weeks ago

FYI, there's discussion on "docs" vs. "doc" happening on a SPEC - it turn out, "doc" is the standard for this, while "docs" is ~3x more popular. "test" is the standard for testing deps, and also the most popular (vs. "test"). https://packaging.python.org/en/latest/specifications/core-metadata/#provides-extra-multiple-use

Was not aware of that, thanks for the heads up!

hollandjg commented 2 weeks ago

... you never need to install your code formatter, or pre-commit, etc. in the same environment as your package, they are stand-alone tools! Also, this is a reason to use task runners. They handle all of this and keep a user from trying to make a "all-in-one" environment.

I think this is the best argument for task runners, and I think I'm starting to understand some more of the decisions you made when writing this lesson.

It seems to me that one of the functions of the pyproject.toml file is as a single source of truth for project dependencies – you have the build dependencies in there too (though in a separate section) – and many tool configurations – pytest, ruff etc. So implicitly, the dependencies are in there, but if we only declare them as dependencies (and install them) in the task runner script, that feels surprising.

Is there a better way you know of, of handling this without using an additional tool, just using the standard fields of the pyproject.toml?
If we decide to allow adding an additional tool, is there a reason to avoid something like hatch environments with their own separate dependencies?

It seems like the pyproject.toml needs a standard way of declaring separate environments distinct from the package extras, so that you don't have to rewrite the configuration each time you decide to change your toolset. (Discussion here https://discuss.python.org/t/adding-a-non-metadata-installer-only-dev-dependencies-table-to-pyproject-toml/20106)

BTW, I'm not planning on attempting to merge any of the changes around the task runners and environment set up until we've reached consensus here! Thanks for your patience and all your comments!

henryiii commented 2 weeks ago

There are task runners that use pyproject.toml (specifically, hatch, which is becoming very good). But having a Python based runner is really great for two reasons: first, it's just running commands from Python, so very little DSL to learn and clearly displays what it is doing, and second, if you need to put any logic in, writing it in Python vs. a hobbled "shell"-like configuration syntax is far better.

If you want to move to hatch, I'd not be against that, I've moved repo-review to hatch, and am currently using it for the Princeton NVIDIA hackathon, so getting a feel for how people who've never used any tool like it like it. It's not as flexible as nox for complex tasks, and its built-in defaults are interesting. It really does its own thing with hatch fmt unless you tell it not to, and hatch test adds a bunch of pytest plugins but makes running with them really easy. The next minor release will support auto-installing the [test] extra too. Some sort of doc or docs command is coming soon, and it will either be based on mkdocs or be selectable.

It seems like the pyproject.toml needs a standard way of declaring separate environments

I think you might be thinking of PEP 735.

Though I think the current situation isn't bad: you add test and doc(s) extras. In GitHub Actions and Nox and readthedocs, you install with the extras. Then you just need pre-commit, which does environments for you for most of your tools, like ruff, etc. Scripts can use PEP 723. That's 90% of the way there, and there are just a few advanced cases that need manual setup in the noxfile.py that aren't in pyproject.toml, like pylint if you want to use that.

uv is also activity working on solutions to this, based on what they'v learned with rye and from the community.

we only declare them as dependencies

Do you have an example of something only declared in a noxfile in the lesson?

henryiii commented 2 weeks ago

We also should probably have a little section on conda pointing people at pixi.

hollandjg commented 2 weeks ago

Do you have an example of something only declared in a noxfile in the lesson?

There's only a small mismatch:

The first time the noxfile.py is declared in https://github.com/INTERSECT-training/packaging/blob/main/_episodes/02-jupyter-to-package.md, we have the line session.install("numpy", "pytest").
That's implicitly changed in https://github.com/INTERSECT-training/packaging/blob/main/_episodes/04-metadata.md where dependencies = ["numpy>=1.18",] and [project.optional-dependenices]; test = ["pytest>=6"] are added to pyproject.toml but I don't see that the original explicit session.install gets updated anywhere. (I may be missing something though).

hollandjg commented 2 weeks ago

I think you might be thinking of PEP 735.

If I had known about PEP 735, I might have thought of it. 💯

henryiii commented 2 weeks ago

I think the mismatch could be improved. I think the idea was the first time it's shown, a minimal amount was being introduced; optional-dependencies haven't been shown yet. But maybe a callout with the fix below that would help?

hollandjg commented 2 weeks ago

Yes, maybe that's the cleanest – we can put in a note in the metadata episode once the optional dependencies are introduced saying "now modify your task runner config to use the optional dependency."

Another option might be to move the section on task runners until after the metadata have been introduced – so we cover the fundamental basics of the pyproject.toml file including optional-dependencies and only then add the task runner. What would you say to that?

I agree that from a pedagogical point of view it might be too much to go straight ahead and introduce the optional dependencies the first time we have to install something for testing.

My initial impulse is to say:

"here's your minimal setup to make a package: the code in the src directory, minimum project metadata in pyproject.toml with the numpy dependency, a test in the tests directory which you can run once you've manually installed pytest in the venv you've created but don't yet understand."

... and then we start introducing:

"why to keep your environment clean using venv",
"how to make it reproducible using a task runner like nox or hatch",
"how to manage dependencies using dependencies and optional-dependencies and how they interact with task runners"
"what other files your repo will benefit from like CITATION.cff, LICENSE etc."
and so on.

Ideally, RSEs would start by creating an environment and then starting work, but I think that realistically a lot of early career RSEs will start with the code and then tack on the other stuff afterwards.

On a related note, I think part of the tension with #12 is that I feel that people might be starting with the code and not know anything about environments yet, so I want to meet them there before we get into the motivations around the environment, but perhaps people do already know about environments but might not have a strong understanding yet, in which case it's better to start with the environments and have the whole lesson tell the story of "the clean way" to do a python project, starting with an environment and building from there.

henryiii commented 2 weeks ago

But then you have to introduce creating a virtual environment and entering a virtual environment. If you start with task runners, then you have a "test" task, even if you are going to modify it later, but you haven't had to "enter" an environment. And I expect a lot of people want to start with the code, and what they will actually do is start installing things into the system python, which I want to steer people away from. Once you start messing with your system python, it's hard to clean it back up and not break anything. People really like to stick with however they started; if the lesson starts with running things via tasks, they are more likely to think that way then if it start with the "easy" way of doing things.

The thing we settled on before was:

Here's how you run Python tasks. You don't need a package for this with nox.
Here's how you make a minimal setup for a package.
Here's how write a task to test your package. You don't need optional dependencies initially.
Etc.

A different pattern, starting with virtual environments, then packages, then task runners, isn't necessarily bad either (and it's how https://learn.scientific-python.org/development/ is set up), so if you are more comfortable that way, go for it. It's closer to how the tools were originally developed.

Ideally, RSEs would start by ..., but I think that realistically ...

Isn't the point of teaching someone something to change what they do? If we only do things the "easy" way and the way people are most likely to be familiar with already, are we really teaching anything?

It's like saying "we should teach pip install since everyone is using it already, and not teach pipx install". Which, to be fair, is exactly what some people do say.

henryiii commented 2 weeks ago

Also, I was thinking an admonition box below the place where the "wrong" way to do it is introduced showing the more advanced way with a link to the future lesson could be useful.

hollandjg commented 1 week ago

[ ] Add admonition below where we do the initial session.install in the noxfile.py pointing out that we're going to change it later.

INTERSECT-training / packaging

check and unify use of extras #67

Option 1: Modular

Option 2: all in one