StatCan / datascience-cookiecutter

A Cookiecutter template for Data Science Projects in Python
MIT License

Another solution to multiple version files without creating top-level packages. #30

Closed asolis closed 2 years ago

asolis commented 2 years ago
  1. Another solution to handle multiple files containing the project version number.
  2. It also makes sure that the Makefile is used without installing any package.
  3. Excluding conf.py from flake8.

Instead of creating a top-level package whose __init__.py holds the package version, we create a version text file at the project root containing the version number. Both setup.cfg and the Makefile read this file to supply the version number to Sphinx and to the package build.
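As a rough illustration of the idea (not the code in this PR), a Sphinx conf.py could read the version from such a file with a small helper. The helper name read_version and the sample value 0.1.0 are assumptions made for this sketch; the file name VERSION matches the one mentioned later in the thread. On the packaging side, setuptools accepts `version = file: VERSION` under `[metadata]` in setup.cfg.

```python
import tempfile
from pathlib import Path


def read_version(project_root: Path, filename: str = "VERSION") -> str:
    """Return the version string stored in a plain-text file at the project root."""
    # Strip the trailing newline that editors and `echo` usually add.
    return (project_root / filename).read_text(encoding="utf-8").strip()


# Demo with a throwaway project root. In a real docs/conf.py this would look
# more like: release = read_version(Path(__file__).resolve().parents[1])
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "VERSION").write_text("0.1.0\n", encoding="utf-8")  # sample value
    release = read_version(root)

print(release)
```

Because the file is plain text, the same value can be read from a Makefile or any other tool without importing Python code.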

I also added a few configurations that I believe should be set by default.

If this solution is approved, the other pull request #29 should be discarded. Just for reference #29 provides a similar solution but assumes that we will always have a top-level package folder initialized at creation.

@ToucheSir

goatsweater commented 2 years ago

If the path forward is a separate file I believe common practice is a __version__.py file that lives alongside the rest of the code. This file can then be imported or read from other places.

A potentially useful way of dealing with versions is to mirror the sys module and provide both a version_info tuple and a version string (just a concatenation of the tuple).

ToucheSir commented 2 years ago

Ok, this turned out to be trickier than I thought. The main mismatch is that setuptools is configured to search for an arbitrary number of packages under src, but Sphinx as configured assumes a single version for the entire repo. The open source projects I know of haven't been very helpful because they follow a 1 package per repo, no src dir model, while the UK cookiecutter doesn't use setuptools.

For scenarios where we could guarantee a 1-1 correspondence, https://stackoverflow.com/a/60430731 looks promising. In short, have packages source __version__ from the config instead of the other way around.

asolis commented 2 years ago

This PR and PR #29 follow two of the three solutions mentioned in the Stack Overflow answer linked above. It all comes down to whether you want to provide a top-level package or not. A top-level package assumes Python as the programming language, and I'd say creating one could be an initial step, just as a starting point. Any other requirement could simply be mentioned in the docs as guidelines.

Having no top-level package is the more general solution, and an extra text file is the easiest one. Documenting how to extend it should be all that is necessary.

No matter what you choose, it won’t accommodate all the scenarios. For example, for the Hydra modules I will create a namespace package structure. I will allow multiple projects to develop submodules of a top-level package: hydra.modules.[project_name] or dsd.hydra.modules.[project_name] (I still haven’t decided on the top namespace). In either case my “version” file will point to dsd.hydra.modules.[project_name].version. I don’t want you to support this in particular, because knowing which top-level package or sub-package I want to version is not trivial and is different for everyone.

I think it’s a good start if you assume a 1-to-1 correspondence and leave it to the developer to point it to a different module. Or just use an extra file or metadata to set up the correct module version. For multiple submodules, leave it to the developer. I, in particular, will let people create a separate project for each submodule of dsd.hydra.modules, making it 1-to-1 once again.

I can elaborate more if it wasn’t clear.

ToucheSir commented 2 years ago

#29 is similar, but the direction of the dependency arrow is reversed. What I'm describing here is __version__.py reading the version from metadata stored in setup.cfg, which itself can either have it inline or read it from an external source like the VERSION file in this PR.

But that part is not terribly important to me. I think the bigger question is what the version selector in the rendered sphinx docs should show for each of the following scenarios:

  1. A project with a single package. Think Numpy.
  2. A project with multiple packages, but which maintains identical versions for each package.
  3. A project with multiple packages, where package versions are not synchronized, e.g. dsd.foo is 1.1 while dsd.bar is 1.2.

(auto-generated docs for R projects is its own can of worms that I'll leave off the table for now)

I'm not sure if Hydra falls under 2) or 3), but I presume we will have to support all of these workflows at some point. The path of least resistance is to use the short/abbreviated commit hash as I believe the UK cookiecutter does, but I imagine there are potential UX concerns there.
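For reference, the commit-hash option could be sketched roughly as follows. This is only an illustration of the idea, not what the UK cookiecutter actually does; the function name and the fallback value are assumptions.

```python
import subprocess


def short_commit_hash(repo_dir: str = ".", fallback: str = "unknown") -> str:
    """Return the abbreviated hash of HEAD, or a fallback outside a usable git repo."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory does not exist, or it is not a repository.
        return fallback
    return result.stdout.strip() or fallback


# In docs/conf.py this value could be assigned to `release` or `version`.
print(short_commit_hash())
```

A hash identifies the exact source state but carries no ordering, which is part of the UX concern mentioned above.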

asolis commented 2 years ago

Hydra will be similar to point number 3, but the same project will not contain the multiple packages.

A project hydra-module-ocr will contain only the OCR sub-package, dsd.hydra.module.ocr, and will keep a single version, dsd.hydra.module.ocr.__version__. All code in this project will be under that one version (the one stored in the submodule). Another project, hydra-module-tabular, will contain another sub-package, dsd.hydra.module.tabular, with its version at dsd.hydra.module.tabular.__version__. All code in that project will likewise be under one version (the one stored in the submodule).

In my opinion, as a starting point, the template should produce a project with a single version for all of its packages and sub-packages, i.e. point 2.

ToucheSir commented 2 years ago

So just to clarify, would you create a single docs site for dsd.hydra.module.ocr and dsd.hydra.module.tabular? Or is each of those getting its own Git(Hub|Lab) pages? If the former, would the version selector show dsd.hydra.module.ocr.__version__, dsd.hydra.module.tabular.__version__ or neither?

asolis commented 2 years ago

Each of them will get its own Git(Hub|Lab) pages. Each module will be developed independently of the others. The version selector for each project will show dsd.hydra.module.ocr.__version__ and dsd.hydra.module.tabular.__version__ respectively.

ToucheSir commented 2 years ago

Ok, sounds like an interesting monorepo layout. I'll be interested to see what shape it takes :)

asolis commented 2 years ago

Following a similar structure to this:

hydra-module-a/
    setup.cfg
    src/dsd/hydra/module/
        subpackage_a/
            __init__.py

hydra-module-b/
    setup.cfg
    src/dsd/hydra/module/
        subpackage_b/
            __init__.py

Each sub-package can now be separately installed, used, and versioned. I already have an initial setup for testing here: https://gitlab.com/dsd4/hydra
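To show why this layout lets the two projects share the dsd.hydra.module prefix, here is a small self-contained sketch that builds both trees in a temporary directory and imports them as implicit namespace packages (PEP 420). The helper make_project and the version numbers are made up for the demo. In a real project the sub-packages would be installed rather than added to sys.path, and setup.cfg would typically use `packages = find_namespace:` with `where = src`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_project(root: Path, project: str, subpackage: str, version: str) -> Path:
    """Create <project>/src/dsd/hydra/module/<subpackage>/__init__.py and return src."""
    src = root / project / "src"
    pkg = src / "dsd" / "hydra" / "module" / subpackage
    pkg.mkdir(parents=True)
    # Only the leaf sub-package has an __init__.py; dsd, dsd.hydra and
    # dsd.hydra.module stay implicit namespace packages shared across projects.
    (pkg / "__init__.py").write_text(f'__version__ = "{version}"\n', encoding="utf-8")
    return src


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src_a = make_project(root, "hydra-module-a", "subpackage_a", "1.1")
    src_b = make_project(root, "hydra-module-b", "subpackage_b", "1.2")
    added = [str(src_a), str(src_b)]
    sys.path[:0] = added  # stand-in for installing the two projects
    importlib.invalidate_caches()
    try:
        mod_a = importlib.import_module("dsd.hydra.module.subpackage_a")
        mod_b = importlib.import_module("dsd.hydra.module.subpackage_b")
        versions = (mod_a.__version__, mod_b.__version__)
    finally:
        # Remove the temporary entries so later imports are unaffected.
        sys.path[:] = [p for p in sys.path if p not in added]

print(versions)
```

Each sub-package carries its own __version__, so the two projects can be released on independent schedules.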

asolis commented 2 years ago

*hydra-module-template will be a fork (for now it’s just disconnected). Each of its child projects will be connected with cruft to sync changes from upstream.

Btw, @goatsweater, cookiecutter has released version 2.1+ (the only feature I would like to use is private variables), but the latest version of cruft only supports cookiecutter > 1.6 and < 2.0. I like cruft for its simplicity, but maybe another tool provides better support for the latest versions of cookiecutter.

goatsweater commented 2 years ago

Thinking about how the cloud native team built out their default CI pipeline: they use git tags as a versioning mechanism. Not that we can't alter it or even use a completely different template. My point is merely that so many things are set up to assume a single project per repo that I don't think we're introducing undue burden by defaulting to the same assumptions.

Another thing I hadn't thought of is that __version__.py is Python-specific, whereas a plain version text file could be used by R scripts as well (I think; I haven't tested it). Using the method pointed out in the Stack Overflow answer, where setup.cfg reads in the version file and other Python code imports the metadata at run time, seems like the most flexible solution to me.

goatsweater commented 2 years ago

Came across some guidance on having a single top level package over multiple: https://peps.python.org/pep-0423/#use-a-single-name.

@ToucheSir I think this further supports that while setup.cfg will support finding multiple packages in the current setup, that is a coincidence and we can cater to the single package route.