StatCan / datascience-cookiecutter

A Cookiecutter template for Data Science Projects in Python
MIT License

Replace Makefile with cross platform solution #65

Open goatsweater opened 11 months ago

goatsweater commented 11 months ago

We currently have a Makefile that lets users perform certain repetitive steps more easily than doing them manually. Six make targets are currently defined (plus help). Unfortunately, make is not available out of the box on Windows, which is one of the primary places the cookiecutter gets used.

In certain teams there has been a push to move towards task files. While this solves the cross-platform support issue, getting access to the task tooling remains an issue on corporate desktops. pydoit seems to solve the problem of cross-platform support while maintaining a relatively straightforward interface and offering functionality equivalent to the existing Makefile.

ToucheSir commented 10 months ago

Another tool I've seen mentioned but never personally used is Snakemake. Similar idea to pydoit, so I guess it comes down to what's most convenient for the cookiecutter.

goatsweater commented 10 months ago

It looks like a cool tool, and widely used in the genomics field from the looks of it. Saw this in the install instructions though:

Instead of conda, snakemake can be installed with pip. However, note that snakemake has non-python dependencies, such that the pip based installation has a limited functionality if those dependencies are not manually installed in addition.

From skimming the conda dependencies it looks like maybe this relates to external storage services like Dropbox that we won't use, but we'd need to look into that more to be sure.

ToucheSir commented 10 months ago

Good to know. I notice the syntax isn't quite Python either, which would be a bit of a learning curve for Python programmers to pick up. Whether that divergence brings any benefits I'm not sure.

goatsweater commented 10 months ago

In an attempt to inform the discussion I tried to use both options from a virtual environment on AVD with Python 3.10, recreating only a small subset of the overall Makefile contents via copy/paste. I'm not convinced copy/paste is the best implementation in either tool, but it was the easiest to do.

snakemake

Unable to use/test due to missing C++ compiler to build dependencies.

pip install snakemake

...
Successfully built snakemake connection-pool stopit
Failed to build datrie
ERROR: Could not build wheels for datrie, which is required to install pyproject.toml-based projects

doit

Installed via pip install doit.

dodo.py (Makefile equivalent) contents:

"""doit automation tasks."""

def task_requirements():
    """update conda environment based on environment.yml"""
    return {
        "actions": ["conda env update --prune --file environment.yml"]
    }

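def task_pre_commit():
    """Install pre-commit hooks in git."""
    return {
        "actions": ["pre-commit install"],
        # The hook file is produced by this task, so it is a target;
        # as a file_dep, doit would fail while the file does not exist yet.
        "targets": [".git/hooks/pre-commit"]
    }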

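def task_messages():
    """Extract translatable messages for translation team."""
    return {
        "actions": [
            # -M needs a source and an output directory (./docs/_build is assumed).
            "sphinx-build -M gettext ./docs ./docs/_build",
            # Each action runs in its own subprocess, so a separate "cd" is lost.
            "cd ./docs && sphinx-intl update -p _build/gettext -l fr",
        ]
    }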

doit execution

> doit list
WARNING: File "pyproject.toml" might contain doit configuration,but a TOML parser is not available.
        Please install one of: tomllib, tomli, tomlkit.
messages       Extract translatable messages for translation team.
pre_commit     Install pre-commit hooks in git.
requirements   update conda environment based on environment.yml

Side effects

doit creates the following files, which would need to be ignored by git. doit uses file hashes to track dependencies, whereas make checks file existence and modification times.

-a--- 11/1/2023 1:39 PM 0 .doit.db.dat
-a--- 11/1/2023 1:39 PM 0 .doit.db.dir

ToucheSir commented 10 months ago

The pure Python-ness of doit is making a pretty compelling case here. In my experience with other tools, hash-based tracking can be hit or miss for data files, but for source code and smaller resource files it works pretty well.

goatsweater commented 10 months ago

I agree, the hashing isn't my favourite part. I do like that doit makes it easy to override the hashes and run the task anyway. Also, simply not listing dependencies makes a task run every time, although doit will still create its stub DBs.

One benefit I find in doit is that, because it is pure Python, you don't have to shell out all the time. If there's a Python function you want to call, you can just tell it to call that. Given that a lot of what we want to do is either file system work or part of a Python package, it should be easy to find non-shell alternatives and thus ensure greater system portability.
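
As a minimal sketch of that point, a task's action can be a plain Python callable rather than a shell string. The task name `clean_build`, the helper `remove_build_dirs`, and the `build`/`dist` directory names are hypothetical and not taken from the existing Makefile.

```python
"""Sketch: a doit task whose action is a Python callable (no shell involved)."""
import shutil
from pathlib import Path

# Hypothetical directory names; adjust to whatever the project actually generates.
BUILD_DIRS = ("build", "dist")


def remove_build_dirs(root="."):
    """Delete the build directories under root, if present."""
    for name in BUILD_DIRS:
        path = Path(root) / name
        if path.is_dir():
            shutil.rmtree(path)
    # Returning None (or True) tells doit the action succeeded.


def task_clean_build():
    """Remove build artifacts without relying on rm or rmdir."""
    return {
        # A tuple of (callable, positional args, keyword args) is one way
        # to pass arguments to a Python action.
        "actions": [(remove_build_dirs, [], {"root": "."})],
    }
```

Because the action only uses `pathlib` and `shutil`, the same task should behave the same way on Windows and Linux, which a shell command such as `rm -rf` would not.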

We currently expect the user to sort out getting access to make, but I think if we're going to adopt something we can install on their behalf we should. For example, adding doit to the base environment file so it becomes immediately available to them on setup.

asolis commented 8 months ago

Chipping in to the conversation: I looked at the documentation and samples for doit. The current use cases of the Makefile can be translated to doit. We basically use the Makefile to group and automate scripts, with a few tasks and file dependencies. Doit implements both types of dependency: task and file. Doit uses Popen to execute shell commands as subprocesses, so anything that can be executed in the host OS terminal will work. If a CLI has the same signature and parameters on each OS, the same action will run everywhere. Where it does not, some of the current script logic in the Makefile will have to be replaced with Python.
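
A rough sketch of the two dependency types, reusing commands from the dodo.py earlier in this thread. Making `messages` depend on `requirements` is an assumption made purely for illustration, and the `sphinx_version` task is hypothetical.

```python
"""Sketch: file and task dependencies in a dodo.py."""


def task_requirements():
    """update conda environment based on environment.yml"""
    return {
        "actions": ["conda env update --prune --file environment.yml"],
        # File dependency: the task is re-run when this file has changed
        # since the last successful run.
        "file_dep": ["environment.yml"],
    }


def task_messages():
    """Extract translatable messages for translation team."""
    return {
        "actions": ["cd ./docs && sphinx-intl update -p _build/gettext -l fr"],
        # Task dependency (assumed for illustration): run requirements first.
        "task_dep": ["requirements"],
    }


def task_sphinx_version():
    """Hypothetical task showing a command given as a list of arguments."""
    return {
        # A list of strings is run without a shell, which sidesteps
        # shell-specific quoting differences between operating systems.
        "actions": [["sphinx-build", "--version"]],
    }
```

With this layout, `doit messages` would run `requirements` first, and `requirements` itself would be skipped while `environment.yml` is unchanged.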

I agree with you both that hashing, and especially MD5, is not the best solution, but doit also provides a way to replace it and implement your own check if needed, through the "uptodate" attribute.
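
A minimal sketch of such a custom check, assuming we want the pre-commit task to count as up to date whenever the hook file already exists. The helper `file_exists` is hypothetical and not part of doit.

```python
"""Sketch: a custom up-to-date check instead of doit's default file hashing."""
from pathlib import Path


def file_exists(path):
    """Build a zero-argument check that is True when path exists (hypothetical helper)."""
    def check():
        return Path(path).exists()
    return check


def task_pre_commit():
    """Install pre-commit hooks in git."""
    return {
        "actions": ["pre-commit install"],
        # doit calls the check before running the task and skips the task
        # when it returns True, here once the hook file is in place.
        "uptodate": [file_exists(".git/hooks/pre-commit")],
    }
```

This trades the content hash for a simple existence test, which is closer to how a make target without prerequisites behaves.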

I also looked at other make equivalents; all of them have a learning curve, and they require installing separate system tools for each OS, which is the same problem we have with make on Windows. Doit is generic, "simple" to use, well documented, and written in Python (a dependency that's already met as part of the project).

The upside:

The downside:

It all comes down to whether you would like to adopt it instead of make, removing the burden of requiring make on Windows in favour of a dependency that is self-contained in Python.