NERC-CEH / plankton_ml

A project for image processing and analysis pipelines for plankton sampling
GNU General Public License v3.0
0 stars 1 forks source link

Modify environment.yml #13

Closed jmarshrossney closed 4 months ago

jmarshrossney commented 4 months ago

Overview

This PR makes a small change to environment.yml so that

  1. Only direct dependencies are included
  2. Everything that can be installed using conda install is installed using conda install, instead of pip install
  3. chromadb is pinned to 0.5.3

I also modified the supported python versions in pyproject.toml to reflect the fact that we aren't supporting <=3.8 as far as I can tell.

Finally, I've slightly tweaked the installation instructions in README.md.

Dependencies

Issues like the one we've just encountered with chromadb 0.5.4 are going to occur, and it's in these situations where I always seem to end up rediscovering how brittle conda environments with mixed conda and pip dependencies are - see e.g. this article. Concretely, it took me about 5 attempts to create an environment that passed the tests by manually installing the packages in the existing environment.yml.

Acknowledging that there is a reasonable case for using conda here since there are several non-Python dependencies that we might one day want to exert control over (but like... will we?), I've changed environment.yml so as to use conda install instead of pip wherever possible.

I think the direct dependencies we need are

The rest are I think indirect.

Of the direct dependencies, only scivision and resnet50_cefas cannot be installed using conda install.

So to me a sensible environment.yml looks something like this

name: cyto_39
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pytorch=1.10.0
  - mkl=2024.0  # see https://github.com/pytorch/pytorch/issues/123097
  - chromadb=0.5.3
  - intake-xarray
  - scikit-image  # intake-xarray dependency
  - pandas
  - pytest
  - python-dotenv
  - s3fs
  - pip
  - pip:
    - scivision
    - git+https://github.com/alan-turing-institute/plankton-cefas-scivision@main

If I then run the following in the root of the repository,

$ conda env create -f environment.yml
$ conda activate cyto_39
$ python -m pip install .
$ python -m pytest

I get 1 failed, 4 passed, 9 warnings in 4.74s where the failed test is just the device mismatch mentioned in #5, and easily fixable.

Notes

I also have channel_priority set to strict in my .condarc, which is now recommended.

Some of the failed attempts resulted in IOError: [Errno 9] Bad file descriptor for reasons I failed to understand.

metazool commented 4 months ago

The changes look good and run clean for me locally.

I'm puzzled by the coverage workflow error. You can see the coverage report generates fine. "Resource not accessible by integration" is ostensibly often a permissions issue, but our project settings show the correct level of read-write

I'm guessing this is because you appear as an outside contributor (though with permissions inherited from group membership) the workflow is failing at the point of posting the coverage report as an issue comment. Can try adding you as a direct contributor, also welcome to make branches and PRs directly in this project rather than via a fork. Happy to approve anyway

metazool commented 4 months ago

Going to merge this on the grounds that the failing check looks like an artefact of unexpected behaviour of proprietary CI, can always back out if that's somehow not the case