Closed jmarshrossney closed 3 months ago
I concur with the "remove scivision, for now" take, and here's the longer, philosophical essay version of why: thank you for the PR. I'm happy to merge, waiting to do this in sequence starting before the layout changes.
`scivision`.
Regarding the sweeping changes to `intake` though, I thought the scivision codebase, and by extension this one, was already on the newer Take2. I've got mixed feelings about `intake` as well; it seems very powerful but almost too generic, in a way that means you have to relearn how to configure it slightly every time you work with it. It seemed full of promise at the start as an alternative to STAC for data without a spatio-temporal component (and I'd not come across it other than through its connection to `scivision`).
I couldn't get the `intake` interface to consume a whole s3 bucket using a wildcard in a path, or tell whether this was an overoptimistic assumption about what its interface would offer, or a storage permissions configuration issue that I'm not sure I need to understand better. So I effectively ended up bypassing `intake-xarray` in favour of reading the metadata as CSV and the image data separately, and this counteracts a lot of the benefits of choosing `intake`.
We do have spatio-temporal metadata available for this project now after #22, and I honestly can't think of other projects in the image machine learning line of work here that don't, so it could be a lot healthier to look again at the STAC approach and try to keep it standards-oriented. There are others within CEH whose expertise could be drawn on for an environmental sample STAC extension; some notes and links here: https://github.com/NERC-CEH/plankton_ml/issues/4
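For a sense of what the STAC approach would involve, here is a minimal sketch of a single STAC Item built as a plain dictionary with the standard library only. The item id, coordinates, timestamp and `s3://example-bucket/...` path are all made up for illustration, and the asset media type is an assumption; a real catalogue would more likely be built with a dedicated STAC library and whatever extension gets agreed on.

```python
import json
from datetime import datetime, timezone


def make_stac_item(item_id, lon, lat, when, image_href):
    """Build a minimal STAC Item (a GeoJSON Feature) describing one image.

    All argument values are supplied by the caller; nothing here is taken
    from the project's real metadata.
    """
    # STAC expects an RFC 3339 timestamp, conventionally in UTC.
    timestamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        # A point sample: the bounding box collapses to the same coordinate.
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "bbox": [lon, lat, lon, lat],
        "properties": {"datetime": timestamp},
        "links": [],
        "assets": {
            "image": {
                "href": image_href,
                "type": "image/tiff",  # assumed media type
                "roles": ["data"],
            }
        },
    }


# Hypothetical values, for illustration only.
item = make_stac_item(
    item_id="sample-0001",
    lon=-1.5,
    lat=54.0,
    when=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    image_href="s3://example-bucket/images/sample-0001.tif",
)
print(json.dumps(item, indent=2))
```

The appeal is that the spatial and temporal fields sit in fixed, standard places, so generic STAC tooling can search them without project-specific configuration.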
SORRY - I thought I could open a PR on my fork and then change the base repository after the fact. It turns out I can't (as far as I can see), and so I've had a whole conversation with myself over here on the original PR.
I am proposing to drop `scivision` as a dependency, but to keep an eye on how it develops in case we want to adopt their model-loading protocol in future.
Why?
Bluntly, there are no functional benefits to using scivision, for us, at this point in time.
As far as I can tell, the only reason to keep using it is that we support what they are trying to do and want to inject momentum into the project rather than remove it. That is a good reason, but given that we are in the very early stages of a project and don't quite know the direction we want to take it, I think it's reasonable for us to make this decision down the line.
Another reason for waiting is that `intake` (which `scivision` uses heavily) is currently going through a full-scale rewrite, and the maintainers say (see docs)
My instinct is to wait and see how this plays out and if/when `scivision` gets updated. [EDIT: my worry is that it just won't; see the observation at the end of this comment.]
Finally, the single model in the scivision catalogue that we are interested in locks us into a very old version of Python (3.9).
What changes
I forked the cefas/turing model, updated it to Python 3.12 and stripped out all the stuff that either didn't work or that we didn't need.
So at first approximation the only changes in the code should be that instead of loading the pretrained model from the original repo via `scivision.load_pretrained_model`, we instead just `pip install` the fork.
I also found that `scivision.load_dataset` can be trivially replaced by `intake.open_catalog` (see comment).
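As a rough illustration of the dataset side of that swap, the sketch below wraps `intake.open_catalog` in a small helper. The helper name `open_dataset_catalog`, the `opener` parameter and the `catalog.yml` path are hypothetical and not part of this PR; the import is deferred so the function can be exercised without `intake` installed.

```python
def open_dataset_catalog(catalog_path, opener=None):
    """Open a data catalogue directly with intake, without going via scivision.

    `opener` is a hypothetical hook for swapping in another loader (or a
    stub in tests); by default it falls back to `intake.open_catalog`.
    """
    if opener is None:
        import intake  # third-party; only imported when actually needed

        opener = intake.open_catalog
    return opener(catalog_path)


# Usage, assuming a catalogue file at this (made-up) path:
#   cat = open_dataset_catalog("catalog.yml")
#   print(list(cat))
```

Keeping the call behind one function means that, if `scivision` is picked up again later, only this helper would need to change.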