Drop scivision - Githubissues

SORRY - I thought I could open a PR on my fork and then change the base repository to after the fact. It turns out I can't (as far as I can see) and so I've had a whole conversation with myself over here on the original PR.

I am proposing to drop scivision as a dependency, but to keep an eye on how it develops in case we want to adopt their model-loading protocol in future.

Why?

Bluntly, there are no functional benefits to using scivision, for us, at this point in time.

As far as I can tell the only reason to keep using it is because we support what they are trying to do and want to inject momentum into the project rather than remove it. That is a good reason, but given that we are in the very early stages of a project and don't know quite the direction we want to take it, I think it's reasonable for us to make this decision down the line.

Another reason for waiting is that intake (which scivision uses heavily) is currently going through a full-scale rewrite, and the maintainers say (see docs)

All old “sources”, whether still working or not, should be considered deprecated.

My instinct is to wait and see how this plays out and if/when scivision gets updated. [EDIT: my worry is that it just won't.. see observation at the end of this comment ]

Finally, the single model in the scivision catalogue that we are interested in locks us into a very old version of Python (3.9).

What changes

I forked the cefas/turing model, updated it to Python 3.12 and stripped out all the stuff that either didn't work or that we didn't need.

So at first approximation the only changes in the code should be that instead of loading the pretrained model from the original repo via scivision.load_pretrained_model, we instead just pip install the fork.

I also found that scivision.load_dataset can be trivially replaced by intake.open_catalog (see comment)

I concur with the "remove scivision, for now" take, and here's the longer, philosophical essay version of why: thank you for the PR. I'm happy to merge, waiting to do this in sequence starting before the layout changes.

It was helpful for notebook prototyping and exploration, but ended up brittle to work with (the older python dependency; the requirement for certain EXIF headers to be present to use the data loading code in the model)
Though its central repo still looks relatively active it's not clear what the roadmap is for longer term preservation / continuous update. It's a conversation to try having with Turing Inst folks during RSECon.
This plankton model is a stopgap, there's new work from Turing/Cefas with a choice of a couple of different architectures to try (release pending on a publication by @noushineftekhari that's due out soon!) as well as options for more "foundation model" setups like #10 that we haven't explored, but could have reuse value well beyond a plankton-specific model. I don't know whether the plan is to package the new model(s) through scivision.
It's because the model is a stopgap that I don't raise any objection to keeping the fork of it in a personal rather than institutional namespace, and otherwise would!

Regarding the sweeping changes to intake though, I thought the scivision codebase and by extension this one was already on the newer Take2. I've got mixed feelings about intake as well; it seems very powerful but almost too generic, in a way that you have to relearn how to configure it slightly every time you work with it? It seemed full of promise at the start as a alternative to STAC for data without a spatio-temporal component (and I'd not come across it other than its connection to scivision).

I found that once i'd constructed a local database with filenames, option for other metadata, in order to store the embeddings, it was a lot more straightforward to keep using that than to refer back to the intake interface
I never managed to persuade intake to consume a whole s3 bucket using a wildcard in a path, or tell whether this was an overoptimistic assumption about what its interface would offer, or a storage permissions configuration issue that i'm not sure i need to understand better. So effectively ended up bypassing intake-xarray in favour of reading the metadata as CSV and the image data separately - and this counter-acts a lot of the benefits of choosing intake.

We do have spatio-temporal metadata available for this project now after #22 - and I honestly can't think of other projects in the image machine learning line of work here that don't- so it could be a lot healthier to look again at the STAC approach, try to keep it standards-oriented, and there are others within CEH whose expertise could be drawn on for an environmental sample STAC extension - some notes and links here https://github.com/NERC-CEH/plankton_ml/issues/4

NERC-CEH / plankton_ml

Drop scivision #24

Why?

What changes