metazool commented 4 months ago

This builds up the demo into a reasonable proof of concept: it shows a human viewer that extracting image embeddings from the scivision plankton model works well enough to keep building experiments on it in the short term.

What's in this

Adds some utility functions to make working with the image embeddings simpler
Changes the way the intake catalogue is written to use the whole untagged image collection, not the subset of labelled ones
Adds a bit more test coverage where it was spotty, and moves the tests into the package for the flake8 action to collect them
Adapts the image_embeddings.py script to run the whole collection through the model
Adds a notebook showing the outcome of similarity search of the embeddings, which looks plausible, and some notes on next steps, with a link to a related paper that works through unsupervised clustering approaches
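
For anyone unfamiliar with the idea, here is a minimal sketch of what a similarity search over embeddings does. It assumes cosine similarity and plain Python lists; the notebook itself goes through the project's vector store, and the function names and toy file names below are hypothetical, for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as dissimilar to everything
        return 0.0
    return dot / (norm_a * norm_b)


def closest(query_id, embeddings, k=24):
    """Return the ids of the k embeddings most similar to query_id, best first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), other_id)
        for other_id, vector in embeddings.items()
        if other_id != query_id  # skip the query image itself
    ]
    scored.sort(reverse=True)
    return [other_id for _, other_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real model output has far more dimensions.
embeddings = {
    "a.tif": [1.0, 0.0, 0.0],
    "b.tif": [0.9, 0.1, 0.0],
    "c.tif": [0.0, 1.0, 0.0],
    "d.tif": [-1.0, 0.0, 0.0],
}
neighbours = closest("a.tif", embeddings, k=2)
```

A brute-force scan like this is fine for a collection of a few thousand images; a vector store does the same job with an index behind it.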

What isn't in this

Any kind of robust approach to spreading the workload around with Dask. Partly the collection isn't big enough to justify it yet, and partly I haven't worked with Dask enough yet to understand the best approach. I would appreciate any advice on that topic, including from the DevOps folks (see the notes in comments in scripts/image_embeddings.py).

Any exploration of model explainability techniques using the prediction capabilities of the CEFAS model, as opposed to using it as a source of embeddings. That's a good place to visit next, before trying any clustering algorithms on the embeddings: step back and observe whether what the model is seeing is properly coherent, whether it aligns with the CEFAS reference data, or whether factors like image dimensions are giving a falsely positive impression of these initial results.

A plot of the closest 24 samples to one picked at random

To test

export PYTHONPATH=.  # or: pip install -e .
py.test

You may need the right credentials in .env to be able to run the scripts which generate the index and subsequently the embeddings. I've only run this locally; there are 8k images. I should set it up so this can be reproduced using only the images from @Kzra that are in the test fixtures.

github-actions[bot] commented 4 months ago

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
98	83	85%	0%	🟢

New Files

File	Coverage	Status
cyto_ml/data/s3.py	0%	🟢
cyto_ml/tests/test_image_embeddings.py	100%	🟢
cyto_ml/tests/test_vector_store.py	100%	🟢
TOTAL	67%	🟢

Modified Files

File	Coverage	Status
cyto_ml/data/intake.py	0%	🟢
cyto_ml/data/vectorstore.py	81%	🟢
cyto_ml/models/scivision.py	95%	🟢
TOTAL	59%	🟢

updated for commit: b2d2aa1 by action🐍

metazool commented 4 months ago

I'm keen to merge this PR, on the understanding that it's still a rough prototype; #8 is not actionable otherwise. The same goes for the next one, #7: though I still have doubts about its validity, it was a useful exercise, and there are small refactors and improved test coverage along with the inconclusive notebook.

@jmarshrossney / @albags you've both kindly cast eyes on this experiment; are you prepared to approve it?

metazool commented 4 months ago

Thank you @jmarshrossney much appreciated - and good call on proper dependency pinning, will fix with any upcoming work.

#10 (support different models and try BioCLIP for embeddings, if that's runnable without a GPU back) is where I plan to look next, but I'm clearing a documentation backlog first

Kzra commented 4 months ago

Just to add - I changed the object store layout and got rid of the metadata bucket! The object store layout is now:

untagged-images-lana
untagged-images-wala
tagged-images-lana
tagged-images-wala

Inside tagged-images-lana and tagged-images-wala there is a metadata.csv file and taxonomy.csv file.

lana refers to Lancaster A, the flow cam images
wala refers to Wallingford A, Isabelle's flow cytometer images

I've kept the tagged-images and untagged-images buckets for now as I know this repo is using them, but if you could change to using untagged-images-lana and tagged-images-lana, that would be great. Once that's done I'll get rid of these outdated buckets.

The reason for doing this is to keep the images and metadata in separate workflows so that Isabelle can start to use the app to tag flow cytometer images.
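
The layout described above could be captured in a small helper so that scripts don't hard-code bucket names. This is only a sketch: the function names are hypothetical, and it assumes metadata.csv and taxonomy.csv sit at the top level of each tagged bucket, which the description above implies but doesn't spell out.

```python
# Site codes taken from the description of the new object store layout
SITES = {
    "lana": "Lancaster A, flow cam images",
    "wala": "Wallingford A, flow cytometer images",
}


def bucket_name(site, tagged):
    """Build a bucket name such as 'untagged-images-lana' (hypothetical helper)."""
    if site not in SITES:
        raise ValueError(f"Unknown site code: {site!r}")
    prefix = "tagged-images" if tagged else "untagged-images"
    return f"{prefix}-{site}"


def metadata_paths(site):
    """Paths of the two CSV files kept alongside the tagged images.

    Assumption: both files live at the top level of the tagged bucket.
    """
    bucket = bucket_name(site, tagged=True)
    return [f"{bucket}/metadata.csv", f"{bucket}/taxonomy.csv"]


all_buckets = [
    bucket_name(site, tagged)
    for tagged in (False, True)
    for site in SITES
]
```

With the site code as a parameter, switching this repo from the old buckets to the -lana ones would be a one-argument change.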

Hi Jo.

Thanks for all of your work on this project. It's been fun and challenging (in a fun way) to figure out what's going on with the code and the jasmin object storage.

I notice that the script intake_metadata.py is broken on main now since metadata/metadata.csv no longer exists in the object store, which is a symptom of how long it's taken me to review this PR - sorry! So yeah I agree that we should merge this, with a couple of very minor tweaks (see comments) and with a pin in dependency specification which can be sorted in a future PR.

metazool commented 4 months ago

Hi @Kzra ! Thank you for unpacking the changes - hadn't clocked that there was active work on the annotation side of the project.

> Just to add - I changed the object store layout and got rid of the metadata bucket!

> I notice that the script intake_metadata.py is broken on main now since metadata/metadata.csv no longer exists

I found having a writeable bucket (with the permissions I was offered) useful to store two files that were generated to serve as a catalogue interface, using the intake package. I didn't document this other than in the script above though, and should write something down longhand.

Can we persuade you to adopt a git branch and merge workflow for changes to cyto-app? The EDS RSE group don't currently have a set of recommendations and hints for this, but I can draft some based on the guidance used in a previous team. If you make changes on a branch and merge them into your project via pull request, even if you haven't got someone lined up to peer review, you could tag one of us to offer a heads-up on potentially breaking changes. I'll add this to my list of bootstrap documentation to (re)write

See also https://github.com/NERC-CEH/plankton_ml/issues/12 covering next steps for this part of the project

Kzra commented 4 months ago

Never done this before, but if you write some guidance I can follow the instructions! There will be quite a few changes to the app over the coming weeks as we begin testing in earnest. The object store layout might expand, but the tagged-images-lana etc. buckets won't change or be deleted.

NERC-CEH / plankton_ml

Proof of concept of similarity search with the scivision model #5