Closed: metazool closed this 4 months ago
current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 98    | 83      | 85%      | 0%        | 🟢     |

New Files

| File | Coverage | Status |
|------|----------|--------|
| cyto_ml/data/s3.py | 0% | 🟢 |
| cyto_ml/tests/test_image_embeddings.py | 100% | 🟢 |
| cyto_ml/tests/test_vector_store.py | 100% | 🟢 |
| TOTAL | 67% | 🟢 |

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| cyto_ml/data/intake.py | 0% | 🟢 |
| cyto_ml/data/vectorstore.py | 81% | 🟢 |
| cyto_ml/models/scivision.py | 95% | 🟢 |
| TOTAL | 59% | 🟢 |

updated for commit: b2d2aa1 by action 🐍
I'm keen to merge this PR on the understanding that it's still a rough prototype; #8 isn't actionable otherwise. The same goes for the next one, #7: though I still have doubts about its validity, it was a useful exercise, and there are small refactors and improved test coverage along with the inconclusive notebook.
@jmarshrossney / @albags you've both kindly cast eyes on this experiment - are you prepared to approve it?
Thank you @jmarshrossney, much appreciated - and good call on proper dependency pinning; will fix with any upcoming work.
Just to add - I changed the object store layout and got rid of the metadata bucket! The object store layout is now:
- `untagged-images-lana`
- `untagged-images-wala`
- `tagged-images-lana`
- `tagged-images-wala`
Inside `tagged-images-lana` and `tagged-images-wala` there is a `metadata.csv` file and a `taxonomy.csv` file.
- `lana` refers to Lancaster A, the FlowCam images
- `wala` refers to Wallingford A, Isabelle's flow cytometer images
I've kept the `tagged-images` and `untagged-images` buckets for now, as I know this repo is using them, but if you could change to using `untagged-images-lana` and `tagged-images-lana`, that would be great; once that's done I'll get rid of these outdated buckets.
The reason for doing this is to keep the images and metadata in separate workflows so that Isabelle can start to use the app to tag flow cytometer images.
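For reference, a minimal sketch of reading the new layout via `s3fs`; the endpoint and credential environment variable names here are assumptions for illustration, not confirmed anywhere in this thread:

```python
import os
import s3fs

# Sketch only: the credential/endpoint variable names are hypothetical
fs = s3fs.S3FileSystem(
    key=os.environ.get("AWS_ACCESS_KEY_ID"),
    secret=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    client_kwargs={"endpoint_url": os.environ["AWS_URL_ENDPOINT"]},
)

# The tagged buckets carry metadata.csv and taxonomy.csv alongside the images
print(fs.ls("tagged-images-lana"))
with fs.open("tagged-images-lana/metadata.csv", "r") as f:
    print(f.readline())
```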
Hi Jo.
Thanks for all of your work on this project. It's been fun and challenging (in a fun way) to figure out what's going on with the code and the JASMIN object storage.
I notice that the script `intake_metadata.py` is broken on `main` now, since `metadata/metadata.csv` no longer exists in the object store, which is a symptom of how long it's taken me to review this PR - sorry! So yeah, I agree that we should merge this, with a couple of very minor tweaks (see comments) and with a pin in dependency specification, which can be sorted in a future PR.
Hi @Kzra! Thank you for unpacking the changes - hadn't clocked that there was active work on the annotation side of the project.
> Just to add - I changed the object store layout and got rid of the metadata bucket!
> I notice that the script `intake_metadata.py` is broken on `main` now, since `metadata/metadata.csv` no longer exists
I found having a writeable bucket (with the permissions I was offered) useful to store two files that were generated to serve as a catalogue interface, using the `intake` package. I didn't document this other than in the script above, though, and should write something down longhand.
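To make the shape of that concrete, a minimal sketch of what such an `intake` catalogue interface can look like; the catalogue filename and source name are placeholders, not the actual files that lived in the now-removed bucket:

```python
import intake

# "catalog.yml" stands in for one of the two generated catalogue files
catalog = intake.open_catalog("catalog.yml")

# Enumerate the sources the catalogue exposes, then read one
print(list(catalog))
metadata = catalog["metadata"].read()  # hypothetical source name
```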
Can we persuade you to adopt a git branch-and-merge workflow for changes to `cyto-app`? The EDS RSE group don't currently have a set of recommendations and hints for this, but I can draft some based on the guidance used in a previous team. If you make changes on a branch and merge them into your project via pull request, even if you haven't got someone lined up to peer review, you could tag one of us to offer a heads-up on potentially breaking changes. I'll add this to my list of bootstrap documentation to (re)write.
See also https://github.com/NERC-CEH/plankton_ml/issues/12, covering next steps for this part of the project.
Never done this before, but if you write some guidance I can follow the instructions! There will be quite a few changes to the app over the coming weeks as we begin testing in earnest. The object store layout might expand, but the `tagged-images-lana` etc. buckets won't change or be deleted.
This builds up the demo to show a reasonable proof of concept, for a human viewer, of enough success in extracting image embeddings from the `scivision` plankton model to keep building experiments on it in the short term (see the sketch after the lists below for the general embedding idea).

What's in this
- `intake` catalogue is written to use the whole untagged image collection, not the subset of labelled ones
- `flake8` action to collect them
- `image_embeddings.py` script to run the whole collection through the model

What isn't in this
- Any kind of robust approach to spreading the workload around with Dask; partly the collection isn't big enough to justify it yet, partly I haven't worked with Dask enough yet to understand the best approach, and I would appreciate any advice on that topic, including from the DevOps folks (see the notes in comments in `scripts/image_embeddings.py`).
- Any exploration of model explainability techniques using the prediction capabilities of the CEFAS model, as opposed to using it as a source of embeddings. That's a good place to visit next, before trying any clustering algorithms on the embeddings: step back and observe whether what the model is seeing is properly coherent, whether it aligns with the CEFAS reference data, or whether there are factors like image dimensions giving a false-positive impression of these initial results.
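For readers new to the idea, here is a generic sketch of what "using a classifier as a source of embeddings" means, built on a stock torchvision model rather than the actual CEFAS/`scivision` wiring in this repo:

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# Illustrative stand-in for the plankton model: a pretrained ResNet-18
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Keep everything up to (but not including) the final classification layer,
# so the network outputs a feature vector instead of class scores
embedder = torch.nn.Sequential(*list(model.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def embed(path: str) -> torch.Tensor:
    """Return a flat embedding vector for a single image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return embedder(img).flatten()  # 512-dim vector for ResNet-18
```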
To test

Either `export PYTHONPATH=.` or `pip install -e .`, then run `py.test`.
You may need the right credentials in `.env` to be able to run the scripts which generate the index and subsequently the embeddings. I've only run this locally; there are 8k images. I should set it up to be able to reproduce this using only the images from @Kzra that are in the test fixtures.
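One possible shape for that, as a sketch: a pytest fixture that feeds the pipeline only the local fixture images, so the run doesn't depend on object store credentials. The directory name and file extensions here are assumptions for illustration:

```python
import os
import pytest

@pytest.fixture
def fixture_images():
    """Paths of the small set of test images shipped with the repo."""
    fixture_dir = os.path.join(os.path.dirname(__file__), "fixtures")
    return [
        os.path.join(fixture_dir, name)
        for name in sorted(os.listdir(fixture_dir))
        if name.lower().endswith((".tif", ".png", ".jpg"))
    ]

def test_embeddings_from_fixtures(fixture_images):
    # Placeholder assertion; the real test would run the embedding
    # pipeline over these paths instead of the full 8k-image collection
    assert fixture_images
```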