Demo / lightning talk for plankton image data flow

metazool commented 4 months ago

Updated the issue title to reflect that this has grown some extra dimensions! Coming back here after some shared discussion to outline what it is we'd like to show.

The work in #5 and #6 serves as a proof of concept for minimal-effort approaches to learning from image collections without model training or costly labelling. But it sits at the edge of what's meant to be a deeper investigation of pipelines and workflows that can apply to related projects, most immediately AMI-system. This Discussion on DataLabs computer vision needs for a combined physical sample / imaging field site shows likely demand.

Putting together a short show-and-tell / demo that can be presented to the Environmental Data Science group and the research group is a nice motivator to draw a line under the low-hanging ML parts and shift focus to architecture choices and cross-project common ground:

1. model choice and overview
2. image similarity search by vector embeddings
3. unsupervised clustering approaches to the above

Of these, 2. needs expanding a bit to become more visually interesting and to probe for areas where the approach is weak. 3. we haven't tried at all; it got lost in the wash between pipeline/workflow #9 on the one hand and experimental model choice #10 on the other, but it should be quick to try (DBSCAN etc.).
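For reference, the similarity search in the list above boils down to ranking stored embeddings by how close they are to a query embedding. The project does this lookup through chromadb; the sketch below is a dependency-free illustration of the idea only. The s3 URLs and the 3-dimensional vectors are made-up example data, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=3):
    """Rank every other image by cosine similarity to the query image."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings keyed by made-up s3 URLs (hypothetical data, not real output)
EXAMPLE_EMBEDDINGS = {
    "s3://example-bucket/copepod_1.tif": [0.9, 0.1, 0.0],
    "s3://example-bucket/copepod_2.tif": [0.8, 0.2, 0.1],
    "s3://example-bucket/diatom_1.tif": [0.0, 0.1, 0.9],
    "s3://example-bucket/detritus_1.tif": [0.3, 0.9, 0.2],
}

for image_id, score in most_similar("s3://example-bucket/copepod_1.tif", EXAMPLE_EMBEDDINGS):
    print(f"{score:.3f}  {image_id}")
```

Probing for weak spots could then mean querying with images the researchers consider hard cases and checking whether the top results make sense to them.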

See also the section on transfer learning / feature extraction in this workshop paper: https://aslopubs.onlinelibrary.wiley.com/doi/full/10.1002/lno.12101#lno12101-sec-0025-title

metazool commented 3 months ago

Did a small rendering of k-means clustering of the plankton embeddings, which had visually similar outcomes to the similarity search. This is on the clustering_visualisation branch.
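For anyone unfamiliar with k-means, here is a minimal sketch of the mechanics (Lloyd's algorithm) in plain Python. It is not the code on the branch; a library implementation such as scikit-learn's `KMeans` would normally be the practical choice. The function name, the toy 2-dimensional points and the choice to pass the starting centroids in explicitly are all assumptions made for illustration.

```python
import math


def kmeans(points, initial_centroids, iterations=20):
    """Alternate between assigning points to centroids and moving centroids.

    The number of clusters is the number of initial centroids supplied.
    Returns the labels from the last assignment step and the final centroids.
    """
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster has no members
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy 2-d "embeddings" forming two obvious groups (made-up data)
points = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

The result depends on the starting centroids, so in practice they are usually chosen at random and the run repeated a few times.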

It's outgrowing a notebook; wondering if Streamlit is the right fit for this rather than shifting to JavaScript. @matthewcoole's demo of retrieval-augmented generation document search has similar components (including chromadb) https://github.com/NERC-CEH/embeddings_app/ - either repurpose this or borrow from it.

The focus is to show naively minimal output to plankton researchers and enlist their help, either in finding flaws or in working out which path is actually useful to them. Should be tightly timeboxed: ideally no more than a day, max 2...

metazool commented 3 months ago

Note to self that embeddings_app assumes some data that's generated by methods in discoverability.

This shows the use of UMAP for dimensionality reduction on embeddings, which is probably worth trying in the notebook to see whether it stops DBSCAN treating everything as noise.
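A quick diagnostic that might be worth running alongside the UMAP experiment: DBSCAN labels every point as noise when no point has enough neighbours within `eps` to count as a core point. The sketch below checks that directly by comparing `eps` against each point's k-th nearest-neighbour distance. This is the usual k-distance heuristic rather than anything from this thread, the helper names are hypothetical, and it assumes the scikit-learn convention that a point counts towards its own `min_samples`.

```python
import math


def kth_neighbour_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def all_noise(points, eps, min_samples):
    """True if no point can be a core point, so DBSCAN would label all as noise.

    Assumes a point counts towards its own neighbourhood, so a core point
    needs min_samples - 1 other points within eps.
    """
    if min_samples < 2:
        return False  # every point is trivially a core point
    k_dists = kth_neighbour_distances(points, min_samples - 1)
    # If even the best-connected point is too isolated, nothing clusters
    return k_dists[0] > eps


# Toy 2-d points: three close together and one outlier (made-up data)
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
print(kth_neighbour_distances(points, k=2))
print(all_noise(points, eps=0.5, min_samples=3))
print(all_noise(points, eps=1.0, min_samples=3))
```

Comparing the sorted k-distances before and after dimensionality reduction gives a rough idea of what `eps` range is sensible in each space.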

metazool commented 3 months ago

Another note to self that while it's not necessary now, the next visit to this should involve:

ease of pointing to a different image collection (it's already all driven from chromadb which uses URLs of objects in s3 as identifiers)
ease of pointing to a collection of different embeddings for the same image sources (whether that's BioCLIP or the more recent model the Turing Inst folks are releasing with the paper from @noushineftekhari ... )
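One possible shape for this, as a rough sketch only: treat the image collection and the embedding model as a pair, and derive the vector store collection name from both, so that switching either one is a configuration change. The class, field names and naming scheme below are hypothetical and not taken from the current code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSource:
    """One set of embeddings computed for one image collection."""

    image_collection: str  # hypothetical label for a set of images, e.g. a bucket name
    model: str  # name of the embedding model used

    @property
    def collection_name(self) -> str:
        """Hypothetical naming scheme for the vector store collection."""
        return f"{self.image_collection}__{self.model}".lower()


# Same images, two different embedding models (names are placeholders)
current = EmbeddingSource(image_collection="plankton", model="BioCLIP")
alternative = EmbeddingSource(image_collection="plankton", model="other-model")
print(current.collection_name)
print(alternative.collection_name)
```

Because the s3 URLs are already the identifiers, results from two such collections could be compared image by image.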

NERC-CEH / plankton_ml #8