Open metazool opened 4 months ago
Did a small rendering of k-means clustering of the plankton embeddings, which had visually similar outcomes to the similarity search. This is on the clustering_visualisation branch.
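For anyone picking this up, a minimal sketch of the kind of k-means pass described above. This is not the code on the branch: the embeddings here are synthetic stand-ins from `make_blobs`, and the cluster count, the 64-dimension size and the `representatives` helper are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the plankton embeddings (assumption: 300 items, 64 dims).
embeddings, _ = make_blobs(n_samples=300, n_features=64, centers=5, random_state=0)

# Assumption: 5 clusters; the real number would need tuning against the collection.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)


def representatives(embeddings, labels, centers, per_cluster=3):
    """Hypothetical helper: indices of the items closest to each centroid."""
    reps = {}
    for k, center in enumerate(centers):
        members = np.flatnonzero(labels == k)
        dists = np.linalg.norm(embeddings[members] - center, axis=1)
        reps[k] = members[np.argsort(dists)[:per_cluster]].tolist()
    return reps


reps = representatives(embeddings, labels, kmeans.cluster_centers_)
for k, idx in reps.items():
    print(f"cluster {k}: nearest items {idx}")
```

Showing the few items nearest each centroid is one cheap way to put something visual in front of researchers; the indices would map back to image identifiers in the real data.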
It's outgrowing a notebook. Wondering if streamlit is the right fit for this rather than shifting to Javascript. @matthewcoole's demo of retrieval augmented generation document search has similar components (including chromadb) https://github.com/NERC-CEH/embeddings_app/ - either repurpose this or borrow from it.
Focus of this is to show naively minimal output to plankton researchers and enlist their help, either in finding flaws or in working out which path is actually useful to them. Should be quite timeboxed, ideally no more than a day, max 2...
Note to self that embeddings_app assumes some data that's generated by methods in discoverability
This shows use of UMAP to do dimensionality reduction on embeddings, which is probably worth trying in the notebook to see if it helps DBSCAN not to see everything as noise.
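A rough sketch of what that notebook experiment could look like, under stated assumptions: synthetic blobs stand in for the real embeddings, `eps` and `min_samples` are arbitrary starting values, and `reduce_dims` and `noise_fraction` are hypothetical helpers. It uses UMAP from the umap-learn package when that is installed and plainly falls back to PCA (a different technique) when it is not, so the comparison still runs.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

try:
    import umap  # provided by the umap-learn package

    def reduce_dims(x, n_components=2):
        # min_dist=0.0 packs points tightly, which tends to suit density clustering
        reducer = umap.UMAP(n_components=n_components, min_dist=0.0, random_state=42)
        return reducer.fit_transform(x)

except ImportError:

    def reduce_dims(x, n_components=2):
        # Fallback only: PCA is not UMAP, it just keeps the sketch runnable
        return PCA(n_components=n_components, random_state=42).fit_transform(x)


def noise_fraction(labels):
    """Share of points DBSCAN marked as noise (label -1)."""
    return float(np.mean(labels == -1))


# Synthetic stand-in for the embeddings (assumption: 500 items, 64 dims)
embeddings, _ = make_blobs(n_samples=500, n_features=64, centers=5, random_state=0)

# DBSCAN directly on the high-dimensional vectors
raw_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embeddings)

# Same DBSCAN settings after reducing to 2 dimensions
reduced = reduce_dims(embeddings)
reduced_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)

raw_noise = noise_fraction(raw_labels)
reduced_noise = noise_fraction(reduced_labels)
print(f"noise fraction, raw embeddings: {raw_noise:.2f}")
print(f"noise fraction, after reduction: {reduced_noise:.2f}")
```

The same `eps` means very different things before and after reduction, because pairwise distances in 64 dimensions are much larger than in 2, so on real embeddings both parameters would need tuning rather than copying.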
Another note to self that while it's not necessary now, the next visit to this should involve chromadb (which uses URLs of objects in s3 as identifiers)
Updated the issue title to reflect this has grown some extra dimensions! Come back here after some shared discussion and outline what it is we'd like to show
The work in #5 and #6 serves as a proof of concept of minimal-effort approaches to learning from image collections without undertaking model training or costly labelling; but it's at the edge of what's meant to be a deeper investigation of pipelines and workflows that can apply to related projects - most immediately AMI-system. This Discussion on DataLabs computer vision needs for a combination physical sample / imaging field site shows likely demand.
Putting together a short show-and-tell / demo that can be presented to the Environmental Data Science group and the research group is a nice motivator to draw a line under the low-hanging ML parts and shift focus to architecture choices and cross-project common ground.
Of these, 2. needs to be expanded a bit to become more visually interesting and to probe for areas where the approach is weak. 3. we haven't tried at all; it got lost in the wash between pipeline/workflow #9 on the one hand and experimental model choice #10 on the other, but it should be quick to try (DBSCAN etc).
See also the section on transfer learning / feature extraction in this workshop paper: https://aslopubs.onlinelibrary.wiley.com/doi/full/10.1002/lno.12101#lno12101-sec-0025-title