Brainstorming pipelines to achieve ground beetle identification using AI/ML models

iaviney commented 2 months ago

Hi all,

In this issue I propose a potential pipeline for ground beetle identification. After the pipeline, I outline a plan for analyzing the image data in the Beetlepalooza 2024 dataset. Below that, I pose some questions and suggest some figures and supplementary data that we could make for a pipeline like this one. I do not have experience with AI/ML models, so please feel free to correct my interpretation of the steps, add steps, or suggest improvements. I found most of the resources below in this document.

=================================================

Steps to develop a pipeline that takes beetle samples from imaging to species identification:

1. Take images of beetles

Taking images of beetle individuals from multiple perspectives may better inform the algorithm or allow for increased phenotype extraction. For example, D03 NEON technicians use the enlarged trochanters on the underside of the beetle to identify them as being part of Family Carabidae.
To train their model, the EB-Net authors used images of multiple body parts (e.g. head, elytra, pronotum, mandibles) for each individual beetle.
Will the AI/ML models be affected by how close-up the image is?
According to ML-morph, lower resolution images may increase the error rate.
According to ML-morph, the ML classifier accuracy will be best for test sets with the same imaging setup (similar background and resolution) as the training images.

2. Associate a species ID with each image

Perhaps we want to associate each image with its full taxonomic hierarchy, as BioCLIP does? Perhaps just the species? Perhaps just Carabidae versus non-Carabidae?
The associated ID may depend on how accurate we can make our model.
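A minimal sketch of the three labeling options above, assuming each image comes with a taxonomy record. The `RANKS` list, the `taxon_label` and `is_carabid` helpers, and the record layout are hypothetical names for illustration; they are not taken from BioCLIP or from the NEON data.

```python
# Rank names ordered from broadest to narrowest (hypothetical record layout).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def taxon_label(record, deepest_rank="species"):
    """Join the rank values from kingdom down to `deepest_rank` into one label string."""
    stop = RANKS.index(deepest_rank) + 1
    return " ".join(record[rank] for rank in RANKS[:stop])

def is_carabid(record):
    """Binary label: Carabidae versus everything else."""
    return record["family"] == "Carabidae"

# Example record for one ground beetle species.
example = {
    "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
    "order": "Coleoptera", "family": "Carabidae", "genus": "Carabus",
    "species": "nemoralis",
}

print(taxon_label(example))            # full hierarchy, species level
print(taxon_label(example, "family"))  # stop at family
print(is_carabid(example))             # Carabidae / non-Carabidae
```

Storing the full record and deriving the label at training time would let us switch between these options without relabeling the images.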

3. Curate the image data to include markers, segmentation, and/or morphological annotations

I am less sure of how to go about this step.
Both EB-Net and ML-morph first manually position landmarks on their images, then use those images to train and test an ML model that can predict the placement of landmarks in subsequent image datasets.
ML-morph creates a model that is able to predict the location of landmarks and the shape produced by the landmarks
Perhaps we want to perform segmentation to measure morphological phenotypes of the beetles, e.g. binarize the image, use a tool like ImageJ's watershed to separate the thorax from the abdomen, use a tool like MATLAB's erosion to remove the legs, and measure the area of the remaining elytra. But this is clunky, and because erosion also shrinks the elytra, the area measurements would not be accurate. I imagine there are better segmentation techniques for these purposes, but I don't know of them.
From this step, we would designate morphological features that can be used to compare beetle images for the sake of classification. I'm not sure how we could combine phenotypic measurements gleaned from images with beetle anatomy ontologies (@JCGiron), but I would be interested in knowing. Masci et al. appear to have designed a manual GUI for "a novel ontology-guided approach to segmentation and classification of complex immunofluorescence images of the developing mouse lung".
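To make the binarize-then-erode idea concrete, here is a toy sketch in plain Python on a made-up 7x9 "image" (a dark body with two thin legs). It does not use ImageJ or MATLAB; the `binarize`, `erode`, and `area` helpers and the threshold of 128 are assumptions for illustration only.

```python
def binarize(gray, threshold):
    """Mark pixels darker than `threshold` as foreground (1), the rest as background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def erode(mask):
    """One pass of binary erosion with a 3x3 square.

    A pixel survives only if it and all 8 neighbours are foreground.
    Pixels on the image border are always removed.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            if all(mask[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)):
                out[r][c] = 1
    return out

def area(mask):
    """Count foreground pixels."""
    return sum(map(sum, mask))

# Toy image: light background (200), dark 5x3 body plus two 2-pixel legs (20).
gray = [[200] * 9 for _ in range(7)]
for r in range(1, 6):
    for c in range(3, 6):
        gray[r][c] = 20          # body: 15 pixels
for c in (1, 2, 6, 7):
    gray[3][c] = 20              # legs: 4 pixels

mask = binarize(gray, 128)
eroded = erode(mask)
print(area(mask))    # 19 (body + legs)
print(area(eroded))  # 3  (legs gone, but the body has shrunk from 15 to 3)
```

The toy example shows the problem described above: erosion removes the legs but also eats into the body outline. A morphological opening (erosion followed by dilation) is the usual way to remove thin structures while roughly restoring the size of what remains, though it may still round off fine detail.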

4. Train the model to predict annotation location, shape, size, or other desired details using the annotated training dataset and test the model using the test dataset.

The goal is to obtain a model that can accurately predict these annotated features on unannotated beetle images.
From the predicted landmark locations and shapes, perhaps we can get phenotype/morphology measurements.
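As a sketch of how predicted landmarks could become measurements, assuming the model returns (x, y) pixel coordinates and the image scale is known: the landmark names, coordinates, and `pixels_per_mm` value below are all made up for illustration.

```python
import math

def distance_mm(p, q, pixels_per_mm):
    """Euclidean distance between two (x, y) pixel coordinates, converted to millimetres."""
    return math.dist(p, q) / pixels_per_mm

# Hypothetical predicted landmarks, in pixels (x, y).
landmarks = {
    "elytra_anterior": (100.0, 50.0),
    "elytra_posterior": (100.0, 350.0),
    "elytra_left": (40.0, 200.0),
    "elytra_right": (160.0, 200.0),
}
pixels_per_mm = 30.0  # hypothetical scale, e.g. read from a scale bar in the image

length = distance_mm(landmarks["elytra_anterior"], landmarks["elytra_posterior"], pixels_per_mm)
width = distance_mm(landmarks["elytra_left"], landmarks["elytra_right"], pixels_per_mm)

print(length)          # 10.0
print(width)           # 4.0
print(length / width)  # 2.5
```

A ratio such as length/width does not depend on the image scale, which may help when images are taken at different magnifications.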

5. Use the predicted annotations to associate beetle images with a taxon ID

I don't understand how this part works. Do we cluster the images based on their predicted annotations/measurements (e.g. cluster in a PCA), and then label beetle images with taxon IDs based on how close the images are to clusters with known IDs? Or does the ML model predict the taxon ID associated with the image the same way it predicts the placement of the landmarks?
From my understanding, any species we want to identify would need to be represented in the training dataset.
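One simple version of the "how close is it to a cluster with a known ID" idea is a nearest-centroid classifier. The sketch below is plain Python with made-up measurements (length and width in mm) and placeholder taxon names; it is one possible design, not the method used by EB-Net or ML-morph.

```python
import math
from collections import defaultdict

def fit_centroids(features, labels):
    """Average the feature vectors of each labelled taxon."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(centroids, vec):
    """Return the taxon whose centroid is closest to `vec`."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

# Made-up training measurements: [elytra length, elytra width] in mm.
train_x = [[10.0, 4.0], [10.4, 4.2], [6.0, 3.0], [6.2, 3.2]]
train_y = ["taxon_A", "taxon_A", "taxon_B", "taxon_B"]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, [9.8, 4.1]))  # taxon_A
print(predict(centroids, [6.5, 3.0]))  # taxon_B
```

Note that `predict` can only ever return a label that appeared in `train_y`, which matches the point above about species needing to be in the training data. The other design mentioned in the question also exists: an image classifier can output the taxon directly from the pixels without going through landmarks.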

=================================================

Following this pipeline using the Beetlepalooza 2024 dataset:

Use the individual beetle images from the Beetlepalooza NEON dataset as input
Species ID is already identified and associated with images
Use elytra measurements as landmarks -- I'm not sure if we can use lines as landmarks, as EB-Net and ML-morph use dots. But I believe we could replace each line with dots at its two endpoints using the position data?
Train and test ML model
Obtain beetle taxon IDs for each image from ML model landmark prediction (again, I'm not sure how this part works)
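A sketch of the line-to-dots conversion, assuming each measurement line is stored as two endpoints. The `x1`/`y1`/`x2`/`y2` field names and the coordinates are hypothetical and may not match the actual annotation format in the dataset.

```python
def line_to_landmarks(line, sort_axis):
    """Split a measurement line into its two endpoints, ordered along `sort_axis`.

    sort_axis=0 orders by x (left endpoint first);
    sort_axis=1 orders by y (top endpoint first, in image coordinates).
    """
    p = (line["x1"], line["y1"])
    q = (line["x2"], line["y2"])
    return sorted([p, q], key=lambda pt: pt[sort_axis])

# Hypothetical annotations, in pixels.
length_line = {"x1": 102, "y1": 348, "x2": 98, "y2": 52}   # drawn bottom to top
width_line = {"x1": 41, "y1": 199, "x2": 159, "y2": 203}   # drawn left to right

landmarks = (line_to_landmarks(length_line, sort_axis=1)
             + line_to_landmarks(width_line, sort_axis=0))
print(landmarks)  # [(98, 52), (102, 348), (41, 199), (159, 203)]
```

The sorting is there because landmark methods generally expect landmark number 1 to be the same anatomical point in every image, regardless of the direction in which the annotator drew the line. This simple ordering assumes the beetle is roughly upright in the image, so strongly tilted specimens would need extra handling.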

=================================================

Some questions:

Will the tilted beetles in the Beetlepalooza dataset hinder the ML model?
Does landmark prediction count as phenotype extraction? Can these measurements be associated with the Beetle Anatomy Ontology?
Are elytra width and length measurements enough to classify beetles down to a genus level? Or a species level? Do we need additional annotations?
How would this pipeline differ if we used an algorithm like OpenAI CLIP or BioCLIP? Would we need more images? Are these tools able to predict morphological measurements in addition to associating an image with an ID? Are there other kinds of image ID algorithms that don't use landmarks?
Can other types of ecological data associated with the NEON beetle samples (like sample location, sampling date, temperature/humidity on sampling date) be plugged into the model to improve its identification capabilities?
It would be helpful for me to know about the different ways that ML and AI algorithms are used to ID images, and what data the algorithms use to do it, i.e. is it based on pixel color, pixel location, pixel neighbor associations, or all of these things?
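On the last question, one partial illustration: convolutional neural networks, a common family of image models, are built from small filters that combine each pixel's value with the values of its neighbours. The sketch below applies one hand-written 3x3 filter to a toy grayscale image; in a real network the filter weights are learned from the training data rather than written by hand, and the `filter3x3` helper is an assumption for illustration.

```python
def filter3x3(image, kernel):
    """Slide a 3x3 kernel over the image and return the weighted neighbourhood sums.

    Only positions where the kernel fits entirely inside the image are computed,
    so the output is 2 rows and 2 columns smaller than the input.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            row.append(sum(image[r + dr][c + dc] * kernel[dr + 1][dc + 1]
                           for dr in (-1, 0, 1) for dc in (-1, 0, 1)))
        out.append(row)
    return out

# Toy grayscale image: dark left half (0), bright right half (10).
image = [[0, 0, 0, 10, 10, 10] for _ in range(4)]

# Hand-written filter that responds to a dark-to-bright change from left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

for row in filter3x3(image, kernel):
    print(row)  # [0, 30, 30, 0] for each of the two output rows
```

The output is zero in the flat regions and large at the boundary, so this filter uses pixel values and neighbour relationships together; where in the image the response occurs carries the location information.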

=================================================

If we were able to create a model, here are some pieces of data that I would be interested in seeing:

Comparison of ML accuracy in identifying small vs larger beetles
Comparison of ML accuracy in identifying common vs rare beetles
Analysis of ML accuracy in distinguishing very similar beetle species
Analysis of the number of images needed in the training dataset to accurately identify less common beetles
Analysis testing whether it is possible to distinguish ground beetles based on elytra width/length measurements
Analysis examining intra-species variation in morphological measurements identified by the model and whether these variations can be associated with natural selection within the species
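For the common-versus-rare comparison, accuracy would need to be reported per taxon rather than as one overall number. A minimal sketch with made-up labels and predictions; the `per_class_accuracy` helper and the taxon names are hypothetical.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly identified images for each true taxon."""
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / n for label, n in total.items()}

# Made-up test set: 8 images of a common taxon, 2 of a rare one.
true = ["taxon_common"] * 8 + ["taxon_rare"] * 2
pred = ["taxon_common"] * 8 + ["taxon_common", "taxon_rare"]

overall = sum(t == p for t, p in zip(true, pred)) / len(true)
print(overall)                         # 0.9
print(per_class_accuracy(true, pred))  # {'taxon_common': 1.0, 'taxon_rare': 0.5}
```

In this toy case the overall accuracy of 0.9 hides the fact that half of the rare-taxon images were misidentified, which is why the per-taxon breakdown matters for the comparisons listed above.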

JCGiron commented 2 months ago

I fully support this plan. Some caveats for genus/species IDs will be covered by @EvanWaite in the What It Takes to Identify a Beetle bootcamp, especially related to identifying from photos alone and/or dorsal views only.

iaviney commented 2 months ago

Awesome, thanks @JCGiron. I'm looking forward to that bootcamp! I'll definitely attend to get a better idea of the potential for individual photos to provide enough information in the ID process.

sydnerecord commented 2 months ago

Great ideas in this pipeline workflow @iaviney! We will want to keep in mind the limitations of the 2018-NEON-Beetle dataset (e.g., dorsal view only) we have on hand for the workshop and consider what future imaging efforts should include (e.g., ventral, all angles of body, etc.).
