Imageomics / BeetlePalooza-2024

BeetlePalooza collaborative hands-on development event to be held at OSU August 12-15
Creative Commons Zero v1.0 Universal

Identifying species with images #10

Open sydnerecord opened 1 month ago

sydnerecord commented 1 month ago

How well can foundation models for the tree of life (e.g., BioCLIP) identify one of the world's most speciose taxonomic groups (beetles)? If NEON could take pictures of beetles and couple them with AI to determine species IDs, this would save hundreds of hours of field staff time each year.

JCGiron commented 1 month ago

Identifying beetles to species is challenging, especially in large groups such as ground beetles, though there are many other examples. To differentiate between species, images would HAVE TO show diagnostic features, which may be genus-specific. Furthermore, there are groups of beetles for which species can only be identified by dissecting male specimens, as the external morphology may be VERY uniform.

For instance, the images below show two different genera of water beetles.

image Habitus of Novochares sp.

image Habitus of Peltochares sp.

In my view, the question may be: what would it take for AI or computer vision to be able to recognize several species, or even genera, that may not be externally distinguishable, even when the positioning of the specimens varies? What would a training set look like? Is it possible to train a model to recognize diagnostic features to at least get to genus and morphospecies?

The images above come from this paper: Girón JC, Short AEZ (2021) The Acidocerinae (Coleoptera, Hydrophilidae): taxonomy, classification, and catalog of species. ZooKeys 1045: 1-236. https://doi.org/10.3897/zookeys.1045.63810 Figs 42 and 44.

hlapp commented 1 month ago

Great example and question @JCGiron! From the phylogeny (Fig. 2) it seems that their geographic ranges do not overlap, or at least only barely so? That is, if one had the occurrence location, would that not impose a strong prior on the species, if it were between the two?

(I'll add that the BioCLIP prediction does actually not even include either of the genera when given either of the images, cropped or not.)
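One way to make the "strong prior" idea concrete is a Bayes-style re-weighting of the image model's scores by how plausible each genus is at the collection locality. The following is only a minimal sketch: the helper `apply_location_prior` is hypothetical, and the scores and prior values are made up for illustration (they come neither from BioCLIP nor from the paper).

```python
def apply_location_prior(image_scores, location_prior):
    """Re-weight image-based scores by a location prior and renormalize.

    Both arguments map taxon name -> probability. Taxa missing from the
    prior are treated as absent at that locality (prior of 0).
    """
    weighted = {
        taxon: score * location_prior.get(taxon, 0.0)
        for taxon, score in image_scores.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # The prior rules out every candidate; fall back to the image scores.
        return dict(image_scores)
    return {taxon: value / total for taxon, value in weighted.items()}


# Hypothetical numbers for illustration only.
image_scores = {"Novochares": 0.5, "Peltochares": 0.5}
location_prior = {"Novochares": 0.95, "Peltochares": 0.05}

posterior = apply_location_prior(image_scores, location_prior)
print(posterior)
```

In this toy case the image model cannot separate the two genera, so the locality alone decides. The prior only helps to the extent that the candidate ranges really differ.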

JCGiron commented 1 month ago

@hlapp, yes, this would be an example of an "easy" geography-based solution. But within Novochares, for example, more than one species that look essentially the same externally can be found at a single locality in their range in South America, so that females, for example, cannot be identified by morphology and would need to be DNA barcoded. My guess is that the same can potentially happen with carabid species at NEON sites, but we would need to know more about the general "intra-sample" diversity of each NEON sample. At this point, at least with my limited (and taxonomically skeptical) knowledge of AI, I think the best one can hope for, given decent-quality photos, is clusters of look-alikes, but not necessarily reliable species IDs. I'm not familiar with BioCLIP. Is it a tool that needs to be fed with imagery?

hlapp commented 1 month ago

I'm not familiar with BioCLIP. Is it a tool that needs to be fed with imagery?

Yes. See BioCLIP website for more on the model and the dataset with which it was trained. A demo app is on Hugging Face. (There's also a Python package if you were interested in building it into a pipeline.)

JCGiron commented 1 month ago

Thank you, @hlapp!!

I did a few tests with dorsal views of beetles from publications and the results are not terrible (at least it recognizes that they are beetles), but also not great (none of the submitted ones recovered the genus and in only one case the family was correct, so there is variability). Perhaps this could be a project idea: run tests on beetle images to see how accurately the app can recognize the taxonomic group, potentially feeding the tool to improve accuracy? A source for images can be the Plazi taxonomic treatments dataset from this thread.
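A test like this could be scored per taxonomic rank, so that "right order, wrong family" is counted separately from "right genus". Below is a minimal sketch under stated assumptions: the function `rank_accuracy` and the record layout are hypothetical, and the predictions are invented to mirror the pattern described above, not real BioCLIP output.

```python
RANKS = ("order", "family", "genus")


def rank_accuracy(records, ranks=RANKS):
    """Return the fraction of records whose prediction matches at each rank.

    Each record is a (true_taxonomy, predicted_taxonomy) pair of dicts
    keyed by rank name.
    """
    if not records:
        raise ValueError("need at least one record")
    accuracy = {}
    for rank in ranks:
        hits = sum(
            1
            for truth, predicted in records
            if truth.get(rank) is not None
            and truth.get(rank) == predicted.get(rank)
        )
        accuracy[rank] = hits / len(records)
    return accuracy


# Hypothetical test results for illustration only.
records = [
    ({"order": "Coleoptera", "family": "Hydrophilidae", "genus": "Novochares"},
     {"order": "Coleoptera", "family": "Hydrophilidae", "genus": "Tropisternus"}),
    ({"order": "Coleoptera", "family": "Carabidae", "genus": "Harpalus"},
     {"order": "Coleoptera", "family": "Tenebrionidae", "genus": "Eleodes"}),
    ({"order": "Coleoptera", "family": "Carabidae", "genus": "Selenophorus"},
     {"order": "Coleoptera", "family": "Chrysomelidae", "genus": "Altica"}),
    ({"order": "Coleoptera", "family": "Carabidae", "genus": "Mecyclothorax"},
     {"order": "Coleoptera", "family": "Staphylinidae", "genus": "Philonthus"}),
]

print(rank_accuracy(records))
```

Reporting accuracy per rank makes it easier to see where along the hierarchy the predictions start to fail.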

EvanWaite commented 1 month ago

I, likewise, share Jennifer’s questions about species IDs from photos. There are over 2,500 species of carabids in the US and even with a specimen in hand, some of them can be extremely hard to identify (dissecting genitalia, differences in microsculpture, number of hairs under the tarsi, characters on the ventral aspect of the beetle, etc.). Other factors like whether the specimen is pinned, what position it’s in, and how standardized these individuals are could contribute a lot of complexity to this process.

I know there are moderately successful AI identification resources for insects (iNaturalist comes to mind), but they’re built upon training sets of many, many observations. With these NEON samples, there’s no guarantee of getting a series of a certain species. One case that comes to mind is the Harpalus subgenus Megapangus. There are 2 species that are virtually identical (differing in genitalia as well as small punctures on the head): H. caliginosus is widespread and quite common, and then there’s also H. katiae, which is much rarer. NEON has almost 400 caliginosus but only 5 katiae. With a sample size like that, the training set is not going to be very robust for katiae, and it’s possible we’d be overlooking similar taxa.

image H. caliginosus

image H. katiae
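To put a number on that imbalance, here is a minimal sketch of the common "balanced" inverse-frequency class weighting, weight = N / (K * n_c), where N is the total image count, K the number of classes, and n_c the count for class c. The helper name is hypothetical, and 400 stands in for "almost 400". Re-weighting does not add information about the rare species; it only shows how lopsided the training signal would be.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count) per class."""
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every class needs at least one example")
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Approximate specimen counts taken from the comment above.
counts = {"H. caliginosus": 400, "H. katiae": 5}

weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts each katiae image would carry 80 times the weight of a caliginosus image, which means a handful of photos would dominate whatever the model learns about that species.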

Additionally, I think new taxa at a site or for the project might cause a hangup as well. There was a newly described (Messer and Raber 2021) species collected at the D14 site in Tucson last year and the external feature to separate it from others in the genus is “Microsculpture cells stretched transversely 2:1-3:1”, which is not something a dorsal habitus photo would show. That was just a new species for the project, but NEON has also provided specimens of species new to science (Liebherr and Will 2022) from the genus Mecyclothorax, which has an incredible 239 species in Hawaii. I’d be curious how AI would handle things like this.

image Selenophorus pumilus is in the red box. Shown alongside seven other Selenophorus sp. at the same site

image

Figure 1. From Liebherr and Will 2022 with the new Mecyclothorax spp.

Refs:

Messer, P. W., & Raber, B. T. (2021). A review of Nearctic Selenophorus Dejean (Coleoptera: Carabidae: Harpalini) north of Mexico with new species, new synonyms, range extensions, and a key. The Coleopterists Bulletin, 75(1), 9-55.

Will, K., & Liebherr, J. K. (2022). Two new species of Mecyclothorax Sharp, 1903 (Coleoptera: Carabidae: Moriomorphini) from the Island of Hawai‘i. The Pan-Pacific Entomologist, 98(1), 1-17.

Photos: Jordan Patterson

JCGiron commented 1 month ago

@hlapp or others familiar with BioCLIP: Is there a way to know how many species of beetles or carabids were used in the training set? iNat21 contains 663,682 images corresponding to 2,526 species of insects (no idea how many beetles/carabids), but I couldn't find numbers for TreeOfLife-10M. Would it be worth creating a library of carabid images to feed BioCLIP?

hlapp commented 1 month ago

Is there a way to know how many species of beetles or carabids were used in the [BioCLIP] training set?

cc @egrace479

iNat21 contains 663,682 images corresponding to 2,526 species of insects (no idea how many beetles/carabids), but I couldn't find numbers for TreeOfLife-10M. Would it be worth creating a library of carabid images to feed BioCLIP?

Possibly. Especially if annotated with trait descriptions they could be quite helpful, in the sense of adding to what we have. There are a lot of images aggregated through GBIF and EOL, and those we would normally already collect. (And there's BIOSCAN-1M and, more recently, a larger version, focusing on insects. Not sure how many beetles, and in particular carabids, are in there.)

thompsonmj commented 1 month ago

Is there a way to know how many species of beetles or carabids were used in the [BioCLIP] training set?

Per the TreeOfLife-10M catalog [here], there are 46,734 entries where Family is "Carabidae" in the train split.

thompsonmj commented 1 month ago

For all beetles (order "Coleoptera"), there are 401,657 total in the train split. Within this order, there are 34,949 unique 'Genus species' binomials.

More generally to address the question, you can load the catalog.csv into a Pandas DataFrame and ask any questions you like.
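As a rough illustration of that kind of query, here is a minimal sketch that runs on a tiny in-memory stand-in for the catalog. The column names (`split`, `order`, `family`, `genus`, `species`) are assumptions and should be checked against the header of the real catalog.csv, and the helper `beetle_summary` is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for catalog.csv so the example runs on its own.
# Column names are assumptions; check them against the real file.
TOY_CATALOG = """split,order,family,genus,species
train,Coleoptera,Carabidae,Harpalus,caliginosus
train,Coleoptera,Carabidae,Harpalus,caliginosus
train,Coleoptera,Carabidae,Harpalus,katiae
train,Coleoptera,Hydrophilidae,Novochares,
val,Coleoptera,Carabidae,Selenophorus,pumilus
train,Hymenoptera,Apidae,Bombus,impatiens
"""


def beetle_summary(catalog, split="train"):
    """Count beetle rows, carabid rows, and unique binomials in one split."""
    subset = catalog[catalog["split"] == split]
    beetles = subset[subset["order"] == "Coleoptera"]
    carabids = beetles[beetles["family"] == "Carabidae"]
    # Only rows labeled down to species contribute a 'Genus species' binomial.
    named = beetles.dropna(subset=["genus", "species"])
    binomials = named["genus"] + " " + named["species"]
    return {
        "coleoptera_rows": len(beetles),
        "carabidae_rows": len(carabids),
        "unique_binomials": int(binomials.nunique()),
    }


catalog = pd.read_csv(io.StringIO(TOY_CATALOG))
summary = beetle_summary(catalog)
print(summary)
```

To run this against the real data, pass the path of the downloaded catalog.csv to `pd.read_csv` in place of the `io.StringIO` object.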

thompsonmj commented 1 month ago

Some more interesting beetles stats from TreeOfLife-10M (how many from each data source):

>>> print(f"Coleoptera from EOL: {eol_count}")
Coleoptera from EOL: 298929
>>> print(f"Coleoptera from BIOSCAN: {bioscan_count}")
Coleoptera from BIOSCAN: 44928
>>> print(f"Coleoptera from iNat21: {inat21_count}")
Coleoptera from iNat21: 57800
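The snippet above shows only the print statements, not how the counts were obtained. One possible way to derive them is sketched below, assuming the catalog has one identifier column per data source that is filled only for rows from that source. The column names `eol_content_id`, `bioscan_filename` and `inat21_filename` are assumptions to verify against the real file, and `counts_by_source` is a hypothetical helper.

```python
import io

import pandas as pd

# Toy stand-in for the catalog; column names are assumptions.
TOY_CATALOG = """order,eol_content_id,bioscan_filename,inat21_filename
Coleoptera,101,,
Coleoptera,102,,
Coleoptera,,b1.jpg,
Coleoptera,,,i1.jpg
Diptera,103,,
"""

# Assumed mapping from data source to the column that identifies it.
SOURCE_COLUMNS = {
    "EOL": "eol_content_id",
    "BIOSCAN": "bioscan_filename",
    "iNat21": "inat21_filename",
}


def counts_by_source(catalog, order="Coleoptera"):
    """Count rows of one order per source, using non-null identifier columns."""
    rows = catalog[catalog["order"] == order]
    return {
        source: int(rows[column].notna().sum())
        for source, column in SOURCE_COLUMNS.items()
    }


catalog = pd.read_csv(io.StringIO(TOY_CATALOG))
source_counts = counts_by_source(catalog)
for source, count in source_counts.items():
    print(f"Coleoptera from {source}: {count}")
```

As a consistency check on the figures reported in the thread, 298929 + 44928 + 57800 = 401657, which matches the Coleoptera total given earlier for the train split.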

sydnerecord commented 1 month ago

My aim here is really to consider whether we can train an AI model to classify beetles as well as a non-expert NEON field technician. At the NEON field sites, some initial sorting of beetles is done by the field crew. If we could even help to distinguish those taxa, it would be a great step forward. NEON staff will be providing the abridged site-specific taxonomic keys that they use to do this identification. I'm hoping to have those in hand to share by next Monday. Perhaps we could use those keys to help us see if we can train an AI model to ID those 'easier' taxa.

egrace479 commented 1 month ago

67 of the 78 unique species (determined through <genus> <species> scientific name matching) from the Beetle dataset are in TreeOfLife-10M labeled to that level (2,968 images). 34 of the 36 genera are represented in TreeOfLife-10M (27,577 images). Beetle_sciName_overlap_tol.csv Beetle_genus_overlap_tol.csv

It looks like there are about 2500 beetles whose scientific name designation does not match the <genus> <species> in TreeOfLife-10M exactly; however, only 8 of them belong to the 2 unmatched genera:

Screenshot 2024-08-06 at 5 51 04 PM

There are also 17 beetles that do not have taxonomic information from NEON.
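The name matching described above can be sketched roughly as follows. This is not the script that produced the attached CSVs. The helper names are hypothetical, and the whitespace and case normalization is an addition here (the comment above describes exact matching); the example names are a small hand-picked set for illustration.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivial differences still match."""
    return " ".join(name.split()).lower()


def overlap_report(beetle_names, catalog_names):
    """Compare two lists of '<genus> <species>' names at species and genus level."""
    # Skip empty entries, e.g. specimens without taxonomic information.
    beetle_species = {normalize(n) for n in beetle_names if n and n.strip()}
    catalog_species = {normalize(n) for n in catalog_names if n and n.strip()}
    # The genus is taken to be the first word of each name.
    beetle_genera = {n.split()[0] for n in beetle_species}
    catalog_genera = {n.split()[0] for n in catalog_species}
    return {
        "species_matched": sorted(beetle_species & catalog_species),
        "species_unmatched": sorted(beetle_species - catalog_species),
        "genera_matched": sorted(beetle_genera & catalog_genera),
        "genera_unmatched": sorted(beetle_genera - catalog_genera),
    }


# Small illustrative inputs, not the real datasets.
beetle_names = ["Harpalus caliginosus", "Harpalus  katiae", "Selenophorus pumilus", ""]
catalog_names = ["Harpalus caliginosus", "Harpalus pensylvanicus", "Novochares sp."]

report = overlap_report(beetle_names, catalog_names)
for key, values in report.items():
    print(key, values)
```

A species can be unmatched while its genus is still represented, which is why the species-level and genus-level overlaps are reported separately.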

JCGiron commented 1 month ago

@egrace479, thank you for that thorough summary!

Is there a way to see those images? At least some of them, just to get an idea of what the model is feeding on? I have no idea how to explore the data in the platform.

From @sydnerecord

My aim here is really to consider if we can train an AI model to classify beetles as well as a non-expert NEON field technician.

So the idea is that the AI helps with sorting to morphospecies, but does not necessarily assign species names. Correct? If so, where in the process would the AI intervene? Before sending specimens for accurate ID? Would this actually save time, given that the manual process of generating images to get them sorted by the AI and getting traits measured would still be needed?

egrace479 commented 1 month ago

@JCGiron

Is there a way to see those images? At least some of them, just to get an idea of what the model is feeding on? I have no idea how to explore the data in the platform.

Unfortunately, the HF dataset viewer can be a bit finicky with larger datasets (at least it's a feature they keep improving!). We do have a demo (still in dev-mode) that will return a randomly selected sample image of the predicted taxa from the EOL portion of the dataset. There are only a few samples per species, but most of the genera are from EOL, so it could give some sample images for now.

sydnerecord commented 1 month ago

@JCGiron I agree that the morphospecies identification would need to be coupled with automated processing of specimens. This is something I am exploring with another project.

JCGiron commented 1 month ago

Here is a resource that might be worth looking into. It is for identifying species of bees: "BeeMachine uses a convolutional neural network, modified from EfficientNetV2, and was trained on over 1.2 million images." Spiesman, B.J., Gratton, C., Hatfield, R.G. et al. (2021) Assessing the potential for deep learning and computer vision to identify bumble bee species from images. Scientific Reports 11, 7580. https://doi.org/10.1038/s41598-021-87210-1