Clarifying the goals of the workshop and providing NEON-specific ideas

Hi everyone!

My name is Izzy and I am currently a first-season NEON field technician at Domain 03 in Gainesville, FL. In this post, I am (1) asking a question to clarify my understanding of our goals during the workshop, (2) providing some information regarding the NEON technician ground beetle ID process, and (3) adding a few ideas about how our workshop could speed up ground beetle identification for NEON technicians.

(1) Question: during this workshop, are we focusing specifically on how to quicken the NEON ground beetle identification process using AI/ML? Or are the workshop goals broader in scope, such as using ground beetle identification as a test-set to develop general methods of extracting information from images for the purpose of taxon identification? Both? I can imagine how these differently-focused goals might require different workshop strategies and end products.

(2) NEON technician summarized notes on the ground beetle ID process: As I don't yet have much experience IDing Carabids (AKA ground beetles, our beetle taxon of interest at NEON) in the lab, I asked some of my coworkers of varying skill levels about their experiences, in the interest of helping to shape our workshop goals. Here is a summary of what they said. I have included the full questions and answers at the bottom of this issue.

IDing the more common/easy-to-ID beetles takes approximately 5 min to 45 min
IDing the less common/harder-to-ID beetles takes approximately 45 min to 3 hrs
It is more difficult and takes longer to identify smaller beetles
Time taken to ID a beetle depends on the experience level of the NEON technician
The lead field technician and/or the taxonomist sometimes has to re-ID beetles during the QC (quality check) process

(3) Some ideas/questions with the goal of quickening the NEON technician ground beetle identification process:

During the 'sorting' phase of the NEON lab technician's ground beetle processing duties, the technician separates the ground beetles (Family = Carabidae) from the rest of the insects/debris in the sample, prior to identifying the species. If it is not possible to use AI/ML to identify ground beetles down to the species level, it would still speed up the NEON technician process for an image-fed AI/ML algorithm to be able to quickly identify a beetle as belonging to Carabidae or not, particularly for tiny or uncommon beetles. Based on the comments from NEON technicians, I think it would even be useful for an AI/ML model to narrow down the identification such that technicians can save time by skipping the higher taxonomic keys and going straight to the genus level.
With the goal of NEON technician identification in mind, I don't feel too concerned about identification of extremely difficult-to-ID species that require sex-specific observation, dissection of beetle genitalia, or DNA-based identification. The technicians do not ID at this level of detail (from my understanding) and therefore in these cases, the expert taxonomist performing beetle ID QC would take the steps required to ID these special cases.
I do not have experience training AI/ML models. I am curious whether it would be possible/useful to create a model that combines different data types to perform the identification. For example, for the purpose of IDing NEON project beetles, would including the date (of sample collection), domain (e.g. Domain 03 in the Southeast), site (e.g. the Ordway-Swisher Biological Station), plot (e.g. plot 05), and trap (e.g. West trap) information along with the image data (and taxon/morphology data for the training set) improve the ability of the algorithm to identify beetles at the genus/species level? Would including this extra data obscure the biological meaning of the images, despite the relationship between the biology and the timing/location of observation? How about other NEON data collected simultaneously to the beetle samples, such as temperature, humidity, or wind speed at the collection site? Are there known associations between beetle diversity/abundance and other ecological data that could be used to inform our AI/ML model?
How can images best be taken to serve the AI/ML algorithm identification process (e.g. top, bottom, side view; how many images; how close-up of images)? How long should it take to image 1 beetle to still save time for the lab technician?
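To make the "combining different data types" idea above more concrete, here is a minimal Python sketch of one way collection metadata could be turned into numbers and attached to image-derived features. Everything in it is an assumption for illustration: the site labels, the function names, and the placeholder image features are hypothetical, and whether this kind of input actually improves identification is exactly the open question raised above.

```python
import math
from datetime import date

# Hypothetical site labels for illustration only; a real model would use
# the actual list of sites it was trained on.
SITES = ["ordway_swisher", "site_b", "site_c"]

def encode_metadata(collect_date: date, site: str) -> list[float]:
    """Turn a collection date and site into numeric features."""
    day = collect_date.timetuple().tm_yday
    angle = 2 * math.pi * day / 365.25
    # Sine/cosine encoding keeps late December close to early January.
    seasonal = [math.sin(angle), math.cos(angle)]
    # One-hot encoding: 1.0 for the matching site, 0.0 for the others.
    one_hot = [1.0 if site == s else 0.0 for s in SITES]
    return seasonal + one_hot

def combine(image_features: list[float], metadata: list[float]) -> list[float]:
    """Concatenate image features and metadata into one model input."""
    return image_features + metadata

# Placeholder numbers standing in for features a vision model might produce.
image_features = [0.12, -0.40, 0.88]
features = combine(image_features, encode_metadata(date(2024, 6, 15), "ordway_swisher"))
print(len(features))
```

Plot, trap, or weather variables could be appended the same way. The harder question is whether the model then learns biology or just memorizes which species show up where.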

Here are the full comments from my NEON technician coworkers:

How long does it take you to ID the average beetle? Are there times when it takes a long time, and others shorter? Has it gotten a lot faster for you with more experience?

Asia Sawyer, third season NEON field technician: Usually, it can take anywhere from 15 min to 45 min to ID the more common beetles we get. The more difficult ones can sometimes take up to 2 hours depending on if they are smaller/harder to see specific parts, if I’m less familiar with that species, or if the key is very long. It can also take longer if you are less familiar with beetle anatomy because you then have to take extra time to look up what body part a key might be referring to. With more experience, it definitely goes faster, especially if you already know what family a beetle belongs to because you can skip the higher taxonomic keys and immediately go to the genus/species.
César Ortiz, second season NEON field technician: I would say the average time for an ID is around 15-30 minutes. It can be around 5 minutes for certain species, mostly the Pasimachus genus, since they are the most common and numerous we get. If a beetle is very small or is a more uncommon species it takes longer, upwards of 45 minutes or an hour for one beetle. It definitely gets faster with experience.
Jordan Ehmke, NEON field ecology lead: it depends on the genus, for our more common genus it takes me about 5 minutes but for the tougher things it can take up to an hour. It definitely does get easier with practice and learning the keys, but can be variable based upon experience of the individual.
Keirsten Kaufmann, first season NEON field technician: It might take me 5 minutes to identify Pasimachus, they are nice and easy! It has gotten faster with experience and having someone knowledgeable in the beetle lab with me. More photo references in the keys would be helpful. On another note, a tiny carabid key would be helpful since some species have a small size, maybe there could be a specific guide for those tiny guys. I haven’t really attempted a difficult one but honestly it might take me a few hours whereas a lead like Jordan could identify it in less than 30 minutes.

Do you think the smaller beetles ever get missed because they’re so tiny, or do you think we generally catch all the carabids regardless of size?

Asia Sawyer, 3rd season NEON technician: If you’re referring to ID, then I think we generally catch all of them regardless of size since you can usually still see the enlarged trochanters on even the smaller beetles. However, I’m sure some get missed from general human error during the sorting process.
Keirsten Kaufmann, first season NEON field technician: The tiny ones can be hard to tell if it’s a carabid, I also imagine some are mistakenly put as bycatch or just thrown out entirely by mistake which could be avoided if the person looks through their waste a second or third time.

[question for field leads] Do you have to spend much time re-IDing beetles that were ID’d by less experienced NEON technicians?

Jordan Ehmke, NEON field ecology lead: I do have to re-ID some beetles if there is time to do so, but if not then the taxonomist is the ultimate last say for QC-ing the ID.

Wow, Izzy! Thanks for this insightful information. For question 1, we are hoping participants can lead a bit on the exact direction of the workshop. However, here is the motivation/scope/desired outcomes we posted to the application as a reminder to show what the organizing committee has been thinking of for the event:

Motivation

The National Ecological Observatory Network (NEON) collects an unprecedented multitude of ecological and environmental data on a continental scale through field sampling and remote sensing. As part of the field sampling, NEON collects, counts, and identifies biological specimens of environmental indicator species and those filling ecologically important roles. The underlying processing of specimens is often manual and time consuming and is limited to taxonomic identification and counts. This presents a unique opportunity with potentially long-lasting impact to explore the potential and limitations of AI/ML-driven automation for biodiversity data collection efforts, including by developing and utilizing multi-modal ML computer vision models that take advantage of imagery, remote sensing, and environmental data. One taxonomic group especially ripe for proof-of-concept is beetles, one of the world's most diverse groups that serve important roles in pollination of plants and as indicator species providing early warning signals of environmental change. NEON collects ground beetles (Carabidae) from across the continental United States, Puerto Rico, and Alaska using pitfall traps. These specimens are then sorted and identified by NEON staff and other taxonomic experts in a painstaking, manual process that can take over a year and concludes with publishing counts of beetle species on NEON’s data portal. What if, rather than publishing counts of species, NEON captured and published images of beetles? Can we develop an automated process to more efficiently derive species counts from the imagery? Is it possible to use imagery to measure important characteristics (known as functional traits, such as body size) of the different beetles that are collected?

This workshop offers an opportunity to develop a proof-of-concept to demonstrate how the application of computer vision techniques could transform how ground beetle community data are collected, and thus biodiversity data more generally. Moreover, the treasure trove of data products collected at NEON sites provides the opportunity to push multi-modal model development applied to computer vision in tackling this challenge.

Desired Outcomes

We aim to facilitate outcomes that address the potential of and need for ML-ready biological image datasets to extract information about biodiversity, including (but not limited to!) the following:

A prototype workflow for extracting trait and species identifications from images of NEON beetle specimens. Our goal is for this workflow to be reproducible, follow FAIR guiding principles, and understandable by biologists and computer scientists.
Publication of open data products containing species identifications and functional traits derived from beetle specimen images.
A peer-reviewed publication describing best practices for AI/ML ready biological specimen data and images. This information will be accompanied by a white paper to be presented to NEON with advice for how to move forward with efforts to make the Observatory’s data more AI/ML ready.

Scope

We are keeping the scope of possible projects focused on the extraction of species identifications and trait measurements from beetles to maximize the limited time we have in the workshop. That notwithstanding, we expect the event to connect people with domain science-focused goals, such as biologists interested in datasets that help answer biological questions, to people with ML-focused goals, such as ML researchers interested in domain science questions for which to develop algorithms and models.

We generally expect datasets curated at or for the event, as well as tools or methods developed, to satisfy FAIR principles, and where applicable also CARE principles.

@sydnerecord This is great, thank you for posting this information! Very helpful.

@iaviney, This is great background to understand the classification work that takes place at NEON! Take a look at the What It Takes to Identify a Beetle bootcamp. It would be great to see how your workflow feeds into the whole sampling/identifying/getting data scheme and learn about the typical tools you use for beetle ID and how those can be improved. Could you share the typical workflow of what happens to a sample once you get it from the field? I'm curious about where in your typical workflow the incorporation of AI prediction would speed up the whole sorting/IDing process.

As for your comment

I am curious whether it would be possible/useful to create a model that combines different data types to perform the identification.

Earlier this week in conversations with @EvanWaite, @isabetabug, and @Chimbada, the point came up that identification tools could potentially be optimized if they are only applied to the species collected at each site, and that there are some NEON resources for this (perhaps @EvanWaite can share a link to the keys that are in development). Some limitations with rare species apply here.
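One simple reading of the site-specific idea, sketched in Python under stated assumptions: take whatever species probabilities a model produces and keep only the species on that site's list, then rescale. The function name, the species labels, and the probabilities are all hypothetical placeholders, not output from any real model or NEON species list.

```python
def restrict_to_site(probs: dict[str, float], site_species: set[str]) -> dict[str, float]:
    """Keep only species recorded at the site and rescale so they sum to 1."""
    kept = {sp: p for sp, p in probs.items() if sp in site_species}
    total = sum(kept.values())
    if total == 0:
        # None of the model's candidates are on the site list. This could be
        # a rare or newly arrived species, so return the scores unchanged
        # and leave the decision to a person.
        return dict(probs)
    return {sp: p / total for sp, p in kept.items()}

# Placeholder labels and scores for illustration.
model_probs = {"species_a": 0.5, "species_b": 0.3, "species_c": 0.2}
site_list = {"species_a", "species_c"}
site_probs = restrict_to_site(model_probs, site_list)
print(site_probs)
```

This also shows the rare-species limitation directly: a species that is missing from the site list can never be predicted, no matter how clearly it appears in the image.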

For your point

How can images best be taken to serve the AI/ML algorithm identification process (e.g. top, bottom, side view; how many images; how close-up of images)? How long should it take to image 1 beetle to still save time for the lab technician?

I think besides the informatics standpoint, a big consideration is what views and characters are needed for species (or even genus) ID. This is highly variable and taxon-specific. It would be good to hear from the informaticians about requirements for AI models.

@JCGiron I will reply to your comment more thoroughly once I get to the hotel tonight, but here is a brief response. Once we have collected a sample from the field, we perform an 'Ethanol rinse' as soon as possible, usually on the same day. The Ethanol rinse involves pouring the sample (containing organisms and debris caught in propylene glycol) over a filter, rinsing the sample with DI water to remove propylene glycol, and then storing the sample in 95% ethanol. At this point, any 'vertebrate bycatch' is removed, which could be small mammals, frogs, etc. that got stuck in the trap along with the invertebrates. Then, when there are enough lab technicians available to do so, the invertebrates from the samples are 'sorted' to separate the ground beetles (Carabidae) from the rest of the invertebrates. From my understanding, the presence of enlarged trochanters is generally the feature used to identify a beetle as a Carabid, but I can ask my coworkers for more info on that. After that, again when enough lab technicians are available, the ground beetles are ID'd to the species level using dichotomous keys. I have asked our beetle lead (Jordan) if I can share some of the ID resources we use at D03 with you all.

I am thinking that we could use the ML tools to intervene at a couple points, potentially. Perhaps the AI/ML model could be useful during the invertebrate sorting process, to verify whether a beetle is a Carabid or not. I'm thinking this could be particularly useful for the tiny beetles, as those are harder to identify as Carabids given the smaller features. Additionally, if the model is accurate enough, perhaps it could also be used for the beetle identification point, so that technicians don't have to spend as much time going through the dichotomous keys. Ideally, the model could identify a beetle species from the image, but if not, it would also be useful if the model could narrow down the specimen to a taxonomic group above the species level to lessen the number of keys the technician has to go through.
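The "species if possible, otherwise a higher group" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes a model that outputs one probability per species, plus a lookup table from species to genus, and the 0.8 confidence threshold, the function name, and the labels are all hypothetical.

```python
def most_specific_id(species_probs: dict[str, float],
                     genus_of: dict[str, str],
                     threshold: float = 0.8):
    """Return the most specific rank the model is confident about."""
    best_species = max(species_probs, key=species_probs.get)
    if species_probs[best_species] >= threshold:
        return ("species", best_species)

    # Not confident at species level: add up species scores within each genus.
    genus_probs: dict[str, float] = {}
    for sp, p in species_probs.items():
        genus = genus_of[sp]
        genus_probs[genus] = genus_probs.get(genus, 0.0) + p

    best_genus = max(genus_probs, key=genus_probs.get)
    if genus_probs[best_genus] >= threshold:
        return ("genus", best_genus)

    # Still unsure: the technician starts from the top of the key as usual.
    return ("unresolved", None)

# Placeholder labels and scores for illustration.
probs = {"genus_a sp1": 0.5, "genus_a sp2": 0.4, "genus_b sp1": 0.1}
genera = {"genus_a sp1": "genus_a", "genus_a sp2": "genus_a", "genus_b sp1": "genus_b"}
result = most_specific_id(probs, genera)
print(result)
```

In this example the model cannot choose between two species of the same genus, but their combined score is high, so the technician could skip straight to that genus key.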

That is very interesting to hear about the potential site-specific models/tools. I'm interested to hear more and talk more about this! Thank you for the info!

@Nagelle will hopefully be bringing the simplified dichotomous keys that I requested, which the field technicians in each domain use to ID beetles.

@sydnerecord Oh great! How would you like to use the keys? Are you thinking that those keys could somehow be incorporated into the computational process?

@iaviney I am hoping we can use them to ID some of the more common species (if the morphological traits can be seen dorsally). They could serve as the basis for an ID algorithm.

@sydnerecord Cool! Would this algorithm also be AI/ML or more of some sort of branching decision maker based on if-then type statements? I'm sorry if this message isn't too clear, I'm not as familiar with some coding/algorithm terminology.
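To illustrate the "if-then" option in the question above: a dichotomous key maps naturally onto a branching structure where each couplet is a yes/no question. Below is a minimal Python sketch, not anyone's planned implementation. The first question borrows the enlarged-trochanter character mentioned earlier in this thread; the second question and the taxon names are hypothetical placeholders, not entries from a real key.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Couplet:
    """One yes/no step of a key; each branch is another couplet or a result."""
    question: str
    if_yes: Union["Couplet", str]
    if_no: Union["Couplet", str]

# A tiny made-up key with two couplets.
KEY = Couplet(
    question="enlarged_trochanters",
    if_yes=Couplet(
        question="hypothetical_trait_x",
        if_yes="placeholder taxon A",
        if_no="placeholder taxon B",
    ),
    if_no="not Carabidae (bycatch)",
)

def run_key(node: Union[Couplet, str], observations: dict[str, bool]) -> str:
    """Follow the yes/no answers down the key until a result is reached."""
    while isinstance(node, Couplet):
        node = node.if_yes if observations[node.question] else node.if_no
    return node

answer = run_key(KEY, {"enlarged_trochanters": True, "hypothetical_trait_x": False})
print(answer)
```

Here the rules are written by hand and a person (or some other tool) supplies the yes/no answers. An AI/ML model would instead learn from labelled images. The two could in principle be combined, for example by having a model answer individual couplet questions from an image, but that is speculation rather than something proposed in this thread.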
