Imageomics / BeetlePalooza-2024

BeetlePalooza collaborative hands-on development event to be held at OSU August 12-15
Creative Commons Zero v1.0 Universal

User Friendly Hierarchical Data Labelling? #12

Open quitmeyer opened 1 month ago

quitmeyer commented 1 month ago

Sorry if this is a double post, my internet went weird, and it seems like it didn't post. Also, I guess this is where we can share ideas and stuff to talk about at the palooza?

Challenge: Quick Taxonomical Labelling

So we have a system that collects lots of images with lots of beetles (and other critters) on them. We then trained a YOLO model that is really great at detecting critter vs. background. Now we have hired taxonomists who are willing to do the long, arduous work of looking through all our images of beetles (and other critters) and labelling them as taxonomically accurately as they can. This is to help create reference data for training AIs and, ideally, to start classifying the critters.

So for each image that has one insect on it, we need to be able to rapidly give it a label. There are some AI image labelling tools out there, like LabelImg or X-AnyLabeling, which let you load up a folder and start drawing rectangles and giving them a flat ID. But I have been looking for a nice, user-friendly way to label things for something like BioCLIP, which uses a nested hierarchy of classes.

Other Labelling Systems aren't Hierarchical

I've seen many different labeling applications out there, but haven't seen anything that would let you label hierarchically. And this is the key. We don't care as much about ID-ing specific species, and as other people mentioned there are THOUSANDS AND THOUSANDS of beetles out there, many of which you can't tell the species unless you are microscoping penises. Instead we are just looking to label things in a taxonomic hierarchy as best as possible (which is how BioCLIP works too!).

And we need to be able to do this in a quick, user-friendly way. It should have all the possible taxonomic labels from something like Tree-of-life already loaded up.
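
To make "hierarchical label" concrete, here is a rough Python sketch of one way to store such a label: a tuple of names filled in from the top rank down, stopping wherever the labeller stops. The `HierLabel` class and the `RANKS` list are made up for illustration; they are not taken from BioCLIP or any existing tool, and a real tool would load valid names from a taxonomy source.

```python
from dataclasses import dataclass

# Rank order is an assumption for illustration only.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


@dataclass(frozen=True)
class HierLabel:
    """A label filled in from the top rank down, stopping where the labeller stops."""
    names: tuple  # e.g. ("Animalia", "Arthropoda", "Insecta", "Coleoptera")

    def __post_init__(self):
        if len(self.names) > len(RANKS):
            raise ValueError("more names than ranks")

    @property
    def deepest_rank(self):
        """Name of the lowest rank that has been filled in, or None if empty."""
        return RANKS[len(self.names) - 1] if self.names else None

    def at(self, rank):
        """Name at the given rank, or None if the label does not go that deep."""
        i = RANKS.index(rank)
        return self.names[i] if i < len(self.names) else None


beetle = HierLabel(("Animalia", "Arthropoda", "Insecta", "Coleoptera"))
print(beetle.deepest_rank)  # order
print(beetle.at("family"))  # None
```

The point of the structure is that "labelled only to order" is a perfectly valid label, not a missing value.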

Interface Idea

I imagine an interface where you load a folder and it clusters images by a guessed-at similarity (as you can see, many pics will be nearly identical because they are multiple photos of the same insect that hadn't moved much).

[Image: Untitled_8]
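
As a sketch of the "cluster by guessed-at similarity" step, here is one simple approach: an average hash per image plus greedy grouping by Hamming distance. This is just one technique I am swapping in for illustration, not something the project already uses. It assumes each image has already been downscaled to the same tiny grayscale grid (a real tool would do that with an imaging library), and the file names are hypothetical.

```python
def average_hash(gray):
    """Bit tuple: 1 where a pixel is brighter than the image mean.

    `gray` is a small 2D list of grayscale values, assumed to be the
    same size for every image (e.g. 8x8 after downscaling).
    """
    flat = [v for row in gray for v in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if v > mean else 0 for v in flat)


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def group_similar(images, max_distance=2):
    """Greedy grouping: an image joins the first group whose
    representative hash is within max_distance, else starts a new group."""
    groups = []  # list of (representative_hash, [image names])
    for name, gray in images.items():
        h = average_hash(gray)
        for rep, members in groups:
            if hamming(rep, h) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((h, [name]))
    return [members for _, members in groups]


# Toy 2x2 "images": two near-identical frames and one different one.
images = {
    "moth_001.jpg": [[10, 200], [10, 200]],
    "moth_002.jpg": [[12, 198], [11, 201]],
    "beetle_001.jpg": [[200, 10], [200, 10]],
}
print(group_similar(images, max_distance=0))
# [['moth_001.jpg', 'moth_002.jpg'], ['beetle_001.jpg']]
```

Greedy grouping depends on the order images are visited, which is probably fine for sequential photos of an insect that barely moved.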

It pops up an individual pic, and you have macros that let you quickly choose the class, order, family, etc., as far down as you can, or give it other labels (smudge, dirt, wrong detection, more-than-one-creature). It would probably also have a set of most recent classifications, so you could rapidly reuse an ID that pops up over and over. And once you label something, you could maybe select a set of images in the folder and apply that same label to them.
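
The "most recent classifications" list and the "apply this label to a selection" idea could be sketched like this. `LabelSession` and the image names are hypothetical; this is only the bookkeeping, with no GUI attached.

```python
from collections import OrderedDict


class LabelSession:
    """Tracks labels per image plus a short most-recently-used list."""

    def __init__(self, max_recent=5):
        self.labels = {}             # image name -> label tuple
        self.recent = OrderedDict()  # label tuple -> None, newest last
        self.max_recent = max_recent

    def apply(self, label, image_names):
        """Apply one label to one or many selected images."""
        label = tuple(label)
        for name in image_names:
            self.labels[name] = label
        # Move the label to the "newest" end of the recent list.
        self.recent.pop(label, None)
        self.recent[label] = None
        while len(self.recent) > self.max_recent:
            self.recent.popitem(last=False)  # drop the oldest entry

    def recent_labels(self):
        """Most recent first, ready to bind to number keys or buttons."""
        return list(reversed(self.recent))


session = LabelSession(max_recent=2)
session.apply(("Insecta", "Coleoptera"), ["img_01.jpg", "img_02.jpg", "img_03.jpg"])
session.apply(("smudge",), ["img_04.jpg"])
session.apply(("Insecta", "Lepidoptera"), ["img_05.jpg"])
print(session.recent_labels())
# [('Insecta', 'Lepidoptera'), ('smudge',)]
```

Non-taxonomic labels like "smudge" are stored the same way as taxonomic ones here; a real tool might want to keep them separate.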

And again, importantly, you aren't really trying to give a specific species ID, but rather to ID down to the taxonomic level where you are 100% confident. For our work studying biodiversity data, if we can label just 50% to the family level, maybe 10% to the genus level, and the rest just to the order level, that would be huge! And even a thing that just sorted by order and let us quickly pull out just the beetles for BeetlePalooza would be great too! ;)

(And maybe there could be a system where you label down to the level where you are 100% confident, but you can also add a note or something about what you think it MIGHT be at the next level.)
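
Here is a small sketch of how a "confident down to here, but maybe X at the next level" record could look, along with a summary of how deep the confident labels go (the "50% to family, 10% to genus" kind of number). The `Annotation` fields, the shortened `RANKS` list, and the example records are all assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Shortened rank list, assumed for illustration.
RANKS = ("class", "order", "family", "genus", "species")


@dataclass
class Annotation:
    confident: tuple                  # names the labeller is sure of, top rank first
    maybe_next: Optional[str] = None  # unverified guess for the next rank down
    note: str = ""


def depth_summary(annotations):
    """Fraction of annotations whose confident label reaches at least each rank."""
    total = len(annotations)
    if total == 0:
        return {rank: 0.0 for rank in RANKS}
    reached = Counter()
    for a in annotations:
        for rank in RANKS[:len(a.confident)]:
            reached[rank] += 1
    return {rank: reached[rank] / total for rank in RANKS}


annotations = [
    Annotation(("Insecta", "Coleoptera", "Carabidae")),
    Annotation(("Insecta", "Coleoptera"), maybe_next="Scarabaeidae",
               note="looks like a scarab, antennae not visible"),
    Annotation(("Insecta", "Lepidoptera")),
    Annotation(("Insecta", "Coleoptera", "Carabidae", "Carabus")),
]
print(depth_summary(annotations))
# {'class': 1.0, 'order': 1.0, 'family': 0.5, 'genus': 0.25, 'species': 0.0}
```

Keeping `maybe_next` separate from `confident` means the guess never leaks into training data unless someone later confirms it.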

Another important thing is that this labeling system should work iteratively. That is, someone not that taxonomically talented, like me, should be able to go through a bunch of images and perhaps just group them by class or order. Then a more talented human or robot could go through and try to narrow those classifications down further to the family or genus level, and a final "expert" could ID even further or just confirm the IDs.
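
One way to sketch the iterative part: a later pass may extend an earlier label to a deeper rank or confirm it, but if it disagrees at a rank that was already filled in, the image gets flagged instead of silently overwritten. The `refine` function and its status strings are hypothetical, and "keep the old label on conflict" is just one possible policy.

```python
def refine(current, proposed):
    """Return (label, status) for a proposed update to a top-down label path.

    'refined'   : proposed extends current to a deeper rank
    'confirmed' : proposed matches current exactly
    'kept'      : proposed is consistent but shallower, so current stays
    'conflict'  : proposed disagrees at an already-filled rank; current
                  is kept and the image should go to review
    """
    current, proposed = tuple(current), tuple(proposed)
    shared = min(len(current), len(proposed))
    if current[:shared] != proposed[:shared]:
        return current, "conflict"
    if len(proposed) > len(current):
        return proposed, "refined"
    if len(proposed) == len(current):
        return current, "confirmed"
    return current, "kept"


first_pass = ("Insecta", "Coleoptera")
print(refine(first_pass, ("Insecta", "Coleoptera", "Carabidae")))
# (('Insecta', 'Coleoptera', 'Carabidae'), 'refined')
print(refine(first_pass, ("Insecta", "Hemiptera")))
# (('Insecta', 'Coleoptera'), 'conflict')
```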

There must be something like this out there right?

It feels like there should be something out there that lets you do this, but I have been asking around and haven't found anything. Most labeling software is built for things like YOLO, where you are trying to put a basic flat ID on a thing at a specific location within a bigger image. Instead, we want a hierarchical label on a whole image (no need to draw rectangles!).

But it's looking like we might have to roll our own software!?!?

sydnerecord commented 1 month ago

Nice idea! To clarify, is this more of an interface for building training data?

quitmeyer commented 1 month ago

> Nice idea! To clarify, is this more of an interface for building training data?

Yeah! To build a big collection of training data from data captured in the field.

But then it can also serve as an iterative processing step to refine data that comes from perhaps non-experts or an AI model that hasn't been trained that well yet.

1) Here are 2000 photos of random insects.

2) We have some non-experts go through and label them as well as they can (maybe just to the order level: "This looks like a beetle", "This looks like a moth"...).

3) We train an AI to do this sorting.

4) The AI (like BioCLIP) tries to classify 8000 more new images captured from the field, probably to the order level.

5) We give the 10,000 images of critters, already somewhat sorted, to some expert taxonomists who a) verify whether the original classification is correct, and b) try to add deeper classification, down to the family or genus.

6) We retrain the AI on this even better dataset.

(And on and on)
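
The loop above could be sketched roughly as follows. Everything here is a stand-in: `run_round` takes the training, prediction, and expert-review steps as plain functions, and the "model" in the example just remembers the most common label, so this shows only the data flow, not real training or BioCLIP's API. The file names and `truth` table are made up.

```python
from collections import Counter


def run_round(labelled, new_images, train, predict, expert_review):
    """One pass: train on current labels, predict on new images,
    then let an expert correct or deepen each prediction."""
    model = train(labelled)
    guesses = {name: predict(model, name) for name in new_images}
    reviewed = {name: expert_review(name, guess) for name, guess in guesses.items()}
    return {**labelled, **reviewed}


def train_majority(labelled):
    # Stand-in "model": just the most common label seen so far.
    return Counter(labelled.values()).most_common(1)[0][0]


def predict_majority(model, image_name):
    # Stand-in prediction: always answer with the majority label.
    return model


# Hypothetical expert knowledge; images not listed are accepted as guessed.
truth = {
    "c.jpg": ("Insecta", "Coleoptera", "Carabidae"),
    "d.jpg": ("Insecta", "Lepidoptera"),
}


def expert(image_name, guess):
    return truth.get(image_name, guess)


labelled = {"a.jpg": ("Insecta", "Coleoptera"), "b.jpg": ("Insecta", "Coleoptera")}
labelled = run_round(labelled, ["c.jpg", "d.jpg", "e.jpg"],
                     train_majority, predict_majority, expert)
print(labelled["c.jpg"])  # ('Insecta', 'Coleoptera', 'Carabidae')
print(labelled["d.jpg"])  # ('Insecta', 'Lepidoptera')
print(labelled["e.jpg"])  # ('Insecta', 'Coleoptera')
```

Calling `run_round` again with the enlarged `labelled` dict is the "and on and on" part.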

sydnerecord commented 1 month ago

Something similar to this idea has been done with herbarium data by Chuck Davis' group, to crowdsource information on plant phenological traits. Crowdsourced plant phenophase data are collected via Amazon's Mechanical Turk, then those data are fed into a CNN model, and Chuck is now iterating on that process: https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2020.01129/full

It would be great to do something like this for invertebrates. I bet we could get buy-in from people who work on BugGuide.