Imageomics / Image-Datapalooza-2023

Repository for the Image Datapalooza 2023 event held at OSU in August 2023.
Creative Commons Zero v1.0 Universal

Automatic data curation and feature extraction of museum images of critically endangered mollusks #8

Open nfshoobs opened 1 year ago

nfshoobs commented 1 year ago

I'm bringing an image dataset (about 8,000 images) that contains two different angle views of about half a million specimens of North American freshwater bivalve shells. Freshwater bivalves are the most endangered animals on the planet, and many of these species have suffered serious population-level declines and range contractions over the last century. The OSU Museum of Biological Diversity Mollusk Division houses the largest freshwater bivalve collection in the world and contains about a quarter of all known museum specimens of endangered, threatened, and extinct species. We have specimens not only from the majority of watersheds in North America, but in many cases from the same sites collected at multiple time periods. This makes the OSUM Mollusk Division's collection a very powerful resource for asking questions about continental-scale changes in phenotype correlated with anthropogenic disturbance (dams, pollution) and climate change.

The dataset consists of images of whole drawers of specimens from two angles: top down and 45°. The drawers contain individual boxes of specimens called "lots". One lot is the set of all the specimens of a species collected at a single place and time. Every lot in the collection has a unique numeric catalogue number, which is printed in the top right corner of a cardstock label in the box. All images were taken using the same lighting setup and contain a Calibrite ColorChecker Nano and a QP Card QP101 Calibration Card with a mm scale bar.

(A sample from the dataset can be downloaded here)

My goal is to get help to use CV / ML methods to:

  1. Segment both images of each drawer of specimens into lots.
  2. Use OCR to capture the catalogue number of each lot from its label and add the number to the image metadata.
  3. Assign GUIDs to the images and make the dataset available online for use in morphological analysis.
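
To make the bookkeeping behind these three steps concrete, here is a minimal Python sketch using only the standard library. It is not an implementation of the segmentation or OCR themselves, and every name in it (`Box`, `top_right_region`, `parse_catalogue_number`, `LotRecord`, the file name, and the sample OCR string) is hypothetical. It assumes that a segmentation step has already produced pixel bounding boxes for each lot and its label, that an OCR engine has returned raw text for the label, and that the catalogue number is the longest run of digits in that text, which may well not hold for real labels.

```python
import re
import uuid
from dataclasses import dataclass, field
from typing import Optional, Tuple

# Hypothetical convention: (left, top, right, bottom) in image pixels.
Box = Tuple[int, int, int, int]


def top_right_region(label_box: Box, frac: float = 0.5) -> Box:
    """Return the top-right corner of a label box, where the catalogue
    number is printed. `frac` is an assumed fraction of the label size."""
    left, top, right, bottom = label_box
    width, height = right - left, bottom - top
    return (right - int(width * frac), top, right, top + int(height * frac))


def parse_catalogue_number(ocr_text: str) -> Optional[str]:
    """Pick the longest run of digits from raw OCR text.
    Assumption: the catalogue number is purely numeric and longer than
    any other number on the label (for example, a collection year)."""
    runs = re.findall(r"\d+", ocr_text)
    return max(runs, key=len) if runs else None


@dataclass
class LotRecord:
    """Metadata for one lot cropped from one drawer image."""
    drawer_image: str                # source drawer image file
    view: str                        # "top" or "45deg"
    lot_box: Box                     # where the lot sits in the drawer image
    catalogue_number: Optional[str]  # None if OCR found no digits
    # A random (version 4) UUID serves as the GUID in this sketch.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


# Made-up OCR output and file name, for illustration only.
ocr_text = "OSUM 81234\nOhio 1987"
record = LotRecord(
    drawer_image="drawer_0001_top.jpg",
    view="top",
    lot_box=(100, 50, 500, 350),
    catalogue_number=parse_catalogue_number(ocr_text),
)
```

Because each lot appears in both the top-down and 45° images, records from the two views could later be joined on `catalogue_number`, while each cropped image keeps its own GUID.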

I would definitely be interested in testing some hypotheses about the distribution of different morphological traits and color patterns using this dataset. It would be the largest dataset of its kind in existence for mollusks. Please reach out if you're interested in collaborating on some or all of this! -Nate

Originally posted by @nfshoobs in https://github.com/Imageomics/Image-Datapalooza-2023/issues/3#issuecomment-1676450117

nfshoobs commented 1 year ago

Additional info to address questions from pitch:

A major potential benefit of this research is that the resulting dataset would be a large, annotated set of expertly identified species images, which could then be used to train a model that identifies new images automatically. This would be incredibly useful for conservation and management, because the ability to identify mussel species is a rare skill and state and federal wildlife agencies have high demand for ID expertise.