Automatic data curation and feature extraction of museum images of critically endangered mollusks

Imageomics / Image-Datapalooza-2023

Repository for the Image Datapalooza 2023 event held at OSU in August 2023.

Creative Commons Zero v1.0 Universal

I'm bringing an image dataset (about 8,000 images) that contains two different-angle views of about half a million specimens of North American freshwater bivalve shells. Freshwater bivalves are the most endangered animals on the planet, and many of these species have suffered serious population-level declines and range contractions over the last century. The OSU Museum of Biological Diversity Mollusk Division houses the largest freshwater bivalve collection in the world, which contains about a quarter of all known museum specimens of endangered, threatened, and extinct species. We have specimens not only from the majority of watersheds in North America, but in many cases from the same sites collected at multiple time periods. This makes the OSUM Mollusk Division's collection a very powerful resource for asking questions about continental-scale changes in phenotype correlated with anthropogenic disturbance (dams, pollution) and climate change.

The dataset consists of images of whole drawers of specimens from two angles: top-down and 45°. The drawers contain individual boxes of specimens called "lots". One lot is the set of all the specimens of a species collected at a single place and time. All lots in the collection have a unique numeric catalogue number, which is printed on the top right corner of a cardstock label in the box. All images were taken using the same lighting setup and contain a Calibrite ColorChecker Nano and a QP Card QP101 Calibration Card with mm scale bar.
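Because every image includes a mm scale bar, pixel measurements can be converted to physical units. The following is a minimal sketch of that arithmetic, assuming the two endpoints of a known length on the scale bar have already been located in pixel coordinates; the function names and the sample coordinates are hypothetical. It is only meaningful for the top-down view, since the 45° view is foreshortened and would first need a perspective correction.

```python
import math

def mm_per_pixel(p0, p1, known_mm):
    """Scale factor from two pixel points a known distance apart on the scale bar."""
    pixel_dist = math.dist(p0, p1)
    if pixel_dist == 0:
        raise ValueError("scale bar endpoints must be distinct")
    return known_mm / pixel_dist

def box_size_mm(bbox, scale):
    """Convert an (x_min, y_min, x_max, y_max) pixel box to (width_mm, height_mm)."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_max - x_min) * scale, (y_max - y_min) * scale)

# Hypothetical values: endpoints of a 50 mm stretch of the scale bar, in pixels,
# and the pixel bounding box of one segmented lot.
scale = mm_per_pixel((100, 200), (600, 200), known_mm=50.0)
width_mm, height_mm = box_size_mm((1000, 1500, 1800, 2100), scale)
print(f"{scale:.3f} mm/px, box is about {width_mm:.0f} x {height_mm:.0f} mm")
```

A box size in mm could then be compared against the collection's standard box sizes as a sanity check on the segmentation.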

(A sample from the dataset can be downloaded here)

My goal is to get help to use CV / ML methods to:

Segment both images of each drawer of specimens into lots.
Use OCR to capture the catalogue number of each lot from its label and add the number to the image metadata.
Assign GUIDs to the images and make the dataset available online for use in morphological analysis.
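For the last step, one option is a name-based UUID, so that re-running the pipeline on the same lot and view reproduces the same GUID. The sketch below assumes this approach and is not an established convention for the collection; the namespace URL, the view labels `"top"` and `"oblique"`, the record fields, and the file name are all hypothetical. A random `uuid.uuid4()` stored in a lookup table would work as well.

```python
import uuid

# Hypothetical namespace: any stable identifier controlled by the collection would do.
LOT_IMAGE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/lot-images")

def lot_image_guid(catalogue_number: str, view: str) -> str:
    """Return a repeatable GUID for one cropped lot image."""
    if view not in ("top", "oblique"):  # assumed labels for the two camera angles
        raise ValueError(f"unknown view: {view!r}")
    return str(uuid.uuid5(LOT_IMAGE_NAMESPACE, f"{catalogue_number}/{view}"))

def lot_image_record(catalogue_number, view, drawer_image, bbox):
    """Bundle the metadata that would travel with a cropped lot image."""
    return {
        "guid": lot_image_guid(catalogue_number, view),
        "catalogue_number": catalogue_number,
        "view": view,
        "source_image": drawer_image,
        "bbox_px": list(bbox),
    }

# Hypothetical lot: catalogue number as read by OCR, cropped from a drawer image.
record = lot_image_record("12345", "top", "drawer_0001_top.jpg", (1000, 1500, 1800, 2100))
print(record["guid"])
```

Keying the GUID on catalogue number plus view relies on each catalogue number appearing in only one box, so OCR errors would need to be caught before GUIDs are minted.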

I would definitely be interested in testing some hypotheses about the distribution of different morphological traits and color patterns using this dataset. It would be the largest dataset of its kind in existence for mollusks. Please reach out if you're interested in collaborating on some or all of this! -Nate

Originally posted by @nfshoobs in https://github.com/Imageomics/Image-Datapalooza-2023/issues/3#issuecomment-1676450117

Additional info to address questions from pitch:

All images are from the same two cameras; the two views of each drawer are taken simultaneously and contain the same color checker and scale bar.
A catalogue number is a UID, and is typewritten or printed on each label.
The labels that bear the catalogue number are generally in the same position in each box (upper left) and the catalogue number is in the same part of each label (upper left).
The catalogue numbers all have accompanying metadata in the collection database (i.e. taxonomic identification with GUID to global names architecture, collecting date, geolocation, ecological description, and condition of each shell, such as weathered or freshly dead).
No two box lots in the dataset should share a catalogue number (i.e. if two boxes have the same catalogue number, it would probably be due to a clerical error in the collection, or an error in the OCR).
The number of boxed samples per drawer varies. There can be as few as 1 (at the extreme, a single sample of many very large specimens takes up a whole drawer), or as many as ~80 (in the case of a drawer containing many very small boxed samples).
The number of individuals per boxed sample varies. It could be as few as 1, or as many as 1240 (though the majority of samples have fewer than 30 specimens). We also have a count of individuals for each catalogue number in the database.
There are fewer than 10 box sizes (there is a set of standard sizes we use), and these boxes have edges that are generally contrasty. See example files. This should help segmentation.
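Several of the points above double as consistency checks on the OCR output: catalogue numbers are numeric, each should exist in the collection database, none should repeat within a view, and each lot should show up in both views. A minimal sketch of such checks follows, assuming the OCR step yields one `(view, text)` pair per segmented box; the view labels, the function name, and the sample readings are hypothetical.

```python
from collections import Counter

VIEWS = ("top", "oblique")  # assumed labels for the top-down and 45-degree images

def check_ocr_results(detections, known_numbers):
    """Flag OCR readings that break the constraints listed above.

    detections: iterable of (view, text) pairs, one per segmented box.
    known_numbers: set of catalogue numbers (as strings) in the collection database.
    """
    problems = []
    per_view = {view: Counter() for view in VIEWS}
    for view, text in detections:
        if view not in per_view:
            raise ValueError(f"unknown view: {view!r}")
        text = text.strip()
        if not (text.isascii() and text.isdigit()):
            problems.append(("not_numeric", view, text))
            continue
        if text not in known_numbers:
            problems.append(("not_in_database", view, text))
        per_view[view][text] += 1

    # A catalogue number read twice in one view suggests a clerical or OCR error.
    for view, counts in per_view.items():
        for number, n in sorted(counts.items()):
            if n > 1:
                problems.append(("duplicate_in_view", view, number))

    # Each lot is photographed from both angles, so a number should appear in both.
    top, oblique = (set(per_view[v]) for v in VIEWS)
    for number in sorted(top ^ oblique):
        missing = "oblique" if number in top else "top"
        problems.append(("missing_view", missing, number))
    return problems

# Hypothetical readings, including a misread character ("S" for "5").
detections = [
    ("top", "12345"), ("oblique", "12345"),
    ("top", "67890"), ("top", "67890"),
    ("oblique", "1234S"),
    ("top", "99999"), ("oblique", "99999"),
]
problems = check_ocr_results(detections, known_numbers={"12345", "67890"})
for problem in problems:
    print(problem)
```

Flagged readings could be routed to manual review rather than corrected automatically, since a duplicate may reflect a real clerical error in the collection.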

One major payoff of this work is that the resulting dataset would be a large annotated set of expertly identified species images, which could then be used to train a model that identifies new images automatically. This would be incredibly useful for conservation and management, as "ability to identify mussel species" is a rare skill and there is high demand for ID expertise from state and federal wildlife agencies.

Automatic data curation and feature extraction of museum images of critically endangered mollusks #8