DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Presentation: SDR at SPNHC Conference #76

Closed by llivermore 2 years ago

llivermore commented 2 years ago

Deliver a presentation on the SDR at the June 2022 SPNHC conference in Edinburgh, in the session "Identifiers and labels in natural history collections: new technologies, challenges and opportunities for linking objects and data". Details TBC.

llivermore commented 2 years ago

Abstract

There are two main rate-limiting steps in the mass digitisation of natural history collections: 1) physical handling, i.e. the rate at which we can retrieve, select and prepare specimens for digitisation and then return them to collections; 2) the extraction of data from images, either of the specimen itself or of its labels, e.g. measurements, transcription and georeferencing.

Over the past three years we have been developing the Specimen Data Refinery (SDR) to dramatically scale up the automated extraction of data from specimen images in a way that conforms to the FAIR (Findable, Accessible, Interoperable and Reusable) principles. The SDR uses a series of machine learning models, packaged as modular tools, that perform semantic segmentation, optical character recognition, handwritten text recognition, barcode reading and natural language processing to identify labels, text lines and named entities.

We present the SDR and an evaluation of its use in automating the linkage between specimens, their UIDs, and related linked data such as taxonomy, people and geographic names. We will also discuss outstanding challenges and the potential for future development.

llivermore commented 2 years ago

Presentation available here: Livermore, Laurence; Brack, Paul; Scott, Ben; Woolland, Oliver (2022): Specimen Data Refinery: A novel approach to automating digitisation. figshare. https://doi.org/10.6084/m9.figshare.19947845.v2