Define NER classes

benscott commented 3 years ago

Provisional list:

taxon name
location
person
dates
male/female etc.

llivermore commented 3 years ago

Also

Identifiers (acquisition strings/numbers, collector numbers, barcode numbers)
Type status
Ecological relationship (edge case for some microscope slides of parasites/pests but these make up a lot of microscope slides)
Institution/collection code
Elevation (unlikely to get depth with specimens used in MVP)
Location coordinates (most common are degrees, minutes, seconds or decimal degrees)
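To illustrate the two coordinate notations mentioned in the last item, here is a minimal sketch that converts a degrees, minutes, seconds string to decimal degrees. The regular expression and the function name `dms_to_decimal` are assumptions made for illustration; real labels vary far more in punctuation and layout than this pattern covers.

```python
import re

# Hypothetical pattern for strings such as 51°30'0"N; seconds are optional.
DMS = re.compile(
    r"""(?P<deg>\d{1,3})\s*°\s*
        (?P<min>\d{1,2})\s*['′]\s*
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*)?
        (?P<hemi>[NSEW])""",
    re.VERBOSE,
)


def dms_to_decimal(text):
    """Return the first DMS coordinate in text as decimal degrees, or None."""
    m = DMS.search(text)
    if m is None:
        return None
    value = int(m["deg"]) + int(m["min"]) / 60 + float(m["sec"] or 0) / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -value if m["hemi"] in "SW" else value
```

A string already in decimal degrees would not match this pattern and would need its own, simpler check.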

NB: Many of these map to existing Darwin Core terms, but people map to multiple terms (identifiedBy, recordedBy, scientificNameAuthorship). A person name may also be the donor's name or a collection's name (when bulk acquired from a private collector).

llivermore commented 3 years ago

Following discussions with Teklia (CK) we can add, merge or delete NER classes as we go along. Still useful to have a rough list that fits with existing components and likely text outputs. Likely to use Label Studio for the entity labelling. Teklia considering integration between ArkIndex and Label Studio.

matdillen commented 3 years ago

Building on and synthesizing previous posts, I've bolded the ones that we might consider for the MVP:

Taxon name: Ranging from a family name to complex infraspecific names with lots of authorship abbreviations.
Location name or description: Various hierarchies are possible. Country is a key element and may be listed separately on the label from the rest of the locality description. This may also overlap with ecological descriptions (e.g. "forest", "meadow").
Person name: May also be a signature.
Date: Will be present in various formats and may be ambiguous. The year is probably the most recognizable element, as it usually consists of four digits (but not always) and, unlike the month, is rarely written as text or Roman numerals.
Identifier: Barcodes, accession numbers, field numbers, possibly ORCiDs. Can be very long (alpha)numeric strings and hence relatively unique, or very short numbers that are not unique at all.
Type status: May be indicated symbolically. May also be indicated in the form of a preprinted label from which the non-applicable terms have been stricken through.
Biological or ecological description: Includes relationships with other organisms, habitat types, organism descriptions, gender...
Institution/collection name or abbreviation: Often present in multiple places, as part of a stamp, of a barcode label, of a ruler and color chart...
Geographic coordinates / elevation / depth
Location code: e.g. the Belgian IFBL system. These codes are written in a somewhat standardized way on the label, but are often confused with other identifiers.
Vernacular name: May also be a loan word in the native language for the taxon name, e.g. "Ranunculacées".
Degree of establishment: e.g. wild, cultivated, naturalized
Agent action: Various Latin and other abbreviations exist to indicate that a person collected the specimen, determined it, illustrated it (or on it), curated it or distributed it to other researchers. These may be useful pointers to identify person names, but also the person's relationship to the specimen.
Provenance remarks: Some labels may describe the specimen's origins and how it came to be in its current (or a previous) location, e.g. that it was part of a specific private collection for a certain time.
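As a rough illustration of the Date entry in the list above, the sketch below picks out a day, a Roman-numeral month and a four-digit year. The pattern, the separator set and the function name `parse_roman_date` are assumptions for illustration only; it deliberately ignores the many other date formats found on labels.

```python
import re

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}

# Hypothetical pattern: day, Roman month, four-digit year, separated by
# dots, slashes, hyphens or spaces, e.g. "12.VII.1898" or "3 iv 1901".
ROMAN_DATE = re.compile(
    r"\b(\d{1,2})[.\s/-]+([IVX]{1,4})[.\s/-]+(\d{4})\b", re.IGNORECASE
)


def parse_roman_date(text):
    """Return (year, month, day) for the first match in text, or None."""
    m = ROMAN_DATE.search(text)
    if m is None:
        return None
    day, roman, year = m.groups()
    month = ROMAN_MONTHS.get(roman.upper())
    if month is None:  # e.g. "IIII" is not a valid month numeral
        return None
    return int(year), month, int(day)
```

The function does not check that the day is valid for the month, and it says nothing about which kind of date (collection, identification, accession) was found.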

Cubey0 commented 3 years ago

Is the plan to take our 200 specimens from "A benchmark dataset of herbarium specimen images with label data: Summary" https://zenodo.org/record/3697797#.YJvOi7VKiUl and run them through Label Studio with the entities in bold defined above, i.e. [image]

Or has this already been done in ICEDIG? (if so where is this segmentation data?)

Rob

emhaston commented 3 years ago

In terms of mapping the entities to MIDS and to DWC and ABCD, here are some thoughts.

This appears to need an iterative process potentially. For example:

Person name: Step one: identification of person names on the label. Step two: identification of the role of the person relating to the specimen, e.g. collector or identifier (these are often preceded by a controlled term and are often on separate and distinct labels).

In herbarium specimens, it may be possible to identify the role of the person whose name is on the label. The main ones would be a collector or an identifier. These are often on separate labels, and are often linked with another text string such as Coll., Leg. or Det. One question is at what stage we may want to identify these different roles. This may influence which entities we want available.

The same differentiation could be applied to dates: whether it is the date of collection, of an identification or of a sampling event.

matdillen commented 3 years ago

Is the plan to take our 200 specimens from "A benchmark dataset of herbarium specimen images with label data: Summary" https://zenodo.org/record/3697797#.YJvOi7VKiUl and run them through Label Studio with the entities in bold defined above, i.e. [image]

Or has this already been done in ICEDIG? (if so where is this segmentation data?)

Rob

250 of the 1800 have been processed, but will need some validation to conform to Teklia's applications. There can be multiple entities for a single text line, text lines can be quite long and there is no support for rotated boxes. Those data can be found at https://github.com/matdillen/sdr-datasets/tree/main/herbarium

I've annotated them with the six bold entities I listed in my earlier post here, based on inference from Darwin Core data available for these specimens on GBIF. The entities are added as six extra properties of the goldstandard, and they have an x if they're probably present in the text line. The Darwin Core properties are also in the data file (annotated-properties-v2.json) under note.

I forked the repo because I had no git access to the dissco one.

matdillen commented 3 years ago

In terms of mapping the entities to MIDS and to DWC and ABCD, here are some thoughts.

This appears to need an iterative process potentially. For example:

Person name: Step one: identification of person names on the label. Step two: identification of the role of the person relating to the specimen, e.g. collector or identifier (these are often preceded by a controlled term and are often on separate and distinct labels).

Yes, and I think it will be three steps (on top of segmenting text lines and transcribing them):

Identify text as a person name
Link a person name to a PID (ORCiD, Wikidata QID...)
Link a person name to an action (as we've been discussing it in the Agent Attribution IG)
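A minimal sketch of how the first and third steps might fit together, using the Coll., Leg. and Det. markers mentioned in this thread. The `PersonMention` class, the `tag_person` function, the marker pattern and the choice of Darwin Core terms as role values are all assumptions for illustration. Step one is only approximated here (everything after the marker is treated as the name), and step two, linking to a PID, is left as an empty field.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from label abbreviations to Darwin Core terms.
ROLE_MARKERS = {"coll": "recordedBy", "leg": "recordedBy", "det": "identifiedBy"}

# The abbreviation must be followed by a dot, colon or whitespace, so that
# words such as "collection" are not mistaken for a marker.
MARKER = re.compile(r"\b(coll|leg|det)(?:\.|:|\s)\s*:?\s*(.+)", re.IGNORECASE)


@dataclass
class PersonMention:
    name: str                   # step 1: text identified as a person name
    role: Optional[str] = None  # step 3: action linked to the person
    pid: Optional[str] = None   # step 2: ORCiD, Wikidata QID... (not resolved here)


def tag_person(line):
    """Return a PersonMention if the line starts a role-marked name, else None."""
    m = MARKER.search(line)
    if m is None:
        return None
    role = ROLE_MARKERS[m.group(1).lower()]
    return PersonMention(name=m.group(2).strip(), role=role)
```

A name with no marker returns None here, which is exactly the case where a separate person-name entity is still needed before any role can be assigned.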

In herbarium specimens, it may be possible to identify the role of the person whose name is on the label. The main ones would be a collector or an identifier. These are often on separate labels, and are often linked with another text string such as Coll., Leg. or Det. One question is at what stage we may want to identify these different roles. This may influence which entities we want available.

The same differentiation could be applied to dates: whether it is the date of collection, of an identification or of a sampling event.

We would need a vocabulary for dates similar to what we've been building for Agents. There are dates for collection, for identification, for accession, for scanning, sampling and possibly more.

llivermore commented 2 years ago

For labelling the pinned insect dataset (https://github.com/DiSSCo/SDR/issues/2) text we used 11 classes:

Collection/Donation
Date
Determination
Expedition
Identifier
Location name
Person name
Sampling Protocol
Sex
Taxon name
Type status
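For reference, the 11 classes above could be turned into a Label Studio labelling configuration along these lines. This is only a sketch: it assumes Label Studio is the labelling tool (as suggested earlier in this thread), and the function name `labeling_config`, the control name `label` and the `$text` task field are illustrative choices, not taken from the project.

```python
import xml.etree.ElementTree as ET

NER_CLASSES = [
    "Collection/Donation", "Date", "Determination", "Expedition",
    "Identifier", "Location name", "Person name", "Sampling Protocol",
    "Sex", "Taxon name", "Type status",
]


def labeling_config(classes):
    """Build a Label Studio NER config (View > Labels + Text) as an XML string."""
    view = ET.Element("View")
    labels = ET.SubElement(view, "Labels", name="label", toName="text")
    for cls in classes:
        ET.SubElement(labels, "Label", value=cls)
    # "$text" refers to a "text" field in each imported task (assumed name).
    ET.SubElement(view, "Text", name="text", value="$text")
    return ET.tostring(view, encoding="unicode")


config_xml = labeling_config(NER_CLASSES)
```

Keeping the class list in one place makes it easy to add, merge or delete classes and regenerate the configuration.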

We may need additional classes and/or to refine these classes once we do additional testing using the reconciliation/text processing tools (e.g., #6, #89, #90).

DiSSCo / SDR

Define NER classes #4