DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Define NER classes #4

Closed benscott closed 2 years ago

benscott commented 3 years ago

Provisional list:

llivermore commented 3 years ago

Also

NB: A lot of these map to existing Darwin Core terms, but person names map to multiple terms (identifiedBy, recordedBy, scientificNameAuthorship). A person name could also be the donor's name or a collection's name (when bulk acquired from a private collector).
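
The one-to-many mapping noted above could be sketched as a simple lookup. This is only an illustration: the class key `person_name` and the helper `candidate_terms` are hypothetical names, and only the three Darwin Core terms come from the comment itself.

```python
from __future__ import annotations

# Hypothetical lookup from an NER class to the Darwin Core terms it might
# map to. Only the three person-related terms are taken from the thread;
# the key "person_name" is an assumed class name.
NER_TO_DWC: dict[str, tuple[str, ...]] = {
    "person_name": ("identifiedBy", "recordedBy", "scientificNameAuthorship"),
}


def candidate_terms(ner_class: str) -> tuple[str, ...]:
    """Return every Darwin Core term an NER class could map to."""
    return NER_TO_DWC.get(ner_class, ())


print(candidate_terms("person_name"))
```

A single class returning several candidate terms is the reason a later disambiguation step (for example, by role) would be needed.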

llivermore commented 3 years ago

Following discussions with Teklia (CK) we can add, merge or delete NER classes as we go along. Still useful to have a rough list that fits with existing components and likely text outputs. Likely to use Label Studio for the entity labelling. Teklia considering integration between ArkIndex and Label Studio.

matdillen commented 3 years ago

Building and synthesizing previous posts, I've bolded the ones that we might consider for the MVP:

Cubey0 commented 3 years ago

Is the plan to take our 200 specimens from "A benchmark dataset of herbarium specimen images with label data: Summary" https://zenodo.org/record/3697797#.YJvOi7VKiUl and run them through Label Studio with the bolded entities defined above, i.e. [image]

Or has this already been done in ICEDIG? (if so where is this segmentation data?)

Rob

emhaston commented 3 years ago

In terms of mapping the entities to MIDS and to DWC and ABCD, here are some thoughts.

This potentially needs an iterative process. For example:

Person name: Step one: identification of person names on the label. Step two: identification of the role of the person relating to the specimen, e.g. collector or identifier (these are often preceded by a controlled term and are often on separate and distinct labels).

In herbarium specimens, it may be possible to identify the role of the person whose name is on the label. The main ones would be a collector or an identifier. These are often on separate labels, and are often linked with another text string such as Coll., Leg. or Det. One question is at what stage we may want to identify these different roles. This may influence which entities we want available.

The same differentiation could be applied to dates: whether it is the date of collection, the date of an identification or the date of a sampling event.
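
The two-step idea for person names could be sketched roughly as below, assuming the role is guessed from a cue string on the same text line. The cue strings Coll., Leg. and Det. come from the comment above; the role labels, `ROLE_CUES` and `guess_role` are hypothetical names for illustration, and real labels would need a much larger controlled vocabulary.

```python
import re
from typing import Optional

# Cue strings mentioned in the thread, mapped to illustrative role labels.
ROLE_CUES = {
    "coll.": "collector",
    "leg.": "collector",
    "det.": "identifier",
}

# Match a cue only as a whole abbreviation followed by a full stop.
CUE_PATTERN = re.compile(r"\b(coll|leg|det)\.", re.IGNORECASE)


def guess_role(text_line: str) -> Optional[str]:
    """Return a role label if the line contains a known cue, else None."""
    match = CUE_PATTERN.search(text_line)
    if match is None:
        return None
    return ROLE_CUES[match.group(0).lower()]


print(guess_role("Leg. J. Smith"))
print(guess_role("det. A. Jones 1998"))
```

Returning `None` when no cue is present keeps step one (finding the name) independent of step two (assigning a role), so the role can be filled in at a later stage.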

matdillen commented 3 years ago

> Is the plan to take our 200 specimens from "A benchmark dataset of herbarium specimen images with label data: Summary" https://zenodo.org/record/3697797#.YJvOi7VKiUl and run them through Label Studio with the bolded entities defined above, i.e. [image]
>
> Or has this already been done in ICEDIG? (if so where is this segmentation data?)
>
> Rob

250 of the 1800 have been processed, but they will need some validation to conform to Teklia's applications: there can be multiple entities for a single text line, text lines can be quite long, and there is no support for rotated boxes. The data can be found here: https://github.com/matdillen/sdr-datasets/tree/main/herbarium

I've annotated them with the six bold entities I listed in my earlier post, based on inference from the Darwin Core data available for these specimens on GBIF. The entities are added as six extra properties of the goldstandard, and each has an x if the entity is probably present in the text line. The Darwin Core properties are also in the data file (annotated-properties-v2.json) under note.

I forked the repo because I had no git access to the dissco one.
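
As a rough sketch of how the "x" flags described above might be read back, assuming each text line is available as a dictionary of string properties. The actual structure of annotated-properties-v2.json is not shown in this thread, so the property names `entity_a` and `entity_b` and the helper `flagged_entities` are placeholders, not the real keys.

```python
from typing import Dict, List

# Placeholder property names: the real file uses six entity properties
# whose exact keys are not given in this thread.
ENTITY_KEYS = ["entity_a", "entity_b"]


def flagged_entities(line: Dict[str, str], entity_keys: List[str]) -> List[str]:
    """List the entity properties marked with an 'x' for one text line."""
    return [
        key
        for key in entity_keys
        if line.get(key, "").strip().lower() == "x"
    ]


example_line = {"text": "Leg. J. Smith", "entity_a": "x", "entity_b": ""}
print(flagged_entities(example_line, ENTITY_KEYS))
```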

matdillen commented 3 years ago

> In terms of mapping the entities to MIDS and to DWC and ABCD, here are some thoughts.
>
> This potentially needs an iterative process. For example:
>
> Person name: Step one: identification of person names on the label. Step two: identification of the role of the person relating to the specimen, e.g. collector or identifier (these are often preceded by a controlled term and are often on separate and distinct labels).

Yes, and I think it will be three steps (on top of segmenting text lines and transcribing them):

> In herbarium specimens, it may be possible to identify the role of the person whose name is on the label. The main ones would be a collector or an identifier. These are often on separate labels, and are often linked with another text string such as Coll., Leg. or Det. One question is at what stage we may want to identify these different roles. This may influence which entities we want available.
>
> The same differentiation could be applied to dates: whether it is the date of collection, the date of an identification or the date of a sampling event.

We would need a vocabulary for dates similar to what we've been building for Agents. There are dates for collection, for identification, for accession, for scanning, sampling and possibly more.
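
A date vocabulary along these lines could start as a small enumeration. This is a minimal sketch using only the date types named above; `DateRole`, `DateEntity` and the `UNKNOWN` fallback are hypothetical names, and an agreed vocabulary would likely differ.

```python
from dataclasses import dataclass
from enum import Enum


class DateRole(Enum):
    """Illustrative roles for a date found on a specimen label."""

    COLLECTION = "collection"
    IDENTIFICATION = "identification"
    ACCESSION = "accession"
    SCANNING = "scanning"
    SAMPLING = "sampling"
    UNKNOWN = "unknown"  # assumed fallback until a role is assigned


@dataclass(frozen=True)
class DateEntity:
    """A date string as transcribed, plus the role it plays."""

    text: str
    role: DateRole = DateRole.UNKNOWN


first_pass = DateEntity("12 May 1903")
refined = DateEntity(first_pass.text, DateRole.COLLECTION)
print(first_pass.role.value, refined.role.value)
```

Defaulting to `UNKNOWN` lets a first pass tag every date generically and a later pass refine the role, mirroring the stepwise approach discussed for person names.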

llivermore commented 2 years ago

For labelling the text of the pinned insect dataset (https://github.com/DiSSCo/SDR/issues/2) we used 11 classes:

  1. Collection/Donation
  2. Date
  3. Determination
  4. Expedition
  5. Identifier
  6. Location name
  7. Person name
  8. Sampling Protocol
  9. Sex
  10. Taxon name
  11. Type status

We may need additional classes and/or to refine these classes once we do additional testing using the reconciliation/text processing tools (e.g., #6, #89, #90).
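
If these classes were loaded into Label Studio (mentioned earlier in the thread as the likely labelling tool), a labelling configuration could be generated roughly as follows. This is a sketch, not the setup actually used for the pinned insect dataset: the helper `build_labeling_config` and the `label`/`text` names are assumptions, following the shape of Label Studio's standard named entity recognition template.

```python
import xml.etree.ElementTree as ET

# The 11 classes listed above.
NER_CLASSES = [
    "Collection/Donation",
    "Date",
    "Determination",
    "Expedition",
    "Identifier",
    "Location name",
    "Person name",
    "Sampling Protocol",
    "Sex",
    "Taxon name",
    "Type status",
]


def build_labeling_config(classes):
    """Build a Label Studio style NER config with one <Label> per class."""
    view = ET.Element("View")
    labels = ET.SubElement(view, "Labels", {"name": "label", "toName": "text"})
    for cls in classes:
        ET.SubElement(labels, "Label", {"value": cls})
    # "$text" refers to the task field that holds the transcribed text.
    ET.SubElement(view, "Text", {"name": "text", "value": "$text"})
    return ET.tostring(view, encoding="unicode")


print(build_labeling_config(NER_CLASSES))
```

Generating the configuration from a single list keeps the class names in one place, which helps if classes are added, merged or deleted later.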