BritishGeologicalSurvey / geo-ner-model

Stratigraphic Named Entity Recognition with Stanford CoreNLP
Creative Commons Attribution Share Alike 4.0 International

Tagging format #1

Open blester125 opened 4 years ago

blester125 commented 4 years ago

In most NER datasets there is some sort of span labeling scheme where prefixes like B- or I- are used to separate mentions of the same type that are adjacent.

In the data it looks like there isn't a span labeling scheme used.

the         O
Mearns      LEXICON
Glacigenic  LEXICON
Subgroup    LEXICON
of          O

Are there no mentions in the datasets that touch, or am I missing some strategy that delimits them?

jeromemassot commented 4 years ago

Hi blester125, in fact the B- and I- prefixes relate to n-grams, i.e. cases where a single entity is made up of several tokens. However, if two distinct entities with the same label follow each other, each of them should have been tagged with the B- prefix.

So I can understand why this notation has not been reproduced in the lexicon, which is only a glossary.

Mapping the lexicon entries to the B- and I- notation is quite easy: for each entry, split the term on spaces, prefix the first token with B- and each following token with I-.
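The mapping described above could be sketched roughly as follows. This is only an illustration, not code from the repository: the function name `lexicon_entry_to_bio` is hypothetical, and the default label `LEXICON` is taken from the sample in the question.

```python
def lexicon_entry_to_bio(term, label="LEXICON"):
    """Turn one lexicon entry into (token, tag) pairs with B-/I- prefixes.

    Hypothetical helper: the name and the default label are assumptions
    made for illustration only.
    """
    # Split on whitespace, as suggested above.
    tokens = term.split()
    # The first token gets the B- prefix, every later token gets I-.
    return [
        (token, ("B-" if index == 0 else "I-") + label)
        for index, token in enumerate(tokens)
    ]


if __name__ == "__main__":
    for token, tag in lexicon_entry_to_bio("Mearns Glacigenic Subgroup"):
        print(f"{token}\t{tag}")
```

Note that this only covers glossary entries; the annotated sentences would still need some rule for deciding where one mention ends and the next begins.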

Best regards Jerome

metazool commented 3 years ago

Thank you, I missed this discussion. The annotation format is the one CoreNLP suggests here:

https://stanfordnlp.github.io/CoreNLP/ner.html#training-or-retraining-new-models

There must be an assumption that tagged tokens, if not separated by an O-tagged token, are part of one contiguous entity. That is the assumption made by the brat JavaScript renderer in the CoreNLP server's visual output. It won't always hold semantically, will it? I've never looked into the underlying CRF model.
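To make that assumption concrete, here is a minimal sketch of how un-prefixed tags collapse into entities when every run of consecutive tokens with the same non-O label is treated as one mention. It is not how CoreNLP or brat is actually implemented; `group_contiguous` is a hypothetical name, and the input is assumed to be a list of (token, label) pairs.

```python
from itertools import groupby


def group_contiguous(tagged, outside="O"):
    """Merge each run of consecutive same-label tokens into one mention.

    Hypothetical helper illustrating the contiguity assumption only;
    `tagged` is assumed to be a list of (token, label) pairs.
    """
    mentions = []
    # groupby yields one group per run of equal consecutive labels.
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != outside:
            text = " ".join(token for token, _ in run)
            mentions.append((text, label))
    return mentions


if __name__ == "__main__":
    sentence = [
        ("the", "O"),
        ("Mearns", "LEXICON"),
        ("Glacigenic", "LEXICON"),
        ("Subgroup", "LEXICON"),
        ("of", "O"),
    ]
    print(group_contiguous(sentence))
```

Under this rule two distinct LEXICON mentions that happen to be adjacent would be merged into a single span, which is exactly the case the B- and I- prefixes are meant to disambiguate.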

I would be glad to hear of alternative, more sophisticated approaches!

metazool commented 3 years ago

As for the source references from which the annotated sentences were extracted (during an unrelated project in the early 2000s), many but not all of them are available as JP2 scans under an Open Government Licence. The list of sources is here: https://github.com/BritishGeologicalSurvey/geo-ner-model/blob/main/REFERENCES.md