flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Fine-tuning transformer-based models and IOB2 format #2764

Closed matirojasg closed 1 year ago

matirojasg commented 2 years ago

Hi, I had a question about the NER task and the transformer-based models. We know that when fine-tuning with a linear classification layer on top, classification is performed at the token level rather than the span level. This means that model outputs will not necessarily follow the IOB2 format (I: Inside, O: Outside, B: Beginning). I would like to know what happens, for example, when a token is split into several subtokens. Is the label of the original token copied onto each subtoken? If so, there will clearly be more entities than in the original annotation.

What happens when the final metrics are calculated? Is the label of the first subtoken taken, and will that be considered as the label of the original word?

I hope my question is clear :)

Example:

Original labels (1 entity):

Colon B-Disease
Cancer I-Disease

Word Piece labels (2 entities):

Co B-Disease
lon B-Disease
Cancer I-Disease

If the prediction is the following:

Co B-Disease
lon I-Disease
Cancer I-Disease

Is that considered a true positive or false negative?

helpmefindaname commented 2 years ago

The predictions are made at the token level. You can choose how the embedding of a token is aggregated when it consists of several subtokens by setting the `subtoken_pooling` parameter to `first`, `last`, `first_last`, or `mean`. The default is `first`.
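For illustration, here is a minimal sketch of how that parameter can be set when creating transformer embeddings in flair (the model name `bert-base-uncased` is just an example checkpoint):

```python
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# "first" pooling (the default): a token that is split into several subtokens
# is represented by the embedding of its first subtoken only.
embeddings = TransformerWordEmbeddings(
    "bert-base-uncased",        # example model; any Hugging Face checkpoint works
    subtoken_pooling="first",   # alternatives: "last", "first_last", "mean"
    fine_tune=True,
)

sentence = Sentence("Colon Cancer")
embeddings.embed(sentence)

# One embedding per original token, no matter how many subtokens it was split
# into, so the tagger predicts exactly one label per token.
for token in sentence:
    print(token.text, token.embedding.shape)
```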

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.