CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Clarify NER-related background material in Analyze_Model_Outputs.ipynb #114

Open frreiss opened 4 years ago

frreiss commented 4 years ago

In the notebook notebooks/Analyze_Model_Outputs.ipynb, some of the terminology may be unfamiliar to newcomers to NLP. In particular, the following paragraph could use a gentler introduction to the concepts of named entity recognition and token-level error rate:

IOB2 format is a convenient way to represent a corpus, but it is a less useful representation for analyzing the result quality of named entity recognition models. Most tokens in a typical NER corpus will be tagged O; any measure of error rate in terms of tokens will over-emphasize the tokens that are part of entities. Token-level error rate implicitly assigns higher weight to named entity mentions that consist of multiple tokens, further unbalancing error metrics. And most crucially, a naive comparison of IOB tags can result in marking an incorrect answer as correct. Consider a case where the correct sequence of labels is B, B, I but the model has output B, I, I; in this case, the last two tokens of the model output are both incorrect (the model has assigned them to the same entity as the first token), but a naive token-level comparison will consider the last token to be correct.
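
The B, B, I versus B, I, I case from that paragraph could be made concrete for newcomers with a small example. The following is a minimal sketch in plain Python, not the notebook's code and not the library's API; the helper name iob2_to_spans is hypothetical, and entity types are omitted so that tags are just B, I, and O. It compares the two tag sequences once token by token and once as entity spans.

```python
def iob2_to_spans(tags):
    """Convert a list of untyped IOB2 tags into a set of (begin, end) token
    spans, with `end` exclusive. Hypothetical helper for illustration only."""
    spans = set()
    start = None  # index of the first token of the entity currently open
    for i, tag in enumerate(tags):
        if tag == "B":
            # A B tag closes any open entity and starts a new one.
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O":
            # An O tag closes any open entity.
            if start is not None:
                spans.add((start, i))
            start = None
        # An I tag extends the open entity. A stray I with no open entity
        # is ignored in this sketch.
    if start is not None:
        spans.add((start, len(tags)))
    return spans


gold = ["B", "B", "I"]  # two entities: token 0, and tokens 1-2
pred = ["B", "I", "I"]  # one entity covering tokens 0-2

# Naive token-level comparison: tag equality at each position.
token_matches = [g == p for g, p in zip(gold, pred)]

# Span-level comparison: an entity counts only if both ends agree.
gold_spans = iob2_to_spans(gold)
pred_spans = iob2_to_spans(pred)
correct_spans = gold_spans & pred_spans

print(token_matches)   # [True, False, True]
print(gold_spans)      # the spans (0, 1) and (1, 3)
print(pred_spans)      # the single span (0, 3)
print(correct_spans)   # set()
```

The token-level comparison reports the last token as a match, even though the model placed it in the wrong entity. At the span level, none of the predicted entities matches a correct entity, which is the answer one would expect.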

We should add more Markdown text to this notebook in two places: