Franck-Dernoncourt / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
MIT License

Drastic improvement using word embeddings (+20%) - explanation? #129

Open svanhvitlilja opened 5 years ago

svanhvitlilja commented 5 years ago

This is not a software issue; we're just wondering whether anyone can shed some light on the results we're seeing.

We've been working on an Icelandic named entity recognizer using NeuroNER. Our training corpus contains 200,000 tokens, of which around 7,000 are named entities.

We are seeing a huge improvement from incorporating external word embeddings: F1 goes from 61% (without word embeddings) to 81%.

This is great news, but we would like to understand why this is happening. Has anyone here experienced such a big jump in performance when incorporating word embeddings?

I'm wondering whether the fact that Icelandic is a morphologically complex language explains why the word embeddings work so well. Our first experiment used word embeddings trained on 500,000 Icelandic words and gave F1 = 75%. We then trained word embeddings on 500,000,000 words, and F1 went up to 81%.
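For reference, here is a minimal sketch of how such external embeddings can be trained and exported as a plain-text vector file. This is not our exact pipeline; the corpus path and hyperparameters are illustrative only, and the parameter names assume gensim 4.x.

```python
# Minimal sketch: train word2vec embeddings on a tokenized corpus and export
# them as a plain-text vector file. Corpus path and hyperparameters are
# illustrative, not the actual Icelandic setup.
from gensim.models import Word2Vec

class SentenceIterator:
    """Yields one tokenized sentence (list of tokens) per line of the corpus file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

sentences = SentenceIterator("icelandic_corpus.txt")  # hypothetical file
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Writes "token v1 v2 ... v100" lines (plain-text embedding format).
model.wv.save_word2vec_format("icelandic_vectors.txt", binary=False)
```

Note that `save_word2vec_format` writes a `count dim` header on the first line; depending on the embedding file format the NER tool expects, that header line may need to be stripped.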

Looking for ideas, thoughts, or stories from anyone who has tried NeuroNER with and without word embeddings.

Best regards!

kaushikacharya commented 4 years ago

@svanhviti16 My understanding is that the improvement is due to the character embeddings.

The paper (https://arxiv.org/abs/1606.03475) mentions:

While the token embeddings capture the semantics of tokens to some degree, they may still suffer from data sparsity. For example, they cannot account for out-of-vocabulary tokens, misspellings, and different noun forms or verb endings.

We address this issue by using character-based token embeddings, which incorporate each individual character of a token to generate its vector representation. This approach enables the model to learn sub-token patterns such as morphemes (e.g., suffix or prefix) and roots, thereby capturing out-of-vocabulary tokens, different surface forms, and other information not contained in the token embeddings.
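To make the character-embedding idea concrete, here is a generic PyTorch sketch (not NeuroNER's actual TensorFlow implementation; all dimensions are illustrative): each token's character sequence is run through a bidirectional LSTM and the final states are concatenated with the token-level embedding.

```python
# Generic sketch of character-based token embeddings: a char BiLSTM per token,
# concatenated with the token embedding. Not NeuroNER's actual code.
import torch
import torch.nn as nn

class CharTokenEmbedder(nn.Module):
    def __init__(self, n_chars, n_tokens, char_dim=25, char_lstm_dim=25, token_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_lstm_dim,
                                 batch_first=True, bidirectional=True)
        self.token_emb = nn.Embedding(n_tokens, token_dim, padding_idx=0)

    def forward(self, char_ids, token_ids):
        # char_ids: (n_tokens_in_sentence, max_chars), token_ids: (n_tokens_in_sentence,)
        char_vecs = self.char_emb(char_ids)
        _, (h_n, _) = self.char_lstm(char_vecs)          # h_n: (2, n_tokens, char_lstm_dim)
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward + backward final states
        token_repr = self.token_emb(token_ids)
        # Concatenating both views means rare or out-of-vocabulary inflected
        # forms still get a useful vector from their characters.
        return torch.cat([token_repr, char_repr], dim=-1)

# Example: 3 tokens, up to 8 characters each (ids are arbitrary).
embedder = CharTokenEmbedder(n_chars=60, n_tokens=5000)
chars = torch.randint(1, 60, (3, 8))
tokens = torch.tensor([17, 42, 3])
print(embedder(chars, tokens).shape)  # torch.Size([3, 150])
```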

This is also shown in their ablation analysis (see the ablation table in the paper).

Removing the character embeddings results in a significant drop in the model's performance.

This is likely the reason for such a big improvement on a morphologically rich language like Icelandic.

Alternative approach:

There is another approach that makes use of subword information: https://arxiv.org/abs/1607.04606 (fastText). There, a vector embedding is created for each character n-gram (subword), and the word embedding is computed as the sum of these subword embeddings; a tiny sketch of this idea follows the quote below.

Quoting from the paper:

Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.
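Here is an illustrative sketch of the subword idea from that paper, not the library's real implementation: a word vector is the sum of the vectors of its character n-grams, so unseen inflected forms share subword vectors with forms observed during training. The bucket count, n-gram range, and random vectors below are assumptions made for the example.

```python
# Illustrative fastText-style subword composition: a word vector is the sum of
# its character n-gram vectors (the paper also adds a whole-word vector, which
# is omitted here for brevity). The n-gram table here is random, standing in
# for trained vectors; fastText's default bucket count is much larger (~2M).
import numpy as np

DIM, BUCKETS = 100, 50_000
ngram_vectors = np.random.default_rng(0).normal(size=(BUCKETS, DIM))

def char_ngrams(word, min_n=3, max_n=6):
    w = f"<{word}>"  # boundary symbols, as in the paper
    return [w[i:i + n] for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    grams = char_ngrams(word)
    idx = [hash(g) % BUCKETS for g in grams]  # hashing trick for the n-gram table
    return ngram_vectors[idx].sum(axis=0)

# Two inflected forms of the same lemma share many n-grams, so their vectors
# end up close even if one of the forms is rare in the corpus.
print(char_ngrams("hestur")[:5])
print(word_vector("hestur").shape)  # (100,)
```

In practice one would use the fastText library itself (or gensim's FastText wrapper) rather than this sketch, since those train the n-gram vectors jointly with the word vectors.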