I really don't know, @matt-peters do you have any ideas?
You can't compare perplexity across language models with different vocabularies, as the data will be different. Your improvement seems consistent with what we've seen with domain-specific ELMo models too. I'm not sure we'll be able to help here, but I don't think there's too much to help with, as it sounds like your newer domain model is working well. Hope that's moderately helpful!
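To make the perplexity point concrete (a minimal sketch, not from the thread): perplexity is the exponential of the average per-token negative log-likelihood, and changing the vocabulary changes both the tokenization and the `<UNK>` rate that the average is taken over, so two models with different vocabularies are scored on different prediction problems. The numbers below are purely illustrative, chosen only to echo the two reported perplexities:

```python
import math

def perplexity(total_neg_log_likelihood, n_tokens):
    # Perplexity = exp(average per-token negative log-likelihood).
    return math.exp(total_neg_log_likelihood / n_tokens)

# Same corpus, two vocabularies: the token counts and <UNK> rates differ,
# so the two averages are taken over different prediction events and the
# resulting perplexities are not directly comparable.
print(perplexity(total_neg_log_likelihood=6.63e9, n_tokens=1.0e9))  # ~758
print(perplexity(total_neg_log_likelihood=2.45e9, n_tokens=1.0e9))  # ~11.6
```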
Hi @DeNeutoy , thanks for the reply!
What is disturbing me is that I didn't get any improvement on the NER downstream task after "perfecting" the language model.
Apparently there isn't a direct relation between language model perplexity and downstream task performance, but shouldn't I be getting at least a small improvement in my NER model from using an ELMo trained on such a larger corpus (4x larger), with such a better perplexity?
Thanks!
Maybe yes, maybe no. It all depends on how well the originally trained model fit your domain, how well the unsupervised training corpus matches the specific data in your NER dataset, and how well the LM representations transfer to your task. 1.6B tokens is still a large-scale dataset for language models, so I wouldn't expect a huge improvement when moving to 6B tokens. Here's a data point that shows moving from 1.1 to 4.5B tokens only gives a marginal improvement on GLUE for a transformer with an objective similar to ELMo: https://arxiv.org/pdf/1903.07785.pdf
If you haven't already you could try training for longer on the large corpus as we'd expect to be able to take more gradient steps before beginning to overfit.
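As a concrete illustration of the "train for longer" suggestion (a sketch only, assuming the `options` dict in bilm-tf's `bin/train_elmo.py`; the exact key names and defaults may differ between versions of the repo), the relevant knobs are the number of epochs and the token count used to size an epoch:

```python
# Sketch of the training-length knobs in bilm-tf's bin/train_elmo.py
# (assumed key names; check your copy of the script). The idea is to scale
# the number of gradient steps to the larger corpus instead of keeping the
# defaults that were sized for the 1B-word benchmark.
options = {
    # ... architecture options (char_cnn, lstm, dropout, ...) left as-is ...
    'n_epochs': 10,                    # raise to train longer before overfitting
    'n_train_tokens': 6_000_000_000,   # set to the actual size of the new corpus
    'batch_size': 128,
    'n_tokens_vocab': 774_340,         # must match the filtered vocabulary file
}
```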
I see, thanks @matt-peters!
> how well the unsupervised training corpus matches the specific data in your NER dataset,
The set of documents annotated for the NER task is a very small subset of the same types of documents used for the language model training. These documents are judges' sentences, decisions, and minutes of court hearings.
> and how well the LM representations transfer to your task
With either of the two evaluated ELMos, I've noticed that fine-tuning on the NER corpus doesn't make any difference. Maybe that's because of the point above, i.e. the documents are so strongly related, right?
> If you haven't already you could try training for longer on the large corpus as we'd expect to be able to take more gradient steps before beginning to overfit.
I see, I'll try this!
What about the vocabulary? You once told me that you were just filtering out words that occurred fewer than 3 times. Qiao Jin from BioELMo told me he was just using the 1M most frequent words from his corpus. I was under the impression that working on the vocabulary with a different type of pre-processing, in addition to the 1-billion-word setup, could somehow improve the language model and help the NER task, but maybe this would take me in another direction, such as using the transformer models that use BPE vocabularies, right?
You probably don't need a huge vocabulary; a few hundred thousand tokens should be sufficient, and perhaps fewer, although I don't recall seeing any papers that have looked specifically at this question.
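For completeness, here is a minimal sketch of the frequency-based filtering discussed above (keep tokens seen at least `min_count` times, cap the list at the `max_vocab` most frequent). The only bilm-tf-specific detail is the file layout, with `<S>`, `</S>` and `<UNK>` on the first lines followed by tokens in descending frequency, as described in the repo's README; the helper itself and the file names are just illustrative:

```python
from collections import Counter

def build_vocab(corpus_paths, out_path, min_count=3, max_vocab=800_000):
    """Count whitespace tokens and write a bilm-tf style vocabulary file:
    <S>, </S>, <UNK> first, then tokens in descending frequency."""
    counts = Counter()
    for path in corpus_paths:
        with open(path, encoding='utf-8') as f:
            for line in f:
                counts.update(line.split())

    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_count]

    with open(out_path, 'w', encoding='utf-8') as out:
        for tok in ['<S>', '</S>', '<UNK>'] + kept:
            out.write(tok + '\n')

# Example (hypothetical file names):
# build_vocab(['legal_corpus.txt'], 'legal_vocab.txt', min_count=3)
```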
Question
I'm training ELMo on a legal domain, for the Portuguese language. I trained a first version of the model a few months ago, but without filtering the vocabulary size. I used a corpus of 1.6 billion tokens and ended up with a vocabulary of 2 million words and a very high perplexity of 760.
Recently I did the training again, using an extended corpus of 6 billion tokens, filtered the vocabulary keeping only words that occurred at least 3 times, and eliminated some other 'trash' tokens (numbers, currency values, document ids, etc.). I ended up with a final vocabulary of 774,340 words and a new perplexity of 11.64.
Although I was able to improve the model's perplexity, my evaluation of ELMo on a NER task (also focused on the legal domain) didn't improve much. The average F-score I got using both versions is pretty much the same. I am only able to verify an improvement when comparing against the general-domain ELMo: my average F-score using the general-domain ELMo is 86%, and my average using the legal ELMo is 88%.
I think I should be able to verify an improvement on the NER task, since I did a better cleaning of the vocabulary and ended up using a corpus almost 4 times larger. I used the default parameters from the bilm-tf repository.
Does it make any sense that the first model performs the same as the second, considering the "improved" preprocessing the second time? How could I analyze this better?
Thanks!