Open Dragon615 opened 5 years ago
Using these examples with ELMo should be straightforward.
Supervision at the same level: Update the datasets variable (for example in Train_Chunking.py) like this:
```python
datasets = {
    'unidep_pos':
        {'columns': {1: 'tokens', 3: 'POS'},
         'label': 'POS',
         'evaluate': True,
         'commentSymbol': None},
    'conll2000_chunking':
        {'columns': {0: 'tokens', 2: 'chunk_BIO'},
         'label': 'chunk_BIO',
         'evaluate': True,
         'commentSymbol': None},
}
```
Supervision at different levels: Besides updating the datasets variable, you also need to update the params variable, for example like this:
```python
params = {'classifier': ['CRF'], 'LSTM-Size': [100], 'dropout': (0.25, 0.25),
          'customClassifier': {'unidep_pos': ['Softmax'],
                               'conll2000_chunking': [('LSTM', 50), 'CRF']}}
```
Hi nreimers,
Can I use this framework with BERT? if not, are there any possible changes that I can add to the current framework to get BERT to work?
Thanks in advance for your response.
Hi @Dragon615, replacing ELMo with BERT is not trivial in this framework. BERT works on sub-token units, i.e., the name 'San Francisco' might be split into the tokens 'San', 'Fran', '##cisco'. For sequence tagging this is non-trivial, as there is no tag for '##cisco'. In the BERT paper, they assigned a special tag X to these sub-tokens.
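The alignment problem described above can be sketched in a few lines. This is a toy illustration, not the framework's code: the `toy_vocab` lookup stands in for a real BERT WordPiece tokenizer, which would produce similar '##'-prefixed continuation pieces.

```python
def align_labels(words, labels, tokenize):
    """Expand word-level tags to sub-token level, assigning the
    special tag 'X' to continuation pieces, as in the BERT paper."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        sub_tokens.extend(pieces)
        # the first piece keeps the original tag, the rest get 'X'
        sub_labels.extend([label] + ['X'] * (len(pieces) - 1))
    return sub_tokens, sub_labels

# Toy stand-in for WordPiece: 'Francisco' -> ['Fran', '##cisco']
toy_vocab = {'Francisco': ['Fran', '##cisco']}
tokenize = lambda w: toy_vocab.get(w, [w])

tokens, tags = align_labels(['San', 'Francisco'], ['B-LOC', 'I-LOC'], tokenize)
# tokens: ['San', 'Fran', '##cisco'], tags: ['B-LOC', 'I-LOC', 'X']
```

The CRF layer and the evaluation script would both need to handle the extra 'X' positions, which is why the swap is not a drop-in change.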
Also, I am not convinced that BERT works well for e.g. NER. See: https://github.com/google-research/bert/issues/223
Me and several others tried to reproduce the NER results in the paper, but everyone failed. With the architecture described in the paper, BERT achieved a result that is really far behind the state of the art (its performance matches systems from about 2005).
Integrating BERT into a BiLSTM-CRF architecture improved the performance, but it still lags behind the state of the art.
I don't know if anyone has been able to reproduce the NER results from the BERT paper so far. I also don't know whether BERT is a good choice for sequence tagging tasks other than NER.
I can recommend having a look at this project, which supports BERT embeddings: https://github.com/zalandoresearch/flair
Best regards -Nils Reimers
Thank you, Nils for your great clarification.
I noticed in one of the issues that it's possible to use pre-trained word2vec embeddings together with ELMo. It's not clear to me how to combine ELMo and word2vec embeddings, and which of the embeddings will be used when training the model. For instance, if a word like "play" appears in multiple contexts, ELMo will produce a different embedding for it in each context, while word2vec will produce a single embedding for the word "play". Can you please explain briefly how to use both word2vec (or GloVe) and ELMo embeddings? Also, are you fine-tuning the pre-trained embeddings while training the models?
Thanks,
Hi @Dragon615, yes, with the implementation you can decide: 1) just use traditional word embeddings (like word2vec or GloVe), 2) use ELMo, or 3) use ELMo + traditional word embeddings.
Say you have the sentence 'I play the guitar' with option 3:
Then with word2vec, each word is mapped to a fixed-size vector, for example with 300 dimensions.
The sentence is also passed through the BiLSTM of ELMo. This generates 3 vectors per word, each with 1024 dimensions. Depending on the weighting scheme of ELMo, these 3 vectors are merged; for example, you take the average of the 3 vectors, giving you one 1024-dim vector for each word. Here it is true that 'play' gets a different embedding depending on the context.
This 1024-dim vector is concatenated with the 300-dim vector from step 1). This gives you one 1324-dimensional vector for 'play': 1024 dims from ELMo, 300 dims from word2vec.
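The combination described above can be sketched numerically. This is a toy illustration only: random vectors stand in for the real ELMo layer outputs and the real word2vec vector, and plain averaging is just one possible ELMo weighting scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real embeddings of the word 'play' in one sentence:
elmo_layers = rng.normal(size=(3, 1024))  # 3 ELMo layer outputs, 1024-dim each
elmo_vec = elmo_layers.mean(axis=0)       # one weighting scheme: plain average
w2v_vec = rng.normal(size=300)            # context-independent word2vec vector

# Concatenate the context-dependent and context-independent parts:
word_vec = np.concatenate([elmo_vec, w2v_vec])
print(word_vec.shape)  # (1324,)
```

The downstream BiLSTM then simply sees a 1324-dimensional input per word; it does not need to know which dimensions came from which embedding.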
Weights for word embeddings are not fine-tuned in this model: usually it decreases the performance (in my experience) and significantly increases training time.
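In a Keras-based setup like this framework's, freezing pre-trained embeddings amounts to marking the embedding layer as non-trainable. A minimal sketch, not the framework's actual code; the 5x300 random matrix and vocabulary size are stand-ins for real word2vec/GloVe weights:

```python
import numpy as np
from tensorflow import keras

# Stand-in for a real pre-trained embedding matrix (vocab_size x dim):
pretrained = np.random.rand(5, 300).astype('float32')

embedding = keras.layers.Embedding(
    input_dim=5, output_dim=300,
    embeddings_initializer=keras.initializers.Constant(pretrained),
    trainable=False)  # the optimizer will not update these weights
```

With `trainable=False`, backpropagation still flows through the lookup, but the embedding matrix itself is excluded from the gradient update, which also shrinks the number of trainable parameters considerably.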
Thank you so much Nils!!
Hi nreimers,
You have a really nice and clean codebase. In addition, I like the way you document your work, so thank you for your contributions.
I have a question regarding using ELMo with the MTL models in your other repository, "Train_MultiTask_Different_Levels.py" and "Train_MultiTask.py". What changes are needed to get these models to use ELMo representations?
Thanks in advance for your response.