flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Using Pre-trained ELMo Representations for Many Languages in Flair #438

Closed mauryaland closed 4 years ago

mauryaland commented 5 years ago

Hello,

First of all, thanks for the great work. This library is very useful and I follow with attention the many improvements!

I wonder whether there is a possibility of implementing ELMo embeddings from the ELMoForManyLangs repository?

Thank you in advance for your answer; I am happy to help if the answer is positive.

Amaury

alanakbik commented 5 years ago

Hello @mauryaland, this looks very interesting - multilingual ELMo would definitely be a great addition to Flair. Are you planning an installable pip package?

mauryaland commented 5 years ago

@alanakbik I will ask them whether they are planning to, or whether I can create it myself. I'll let you know.

alanakbik commented 5 years ago

Cool, thanks!

mauryaland commented 5 years ago

@alanakbik I got an answer from the author in the following issue: the project is still too unstable for now, so wait and see. I will keep following the topic.

On another note, I recently discovered some great embeddings: subword embeddings for many languages, called BPEmb. They could be interesting to use, and the package is available on PyPI. Things are moving really fast in NLP these days!
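If it helps, here is a minimal sketch of how the package is used, based on my reading of the BPEmb README (the language, vocabulary size, and dimension arguments are just example values):

```python
from bpemb import BPEmb

# Load pretrained English subword embeddings (downloaded on first use).
# vs = BPE vocabulary size, dim = embedding dimension.
bpemb_en = BPEmb(lang="en", vs=10000, dim=100)

# Segment text into BPE subword units...
pieces = bpemb_en.encode("multilingual embeddings")  # list of subword strings
# ...and look up one pretrained vector per subword piece.
vectors = bpemb_en.embed("multilingual embeddings")  # numpy array, shape (num_pieces, 100)

print(pieces, vectors.shape)
```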

alanakbik commented 5 years ago

Wow these look interesting - perhaps we can integrate them.

stefan-it commented 5 years ago

Would be great if they had used the official ELMo training code 😂 From my experience the Transformer ELMo model is a good alternative to the default ELMo model, and training is a lot faster.

BPE embeddings could be interesting. They would work easily for text classification, because you only have to BPE-encode the sentences and train a baseline model; if you also want to use a language model, then you need to train it on a BPE-encoded corpus.

For sequence tagging: I ran some experiments with SentencePiece (SentencePiece word embeddings + a SentencePiece language model + converting a CoNLL NER dataset to SentencePiece as well), but the results weren't very promising (it's an open question how to tag the pieces...).

But for text classification I think it is worth trying these BPEmb embeddings, e.g. in a sketch like the one below. There's also a subword variant using SentencePiece in combination with the ULMFiT model on Polish; see the paper here.
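Roughly what I have in mind for a classification baseline (a sketch only, assuming the BPEmb Python package and PyTorch; the averaged bag-of-subwords classifier is just a placeholder model):

```python
import torch
import torch.nn as nn
from bpemb import BPEmb

# Pretrained subword vectors (vocabulary size and dimension are example values).
bpemb = BPEmb(lang="en", vs=10000, dim=100)

# Initialise an embedding layer from the pretrained BPEmb vectors.
embedding = nn.Embedding.from_pretrained(torch.tensor(bpemb.vectors), freeze=False)

def encode(sentence: str) -> torch.Tensor:
    # BPE-encode the sentence into subword ids and average their embeddings.
    ids = torch.tensor(bpemb.encode_ids(sentence))
    return embedding(ids).mean(dim=0)

# Minimal bag-of-subwords classifier on top of the averaged representation.
classifier = nn.Linear(100, 2)
logits = classifier(encode("This framework is very useful."))
```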

mauryaland commented 5 years ago

Thanks for the details.

Indeed, how to tag the pieces for sequence tagging seems tricky.

Appreciate the paper on SentencePiece in combination with the ULMfit model, really good results!

bheinzerling commented 5 years ago

About tagging pieces: Converting token-based tags to subword-based tags is not necessary.

Instead, after running your encoder (LSTM, ELMo, BERT...) over the subword sequence, you simply pick one encoder state for each token, e.g. the state corresponding to the first subword of each token. This is described in some detail, with example code, here.
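Roughly like this (a toy sketch with made-up subword segmentation and random encoder states, just to show the indexing):

```python
import torch

tokens = ["Stratford", "is", "nice"]
# Hypothetical BPE segmentation of each token.
subwords = [["▁strat", "ford"], ["▁is"], ["▁nice"]]

# Position of the first subword of each token in the flattened subword sequence.
first_subword_idx = []
offset = 0
for pieces in subwords:
    first_subword_idx.append(offset)
    offset += len(pieces)

# Pretend encoder output: one state per subword (here random; in practice
# the output of an LSTM/ELMo/BERT run over the subword sequence).
num_subwords, hidden = offset, 8
encoder_states = torch.randn(num_subwords, hidden)

# One state per original token: the state of its first subword.
token_states = encoder_states[torch.tensor(first_subword_idx)]
assert token_states.shape == (len(tokens), hidden)
```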

stefan-it commented 5 years ago

Thanks for that hint, @bheinzerling. Your link also includes a very nice reference to another discussion about the NER results in the BERT paper (and document vs. sentence context) :+1:

alanakbik commented 5 years ago

@bheinzerling just had a look at your BPEmb paper - this looks really interesting and could allow us to reduce model size (as you noted, fastText embeddings are huge). So we'll definitely take a look at integrating your embeddings!

gccome commented 5 years ago

@bheinzerling is there a way to train BPEmb with our own data? Thanks in advance!

Update: after reading your BPEmb paper, I figured out a way of doing so. First use SentencePiece to train a BPE model, and then use GloVe or Word2Vec to train the BPEmb embeddings.
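Something along these lines (a sketch only, assuming recent versions of sentencepiece and gensim; file names and hyperparameters are placeholders):

```python
import sentencepiece as spm
from gensim.models import Word2Vec

# 1. Train a BPE segmentation model on your own corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=10000, model_type="bpe"
)
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# 2. BPE-encode the corpus and train subword embeddings on the pieces.
with open("corpus.txt", encoding="utf-8") as f:
    encoded_corpus = [sp.encode(line.strip(), out_type=str) for line in f]

model = Word2Vec(sentences=encoded_corpus, vector_size=100, window=5, min_count=1)
model.wv.save_word2vec_format("bpe_embeddings.txt")
```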

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.