Ukrainian language support in Flair

flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)

https://flairnlp.github.io/flair/


Ukrainian language support in Flair #2985

Open alanakbik opened 1 year ago

alanakbik commented 1 year ago

This issue tracks the progress of adding support for the Ukrainian language from lang-uk to Flair. We would like to add:

[x] Ukrainian Flair embeddings trained by @dchaplinsky and available here: forward and backward. Should be made loadable with embeddings = FlairEmbeddings('uk-forward') and embeddings = FlairEmbeddings('uk-backward')
[x] Ukrainian NER by @dchaplinsky, available here. Should be made loadable with tagger = SequenceTagger.load('ner-ukrainian')
[x] Ukrainian part-of-speech tagger by @dchaplinsky, available here. Should be made loadable with tagger = SequenceTagger.load('pos-ukrainian')
[x] Ukrainian NER dataset described here. Loadable as corpus = NER_UKRAINIAN(). Should be integrated only once version 2.0 is complete.
[x] Ukrainian Universal Dependency Treebank, loadable as corpus = UD_UKRAINIAN().
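
As a rough illustration of how the items in this checklist could be used together, here is a minimal sketch. It assumes an installed Flair release in which the identifiers proposed above actually resolve; the helper names (`UKRAINIAN_RESOURCES`, `tag_sentence`, `stacked_ukrainian_embeddings`, `load_corpus`) are hypothetical and exist only for this example.

```python
# Identifiers as proposed in this issue. Whether each one resolves depends on
# the installed Flair version (assumption: a release that includes them).
UKRAINIAN_RESOURCES = {
    "embeddings": ["uk-forward", "uk-backward"],
    "taggers": ["ner-ukrainian", "pos-ukrainian"],
    "corpora": ["NER_UKRAINIAN", "UD_UKRAINIAN"],
}


def tag_sentence(text, tagger_name="ner-ukrainian"):
    """Run one of the Ukrainian taggers over a single sentence."""
    # Flair is imported lazily so this module can be read without it installed.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(tagger_name)  # downloads the model on first use
    sentence = Sentence(text)
    tagger.predict(sentence)  # annotates the sentence in place
    return sentence


def stacked_ukrainian_embeddings():
    """Combine the forward and backward Ukrainian Flair embeddings."""
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings

    forward, backward = UKRAINIAN_RESOURCES["embeddings"]
    return StackedEmbeddings([FlairEmbeddings(forward), FlairEmbeddings(backward)])


def load_corpus(name):
    """Instantiate one of the two corpora named in the checklist."""
    # Validate first, so an unknown name fails before anything is downloaded.
    if name not in UKRAINIAN_RESOURCES["corpora"]:
        raise ValueError(f"unknown Ukrainian corpus: {name!r}")
    import flair.datasets

    return getattr(flair.datasets, name)()
```

With Flair installed and network access available, something like `tag_sentence("Київ є столицею України.").get_spans("ner")` would be expected to list the recognized entities, assuming the tagger writes its predictions under the `ner` label type.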

dchaplinsky commented 1 year ago

This is the code for the NER corpus I've used: https://github.com/lang-uk/flair-ner/blob/main/train_base.py#L32

and the code for the POS corpus: https://github.com/lang-uk/flair-pos/blob/main/train_grid.py#L21

I'll check whether I have a fixed split for NER hosted somewhere else.

stefan-it commented 1 year ago

Really cool idea!

I had to do a lot of manual preprocessing steps to get NER working when evaluating the ELECTRA model:

https://github.com/stefan-it/ukrainian-electra/blob/main/download_prepare_data_ner.sh

dchaplinsky commented 1 year ago

Oh, @stefan-it, thanks for reminding me. I totally forgot about the fixed split.

On a separate topic: would you like to try training ELECTRA on better-quality Ukrainian texts?

stefan-it commented 1 year ago

Hey @dchaplinsky, I currently have access to TPUs, so if you have texts available I would love to pretrain another model :hugs:

dchaplinsky commented 1 year ago

Yes I do! Could you contact me at chaplinsky[dot]dmitry on gmail?

dchaplinsky commented 1 year ago

Hi @alanakbik and @stefan-it

I've just uploaded two bigger models for the Ukrainian language: https://huggingface.co/lang-uk/flair-uk-forward-large https://huggingface.co/lang-uk/flair-uk-backward-large

These have hidden_size=2048 (in contrast to the 1024 of the original ones) and were trained on my data plus data from Stefan (54 GB in total).

I've also trained a downstream NER model on them and got a nice 1.5% improvement over the previous one. I will publish it shortly.
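
For anyone wanting to try the larger models, one possible way to load them is sketched below. It assumes the `huggingface_hub` package is installed and that `FlairEmbeddings` is given a local path to the downloaded language-model file. The name of the weight file inside each repository is not stated in this thread, so `filename` is left as a required argument; the helper `load_large_embeddings` is hypothetical.

```python
# Repository IDs taken from the two Hugging Face links in the comment above.
LARGE_MODEL_REPOS = {
    "forward": "lang-uk/flair-uk-forward-large",
    "backward": "lang-uk/flair-uk-backward-large",
}


def load_large_embeddings(direction, filename):
    """Download one of the large language models and wrap it as FlairEmbeddings.

    `filename` is the weight file inside the repository. It is NOT given in
    this thread, so the caller has to look it up on the model page.
    """
    if direction not in LARGE_MODEL_REPOS:
        raise ValueError(f"direction must be one of {sorted(LARGE_MODEL_REPOS)}")

    # Imported lazily so the module can be read without these packages installed.
    from huggingface_hub import hf_hub_download
    from flair.embeddings import FlairEmbeddings

    # hf_hub_download returns the local path of the cached file.
    local_path = hf_hub_download(
        repo_id=LARGE_MODEL_REPOS[direction], filename=filename
    )
    return FlairEmbeddings(local_path)
```

The forward and backward models can then be passed together to `StackedEmbeddings`, in the same way as the smaller `uk-forward` and `uk-backward` models.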
