flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Polish language support #187

Closed dkajtoch closed 4 years ago

dkajtoch commented 5 years ago

Hi, thanks for a great package! I am not sure whether the current version supports the Polish language. I can see word embeddings, but the 'ner' tagger is unavailable. Are you planning to develop methods for Polish?

alanakbik commented 5 years ago

Hi @dkajtoch, yes, Polish is one of the main languages we are currently looking at and want to add to the project very soon. We've already added the character LM embeddings that (Borchmann et al., 2018) computed to the project, and in a recent branch also FastText embeddings for Polish.

I am hoping that very soon we can also add pre-trained models for Polish NER and PoS tagging, but of course any help is appreciated :)

dkajtoch commented 5 years ago

Thanks @alanakbik! It would be great to collect all the pieces in one place. They are currently scattered around the web: http://clip.ipipan.waw.pl/LRT

dkajtoch commented 5 years ago

One more thing, @alanakbik. Are you going to use a supervised learning approach for PoS tagging and NER? If so, how do you get high-quality training data?

alanakbik commented 5 years ago

For PoS tagging, we want to use the Universal Dependencies datasets. For NER, either the data used for PolEval-2018 or the auto-generated dataset from WikiNER. But we don't have much experience with Polish datasets so far. Do you have any suggestions?

alanakbik commented 5 years ago

Just a quick heads up: we've just pushed an update into master that includes pre-trained FastText embeddings for Polish. You can load them with:

from flair.embeddings import WordEmbeddings
embeddings = WordEmbeddings('pl')  # 'pl' is the language code for Polish

For now they are only available through the master branch, but we're planning another version release (0.3.2) in a few days. After that, they'll also be available if you install from pip.

dkajtoch commented 5 years ago

Great news, @alanakbik! Sorry, but I do not have much experience with Polish datasets either. Maybe the resources you mentioned are a good starting point.

alanakbik commented 5 years ago

@dkajtoch we've just released flair 0.4.0, which contains models for Polish word embeddings and part-of-speech tagging. Here is how you load word embeddings:

from flair.embeddings import FlairEmbeddings, WordEmbeddings

# Polish word embeddings
polish_flair_embeddings = FlairEmbeddings('polish-forward')
polish_word_embeddings = WordEmbeddings('pl')
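
The two embeddings above can also be combined and applied to a sentence. The following is only a minimal sketch, assuming flair's `StackedEmbeddings` class and the `token.embedding` attribute; the helper name `embed_polish_sentence` is hypothetical and not part of flair. Calling it downloads the pre-trained models on first use.

```python
def embed_polish_sentence(text):
    """Embed `text` with stacked Polish embeddings.

    Returns a list of (token text, embedding length) pairs.
    Hypothetical helper for illustration; not part of flair.
    """
    # Imports live inside the function so that defining it does not
    # require flair to be installed or any model to be downloaded.
    from flair.data import Sentence
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings

    # Concatenate classic word vectors with the character-level LM embeddings.
    stacked = StackedEmbeddings([
        WordEmbeddings('pl'),
        FlairEmbeddings('polish-forward'),
    ])

    sentence = Sentence(text)
    stacked.embed(sentence)

    # Each token now carries one concatenated vector.
    return [(token.text, token.embedding.shape[0]) for token in sentence]
```

A call such as `embed_polish_sentence('Ola szykuje się do szkoły .')` should return one pair per token, all with the same embedding length.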

And here is how you parse a Polish sentence with the multilingual POS tagger:

from flair.data import Sentence
from flair.models import SequenceTagger

# multilingual PoS tagger; its training data includes Polish
polish_pos_tagger = SequenceTagger.load('pos-multi')

sentence = Sentence('Ola szykuje się do szkoły .')
polish_pos_tagger.predict(sentence)

print(sentence.to_tagged_string())

Hope this helps!
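
For downstream processing it can be handy to turn the tagged string into (token, tag) pairs. Below is a rough sketch in plain Python, assuming the `token <TAG>` layout that `to_tagged_string()` printed in flair releases of this period. The function name `parse_tagged_string` is hypothetical, and the sample string is hand-written for illustration, not actual model output.

```python
import re

def parse_tagged_string(tagged):
    """Split a 'token <TAG> token <TAG>' string into (token, tag) pairs.

    Tokens without a following <TAG> get None as their tag.
    """
    pairs = []
    for piece in tagged.split():
        match = re.fullmatch(r"<(.+)>", piece)
        if match and pairs and pairs[-1][1] is None:
            # Attach the tag to the token that precedes it.
            pairs[-1] = (pairs[-1][0], match.group(1))
        else:
            # A plain token; its tag is filled in if one follows.
            pairs.append((piece, None))
    return pairs

# Hand-written sample in the assumed format (illustrative tags only).
sample = "Ola <PROPN> szykuje <VERB> się <PRON> do <ADP> szkoły <NOUN> . <PUNCT>"
pairs = parse_tagged_string(sample)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

In practice, iterating over the tokens of the `Sentence` object directly is likely more robust than parsing the printed string; the sketch only shows how the printed format can be read.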

dkajtoch commented 5 years ago

That is great! I am looking forward to trying it :) Did you benchmark it on any dataset?

mzalevski commented 5 years ago

Hi @alanakbik, first of all, great work! Flair is an amazing tool. However, I have a problem with the Polish language and I hope you could give me some advice.

In Readme.md you claim that NER in Polish is supported and that you achieve very good results: [image: polish_lang_flair]

but you do not provide an explanation in "Best Configurations per Dataset" page and in the "Tutorial 2: Tagging your Text" page you say that "The NER models are trained over 4 languages (English, German, Dutch and Spanish) and the PoS models over 12 languages (English, German, French, Italian, Dutch, Polish, Spanish, Swedish, Danish, Norwegian, Finnish and Czech)."

As far as I can tell, this should mean that Polish is supported only for PoS and that I should not expect great success with NER. How is this high score (86.6) achieved, then? Please bear with me here, as I am a beginner in NLP. I would be grateful if you could help me :)

mzalevski commented 5 years ago

@alanakbik Sorry to bother you again, but I've just realised that the answer was in front of me all along: https://github.com/applicaai/poleval-2018 provides an explanation :)

alanakbik commented 5 years ago

Hi @maciej-zalewski, great! The authors put a lot of detail into their repo and paper, and will probably be happy to provide any other details you need.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.