Franck-Dernoncourt / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
MIT License

Steps to utilize NeuroNER for other languages #30

Open sooheon opened 7 years ago

sooheon commented 7 years ago

It appears that brat, at least, is pretty language-agnostic. The English-specific parts of NeuroNER (as far as I can tell) are the recommended glove.6B.100d word vectors and all of the spaCy-related tokenizing code, which is used to translate brat format into CoNLL format (correct?)

Am I correct that if I:

  1. Supply Korean word vectors in /data/word_vectors
  2. Supply CoNLL-formatted train, valid, and test data by running brat-labeled Korean text through my own tokenizer

I will be able to train and use NeuroNER for Korean text?
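For reference, the CoNLL-style data NeuroNER trains on is one token per line with a BIO tag in the last column and a blank line between sentences. A minimal sketch of reading such a file back into sentences of (token, tag) pairs (the Korean tokens and tags below are illustrative, not from the project; check the exact column layout against files NeuroNER itself produces):

```python
def read_conll(lines):
    """Parse CoNLL-style lines (token ... BIO-tag) into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # first column: token, last: tag
    if current:
        sentences.append(current)
    return sentences

sample = """서울은 B-LOC
한국의 B-LOC
수도이다 O

NeuroNER O"""
print(read_conll(sample.splitlines()))
```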

Franck-Dernoncourt commented 7 years ago

Correct! Note that providing word vectors is optional (it's typically better if you have some), and that I haven't tested NeuroNER with languages other than English. I know someone successfully used it in French (after an encoding fix PR :)), and someone was supposed to try with Bangladeshi but I haven't heard back from him.


Gregory-Howard commented 7 years ago

Hi (I'm the guy who uses NeuroNER in French)! Those two steps are right, but you also need spaCy (or NLTK) working in Korean. To explain a bit more for spaCy: you need a spaCy Korean model, which consists of a tokenizer and a POS-tagging model. Someone asked exactly this question: https://github.com/explosion/spaCy/issues/929 Then you will have to change spacylanguage in parameters.ini. I hope I'm clear; if not, feel free to ask.
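If a Korean spaCy model were available, the switch would be a one-line configuration change; a sketch of the relevant entry (the exact key name and accepted values may differ between NeuroNER versions, so verify against your copy of the file):

```ini
; in parameters.ini (key name as referenced above)
spacylanguage = ko
```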


sooheon commented 7 years ago

Thanks for the additional detail! That looks perfectly doable.

ersinyar commented 6 years ago

I don't understand what exactly spaCy (or NLTK) does in NeuroNER. I think spaCy is used as the tokenizer. Do we need a language-specific tokenizer? And why do we need a POS-tagging model? Can't we just use NLTK for tokenization?

Gregory-Howard commented 6 years ago

spaCy is used in this file: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20 The problem here is `for span in document.sents:` — this method needs a model to work. I think if we transform the code a bit, we might need just a tokenizer.
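To illustrate the transformation suggested above: what brat_to_conll mainly needs from spaCy is tokens with character offsets, so they can be aligned with the spans in the brat .ann files. A hedged, stdlib-only sketch of a whitespace tokenizer that could stand in for that part (the function name and interface are my own, not NeuroNER's, and real text would need better rules for punctuation):

```python
import re

def tokenize_with_offsets(text):
    """Yield (token, start, end) tuples using a simple regex,
    mimicking the offsets brat_to_conll gets from spaCy tokens."""
    for match in re.finditer(r"\S+", text):
        yield match.group(), match.start(), match.end()

text = "Seoul is the capital of Korea."
print(list(tokenize_with_offsets(text)))
```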

Killthebug commented 6 years ago

Hey all! I'm trying to get NeuroNER to work for some Hindi data, but from what I understand spaCy does not support Hindi.

Would you recommend I use NLTK instead? From what I gather, spaCy (or NLTK) is primarily used for sentence splitting and tokenizing here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20
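When neither spaCy nor NLTK covers the language, even the sentence-splitting step can be approximated. A naive stdlib sketch, splitting after sentence-final punctuation (this will misfire on abbreviations; for Hindi you would add the danda "।" to the terminator set):

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(split_sentences("First sentence. Second one! Third?"))
```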

svanhvitlilja commented 5 years ago

Hi! As you seem to be the people who have the most experience in using NeuroNER for languages other than English, could I please ask you to take a look at my query regarding Icelandic?

Unfortunately spaCy, Stanford NLP, and NLTK don't support Icelandic, so we need to find a way to use NeuroNER relying on the NLP tools that are available for Icelandic. Thanks a lot! Issue: #126

Peacelover01 commented 4 years ago

Can we use the NeuroNER model for the Urdu language? spaCy doesn't support Urdu. Also, can we use other word embeddings, like Facebook's fastText?

svanhvitlilja commented 4 years ago

You can use your own tokenizer and bypass spaCy by changing a few lines in the source code; we did that for Icelandic. I can give you some pointers if you want. I don't know about the other embeddings; I'd like to know too :)
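As a sketch of what "bypassing spaCy" can look like: if your own tokenizer already produces sentences of (token, start, end, label) tuples, emitting the CoNLL-style lines is straightforward. The five-column layout below is an assumption on my part; check it against the files NeuroNER's own brat_to_conll converter writes:

```python
def to_conll(sentences, doc_name="doc"):
    """sentences: list of sentences, each a list of
    (token, start, end, label) tuples from your own tokenizer."""
    lines = []
    for sent in sentences:
        for token, start, end, label in sent:
            lines.append(f"{token} {doc_name} {start} {end} {label}")
        lines.append("")  # blank line separates sentences
    return "\n".join(lines)

sent = [[("Reykjavik", 0, 9, "B-LOC"), ("is", 10, 12, "O")]]
print(to_conll(sent))
```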

Peacelover01 commented 4 years ago

Thank you @svanhviti16 for your reply. It will be highly appreciated.