Franck-Dernoncourt / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
MIT License

Usage of CoNLL-03 values #126

Closed: svanhvitlilja closed this issue 5 years ago

svanhvitlilja commented 6 years ago

Hi! We're working on a named entity recognizer for Icelandic, using NeuroNER and an annotated training corpus.

As there is no support for Icelandic in spaCy or the Stanford NLP tools, we ran into a problem when running NeuroNER on our data in brat format: an error appears when tokenizing with spaCy in brat_to_conll.py.

Our question is: can we bypass spaCy altogether by formatting our data in CoNLL-03 format ourselves, using available Icelandic NLP resources? And to what extent are the CoNLL values used in NeuroNER?
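
For illustration, here is a minimal sketch of what "formatting the data ourselves" could look like once tokens and labels are available from an external Icelandic tool. The original CoNLL-2003 files have four columns (word, POS tag, chunk tag, NER tag); this sketch writes only the token and a BIO label, on the assumption that the loader needs just the first and last column. That assumption, the `to_conll` name, and the label names are all hypothetical and should be checked against NeuroNER's dataset-loading code.

```python
from typing import Iterable, List, Tuple

# One sentence = list of (token, BIO label) pairs, e.g. ("Reykjavik", "B-LOC").
Sentence = List[Tuple[str, str]]


def to_conll(sentences: Iterable[Sentence]) -> str:
    """Render sentences as space-separated 'token label' lines.

    A blank line follows each sentence, which is how CoNLL-style files
    mark sentence boundaries. The two-column layout is an assumption,
    not a confirmed NeuroNER requirement.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            # Columns are whitespace-separated, so a token may not contain whitespace.
            if not token or any(ch.isspace() for ch in token):
                raise ValueError(f"invalid token: {token!r}")
            lines.append(f"{token} {label}\n")
        lines.append("\n")  # sentence boundary
    return "".join(lines)


if __name__ == "__main__":
    example = [
        [("Anna", "B-PER"), ("lives", "O"), ("in", "O"), ("Reykjavik", "B-LOC"), (".", "O")],
    ]
    print(to_conll(example), end="")
```

The resulting string can be written to the training, validation, and test files with UTF-8 encoding so that Icelandic characters survive intact.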

Gregory-Howard commented 6 years ago

Hi, you need to check this code: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20 and understand what it does. Then replace it with an Icelandic tokenizer/sentence segmenter.
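
A rough sketch of the kind of replacement this involves, not the actual NeuroNER code: brat annotations refer to character offsets, so a substitute tokenizer presumably has to return each token together with its start and end offsets so that tokens can be matched to entity spans. The regex tokenizer below is only a placeholder for a real Icelandic tokenizer, and the names `tokenize` and `bio_labels` are hypothetical.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]   # (text, start, end) character offsets
Entity = Tuple[str, int, int]  # (type, start, end), as given in a brat .ann file


def tokenize(sentence: str, offset: int = 0) -> List[Token]:
    """Placeholder tokenizer: runs of word characters or single punctuation marks.

    `offset` is the position of the sentence within the full document, so the
    returned offsets stay comparable with document-level brat spans.
    Swap the regex for a real Icelandic tokenizer.
    """
    return [
        (m.group(), offset + m.start(), offset + m.end())
        for m in re.finditer(r"\w+|[^\w\s]", sentence)
    ]


def bio_labels(tokens: List[Token], entities: List[Entity]) -> List[str]:
    """Assign a BIO label to each token lying fully inside an entity span."""
    labels = []
    started = set()  # indices of entities that already received a B- token
    for _, start, end in tokens:
        label = "O"
        for i, (etype, e_start, e_end) in enumerate(entities):
            if e_start <= start and end <= e_end:
                label = ("I-" if i in started else "B-") + etype
                started.add(i)
                break
        labels.append(label)
    return labels


if __name__ == "__main__":
    text = "Anna lives in New York."
    entities = [("PER", 0, 4), ("LOC", 14, 22)]
    tokens = tokenize(text)
    for (tok, _, _), label in zip(tokens, bio_labels(tokens, entities)):
        print(tok, label)
```

Tokens that only partly overlap an entity span are labeled `O` here; a real conversion would need to decide how to handle such mismatches between tokenization and annotation boundaries.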

Gregory-Howard commented 6 years ago

It's not easy to do, and it could take a lot of time.

svanhvitlilja commented 5 years ago

Thanks, we changed the source code to use our own tokenizing method :)