Franck-Dernoncourt / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
MIT License

How to approach importing a new dataset to train a NN with custom entities in NeuroNER #19

Open icarusin opened 7 years ago

icarusin commented 7 years ago

Hi Franck,

Thanks for building this abstraction on top of TensorFlow to make it easier to apply NER. Do you have any pointers on how to convert "un-annotated" text, whose custom annotations live in a separate file (labels with offsets into the main file), to CoNLL or BRAT format so that it can be used to train a NN? The entities I am interested in are not the standard ones but are custom to a domain (names of specific models of cars). Also, the custom annotations in the other file do not include any POS or coreference tags.

I have several thousand of these "un-annotated" text files, so a manual annotation process (such as with BRAT) is not feasible. This is not strictly a NeuroNER issue, but your suggestions on the best approach for converting these files into a format that could be used to prepare a training dataset for NeuroNER would be very helpful.

Thanks, Ar

heri commented 7 years ago

What about using Amazon Mechanical Turk?

ngarneau commented 7 years ago

@icarusin, starting an NER project without labeled data is challenging. When your domain is very specific and contains many uncommon words, random annotators won't do the job well, and you'll lose time and money. So, in my opinion, I wouldn't go with Amazon Mechanical Turk.

One thing you could do to reduce the time spent annotating documents is to try a bootstrapping approach such as the one proposed by Manning & Gupta.
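As a rough illustration of how a first round of "pre-annotations" might be produced, here is a minimal sketch that seeds spans by matching a small list of known entity names against the raw text. To be clear, this is plain dictionary matching, not the pattern-learning method referenced above; the function name `pre_annotate`, the `CAR_MODEL` label, and the sample terms are all made up for the example.

```python
import re

def pre_annotate(text, gazetteer, label="CAR_MODEL"):
    """Return (start, end, label, surface) tuples for gazetteer terms found in text.

    Hypothetical helper: `gazetteer` is any iterable of known entity names.
    """
    # Longest terms first, so a longer name wins over a shorter prefix of it.
    terms = sorted(set(gazetteer), key=len, reverse=True)
    if not terms:
        return []
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    # Character offsets are relative to the start of `text`.
    return [(m.start(), m.end(), label, m.group(0)) for m in pattern.finditer(text)]

# Example with made-up data.
sample = "I drove a Civic and a Model S yesterday."
spans = pre_annotate(sample, ["Civic", "Model S"])
for start, end, label, surface in spans:
    print(start, end, label, surface)
```

Spans found this way will contain errors and misses (ambiguous names, misspellings), which is why they are best treated as a starting point to correct by hand rather than as gold labels.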

Once you have some "pre-annotations", you can fix them quickly, train a classifier on this small dataset, and then iterate over and over to enlarge your annotated corpus. Hope this helps,

Nicolas
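
On the conversion step asked about at the top of the thread: if the labels are already available as character offsets into each text file, writing them out in BRAT's standoff format is mostly a matter of string formatting. The sketch below assumes each annotation can be read as a `(start, end, label)` tuple with an exclusive end offset; the actual layout of the offset file is not described in this thread, so the parsing step is left out, and `to_brat_ann` is a hypothetical helper name.

```python
def to_brat_ann(text, spans):
    """Build the contents of a BRAT .ann file from character-offset spans.

    Assumption: `spans` is an iterable of (start, end, label) tuples, where
    start/end are character offsets into `text` and end is exclusive.
    """
    lines = []
    for i, (start, end, label) in enumerate(sorted(spans), start=1):
        surface = text[start:end]
        # This sketch only handles contiguous, single-line, non-empty spans.
        if not surface.strip() or "\n" in surface:
            raise ValueError(f"unusable span {start}-{end}: {surface!r}")
        # Text-bound annotation line: ID <TAB> label start end <TAB> covered text
        lines.append(f"T{i}\t{label} {start} {end}\t{surface}")
    return "\n".join(lines) + ("\n" if lines else "")

# Example with made-up data.
doc = "The new Civic is out."
ann = to_brat_ann(doc, [(8, 13, "CAR_MODEL")])
print(ann, end="")
```

BRAT pairs each `.txt` file with an `.ann` file of the same base name, so the returned string would be written next to the corresponding text file. Checking that `text[start:end]` matches the expected entity string is a cheap way to catch offset mismatches (for example, from encoding or newline differences) before training.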