How to approach importing a new dataset to train a NN with custom entities in NeuroNER

icarusin commented 7 years ago

Hi Franck,

Thanks for building this abstraction on top of TensorFlow to make it easier to apply NER. Do you have any pointers on how to convert "un-annotated" text, whose custom annotations live in a separate file (labels with offsets into the main file), to CoNLL or BRAT format so that it can be used to train a NN? The entities I am interested in are not the standard ones but are specific to a domain (names of specific car models). Also, the custom annotations in the separate file do not include any POS or coreference tags.

I have several thousand of these "un-annotated" text files, so a manual annotation process (such as using BRAT) is not feasible. This is not an issue with NeuroNER itself, but your suggestions on the best way to convert these files into a format that can be used as a training dataset for NeuroNER would be very helpful.
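To make the conversion concrete: since the labels are already stored as character offsets, the BRAT standoff format is probably the closer target, because a BRAT `.ann` file is itself a list of offset-based text-bound annotations (`T<id>`, a tab, `LABEL start end`, a tab, the covered text). The sketch below is only an illustration under assumptions: the `(start, end, label)` tuple layout, the `to_brat_ann` helper name, and the `CAR_MODEL` label are hypothetical, and end offsets are assumed to be exclusive.

```python
from typing import Iterable, Tuple

# Assumed input layout (hypothetical): (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def to_brat_ann(text: str, spans: Iterable[Span]) -> str:
    """Render offset-based labels as BRAT text-bound annotation lines."""
    lines = []
    for i, (start, end, label) in enumerate(sorted(spans), start=1):
        # Reject offsets that do not point inside the text.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        mention = text[start:end]
        # Keep the sketch simple: one contiguous, single-line mention per entity.
        if "\n" in mention:
            raise ValueError(f"span ({start}, {end}) crosses a line break")
        lines.append(f"T{i}\t{label} {start} {end}\t{mention}")
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    sample = "I drove a Ford Mustang and a Honda Civic."
    labels = [(10, 22, "CAR_MODEL"), (29, 40, "CAR_MODEL")]
    print(to_brat_ann(sample, labels), end="")
```

The returned string would be written next to the text file under the same base name (for example `doc1.txt` and `doc1.ann`). POS and coreference tags are not part of these lines, so their absence in the source annotations is not a problem for this step.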

Thanks, Ar

heri commented 7 years ago

What about using Amazon Mechanical Turk?

ngarneau commented 7 years ago

@icarusin, starting a NER project without labeled data is a challenging task. When your domain is really specific and contains many uncommon words, random annotators won't do the job well and you'll lose time and money, so in my opinion Amazon Mechanical Turk is not the way to go.

One thing you could do to reduce the time spent annotating documents is to try a bootstrapping approach such as the one proposed by Manning & Gupta.
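As a rough illustration of where a first round of pre-annotations could come from, here is a minimal sketch that matches a seed list of known entity names against the text and emits token-level BIO tags. This is a plain dictionary lookup, a much simpler stand-in and not the pattern-learning method referenced above. The function names, the `CAR_MODEL` label, and the regex tokenizer are all assumptions made for the example, and the exact column layout a given CoNLL loader expects still has to be checked separately.

```python
import re
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def pre_annotate(text: str, seeds: Sequence[str], label: str = "CAR_MODEL") -> List[Span]:
    """Find case-insensitive whole-word matches of the seed names."""
    if not seeds:
        return []
    # Longest seeds first so "Ford Mustang GT" wins over "Ford Mustang".
    ordered = sorted(seeds, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


def to_bio(text: str, spans: Sequence[Span]) -> List[Tuple[str, str]]:
    """Tokenize with a simple regex and assign a BIO tag to each token."""
    rows = []
    for tok in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to an entity if it lies fully inside the span.
            if start <= tok.start() and tok.end() <= end:
                tag = ("B-" if tok.start() == start else "I-") + label
                break
        rows.append((tok.group(), tag))
    return rows


if __name__ == "__main__":
    sentence = "The Ford Mustang GT beat a honda civic."
    seed_names = ["Ford Mustang", "Ford Mustang GT", "Honda Civic"]
    for token, tag in to_bio(sentence, pre_annotate(sentence, seed_names)):
        print(token, tag)
```

Matches produced this way will miss unseen model names and can mislabel ambiguous words, which is why they are treated as pre-annotations to be corrected rather than as gold labels.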

Once you have some "pre-annotations", you can fix them quickly, train a classifier on this small dataset, and then iterate, enlarging your annotated corpus with each round. Hope this helps,

Nicolas

Franck-Dernoncourt / NeuroNER

How to approach importing a new dataset to train a NN with custom entities in NeuroNER #19