Support training Stanford NER model

dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.

https://dkpro.github.io/dkpro-core

Support training Stanford NER model #1000

Closed neumannm closed 7 years ago

neumannm commented 7 years ago

Analogous to the OpenNlpNamedEntityRecognizerTrainer, it would be nice to also have a component for training NER models for Stanford CoreNLP.

The important aspects are:

- Configuration is normally provided via a properties file --> I think it should stay this way: use the usual source, target, etc. as configuration parameters for the component, plus a parameter for the config file location.
- Normally, the training files are expected to be in TSV format, with one column of tokens and one column of NER tags (other columns are possible; the column mapping is again specified in the config file), and several tag encodings such as IOB, IOB2, and SBIEO are possible --> but I'm not sure; I think when integrating into DKPro Core, the training data should come from the CAS, right?
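
The two-column layout described above could be produced roughly as follows. This is only a minimal sketch in plain Java: the `Token` class and the `toTsv` helper are hypothetical stand-ins for token and named-entity annotations that a real component would read from the CAS, and the IOB2 encoding is just one of the variants mentioned.

```java
import java.util.List;

final class NerTsvSketch {
    /** Hypothetical token holder; entityType is null for tokens outside any entity. */
    static final class Token {
        final String text;
        final String entityType;
        final boolean beginsEntity;

        Token(String text, String entityType, boolean beginsEntity) {
            this.text = text;
            this.entityType = entityType;
            this.beginsEntity = beginsEntity;
        }
    }

    /** Renders one sentence as "token<TAB>tag" lines using IOB2 tags. */
    static String toTsv(List<Token> sentence) {
        StringBuilder sb = new StringBuilder();
        for (Token t : sentence) {
            String tag;
            if (t.entityType == null) {
                tag = "O"; // outside any entity
            } else {
                // B- marks the first token of an entity, I- every following token
                tag = (t.beginsEntity ? "B-" : "I-") + t.entityType;
            }
            sb.append(t.text).append('\t').append(tag).append('\n');
        }
        return sb.toString();
    }
}
```

With such a layout, the column mapping in the properties file would be `map = word=0,answer=1`, as in the example from the Stanford NER FAQ.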

Btw, here is the FAQ for Stanford NER training.

reckart commented 7 years ago

wrt properties file: Normally, we would not have a configuration file - all parameters would be on the component itself. The component might internally generate a file or pass the settings on directly to the underlying training code.

wrt training files: Normally, we would extract the data from the CAS and pass it on directly to the training code without writing it to a file first.

However, doing the above likely involves quite a bit of work given the way that the Stanford CRF is implemented. For that reason, you might prefer implementing your training component in such a way that the properties file is passed as a parameter and the training data is written out to a temporary file.
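
The temporary-file approach suggested here might look roughly like the sketch below. It assumes the user's properties have already been loaded from the file passed as a parameter; the class and method names are hypothetical, and the actual call into the Stanford trainer is deliberately left out. `trainFile` and `serializeTo` are the property names used in the Stanford NER FAQ example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class TrainingSetupSketch {
    /**
     * Writes the collected TSV training data to a temporary file and returns a copy
     * of the user's properties that points the trainer at that file.
     */
    static Properties withTempTrainingFile(Properties userProps, String tsvData, Path modelTarget) {
        try {
            Path trainFile = Files.createTempFile("ner-train", ".tsv");
            trainFile.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
            Files.write(trainFile, tsvData.getBytes(StandardCharsets.UTF_8));

            // Copy instead of mutating the caller's object; stringPropertyNames()
            // also picks up values that come from a defaults chain.
            Properties props = new Properties();
            for (String key : userProps.stringPropertyNames()) {
                props.setProperty(key, userProps.getProperty(key));
            }
            props.setProperty("trainFile", trainFile.toString());
            props.setProperty("serializeTo", modelTarget.toString());
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The component-level source and target parameters would then override whatever `trainFile` and `serializeTo` values the user's properties file happens to contain.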

neumannm commented 7 years ago

Thanks @reckart for your comments. Regarding the properties, including all parameters in the component would not be hard - it would just take many, many lines of code, because the Stanford NER trainer recognizes no fewer than 95 parameters.

Regarding the training data, I will do as you suggested.

reckart commented 7 years ago

@neumannm wow, that's a lot ;) Maybe start by allowing a properties file to be specified, and later we could expose the commonly changed settings directly as component parameters. It would be nice, though, if the component assumed some defaults (e.g. for English NER) when no properties file is specified at all.
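
One way to get that fallback behaviour is to layer an optional properties file over built-in defaults, as in the sketch below. The class name is hypothetical, and the default values are illustrative only: they are taken from the example properties file in the Stanford NER FAQ and are not the defaults the component actually ships with.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class NerDefaultsSketch {
    /** Illustrative defaults only (assumption, borrowed from the Stanford NER FAQ example). */
    static Properties builtInDefaults() {
        Properties d = new Properties();
        d.setProperty("map", "word=0,answer=1");
        d.setProperty("useWord", "true");
        d.setProperty("useNGrams", "true");
        d.setProperty("maxNGramLeng", "6");
        return d;
    }

    /** Returns the defaults, overlaid with the user's file when one is given (null means none). */
    static Properties resolve(Path propertiesFile) {
        Properties props = new Properties(builtInDefaults());
        if (propertiesFile != null) {
            try (Reader r = Files.newBufferedReader(propertiesFile, StandardCharsets.UTF_8)) {
                props.load(r); // values from the file shadow the defaults
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }
}
```

Note that values inherited this way are visible through `getProperty` and `stringPropertyNames()`, but not through `keySet()` or `size()`, which matters if the properties are later copied or written out.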

reckart commented 7 years ago

Is there more to do on this issue at the moment?

neumannm commented 7 years ago

I don't think so. Thanks for your fixes, btw. I hope that some people will use this component, and if there are problems with it, I think they will submit new issues.

reckart commented 7 years ago

Ok, cool. Then I'll close this issue.