allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Question: how to train NER over new datasets #2723

Closed matanox closed 5 years ago

matanox commented 5 years ago

I'm new to AllenNLP, and I was considering unleashing its NER training algorithm on new datasets. I'm a little reluctant about doing so, however, as I haven't found how to use the API for that. Can you point me at sample code for that? Have you recently gone through this procedure, and do you have any comments on its stability, or ballpark figures for resource consumption and duration?

matt-gardner commented 5 years ago

To train a model, take a configuration file (like the one that we have for our NER model), modify the paths in it to point to your data, and run allennlp train CONFIG_FILE -s PLACE_TO_SAVE_RESULTS.
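Concretely, the edit described above is just repointing the data paths in your copy of the config. A hedged sketch of the relevant fragment (the field names come from AllenNLP training configs; the local paths are placeholders you would substitute):

```jsonnet
// In your copy of the NER config, point the reader at your own data files
"train_data_path": "/path/to/your/train.txt",
"validation_data_path": "/path/to/your/dev.txt",
```

Then run `allennlp train my_config.jsonnet -s /path/to/output` as described above.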

arsalan993 commented 5 years ago

@matt-gardner Could you please share a sample training file? I want to see the acceptable format and required fields. Also, the config file contains:

```jsonnet
"tokens": {
    "type": "embedding",
    "embedding_dim": 50,
    "pretrained_file": "https://s3-us-west-2.amazonaws.com/allennlp/datasets/glove/glove.6B.50d.txt.gz",
    "trainable": true
},
```

Can I remove the `"pretrained_file"` and `"trainable"` lines, since I don't want to use a pretrained model?

Furthermore, I want my model to be trained with ELMo embeddings, but without using any pre-trained model.

matt-gardner commented 5 years ago

Example file: https://github.com/allenai/allennlp/blob/master/allennlp/tests/fixtures/data/conll2003.txt. You can see from the training config I pointed to that the dataset reader is a conll2003 reader: https://github.com/allenai/allennlp/blob/9dec020281ee9521e7f1ffd696bcbb102c399703/training_config/ner.jsonnet#L4-L7 If you look at our documentation (or the source code) you can see what the expected file format is, and if you look at our tests, you can see the fixture that is used for the test.
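For orientation, the CoNLL-2003 format that this reader expects uses one token per line with four whitespace-separated columns (token, POS tag, chunk tag, NER tag), with blank lines separating sentences. An illustrative fragment in that shape (see the fixture linked above for the authoritative example):

```
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
```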

If you don't want pre-trained embeddings, you can remove those two lines, yes. For ELMo, look at our tutorial: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md.
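Removing those two lines leaves a randomly initialized, trainable embedding layer. A sketch of the resulting fragment, assuming the values from the config quoted above:

```jsonnet
// Randomly initialized embeddings; no "pretrained_file" means no GloVe download
"tokens": {
    "type": "embedding",
    "embedding_dim": 50
},
```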

arsalan993 commented 5 years ago

> Example file: https://github.com/allenai/allennlp/blob/master/allennlp/tests/fixtures/data/conll2003.txt. You can see from the training config I pointed to that the dataset reader is a conll2003 reader:
>
> `allennlp/training_config/ner.jsonnet`, lines 4 to 7 in 9dec020:
>
> ```jsonnet
> "dataset_reader": {
>     "type": "conll2003",
>     "tag_label": "ner",
>     "coding_scheme": "BIOUL",
> ```
>
> If you look at our documentation (or the source code) you can see what the expected file format is, and if you look at our tests, you can see the fixture that is used for the test. If you don't want pre-trained embeddings, you can remove those two lines, yes. For ELMo, look at our tutorial: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md.

If you take a look at the RASA implementation (https://rasa.com/docs/nlu/evaluation/), they do not intend to use the BIOUL annotation scheme, for the reasons explained at that link. Can we enable such behavior in the AllenNLP implementation by making some changes in the config file?
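For context, the scheme the reader applies is controlled by the `coding_scheme` field shown in the quoted config. A hedged sketch, assuming the conll2003 reader also accepts `"IOB1"` (the scheme the raw CoNLL-2003 data uses) as its other documented option:

```jsonnet
"dataset_reader": {
    "type": "conll2003",
    "tag_label": "ner",
    // "IOB1" keeps the tags as they appear in the file instead of converting to BIOUL
    "coding_scheme": "IOB1"
},
```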

matt-gardner commented 5 years ago

If your input data doesn't match the dataset readers that we have implemented, you could pretty easily write your own that matches the input format you have.
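The core of such a custom reader is just parsing your file into (tokens, tags) pairs. A minimal, library-free sketch of that parsing step for a CoNLL-style file (the function name is hypothetical; in AllenNLP you would wrap this inside a `DatasetReader` subclass whose `_read` turns each pair into an `Instance`):

```python
from typing import Iterable, Iterator, List, Tuple

def read_conll_sentences(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, ner_tags) pairs from CoNLL-2003-style lines.

    Each non-blank line holds whitespace-separated columns; we take the
    first column as the token and the last as the NER tag. Blank lines
    and "-DOCSTART-" markers separate sentences/documents.
    """
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        cols = line.split()
        tokens.append(cols[0])
        tags.append(cols[-1])
    if tokens:  # flush the final sentence if the file lacks a trailing blank line
        yield tokens, tags
```

If your format differs (e.g. no chunk column, or tab-separated), you would only need to adjust the column handling here; the sentence-splitting logic stays the same.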

saharghannay commented 4 years ago

Hi, the tutorial https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md is not working, could you give us the new link please?

dellielo commented 4 years ago

> Hi, the tutorial https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md is not working

It is now in the docs, here: https://github.com/allenai/allennlp/blob/master/docs/tutorials/how_to/elmo.md

mayhewsw commented 4 years ago

FWIW, the path for training config files has changed, and is now here.