Hironsan / anago

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
https://anago.herokuapp.com/
MIT License

No option to train on the whole dataset, must split and provide x_valid, y_valid #40

Closed peustr closed 6 years ago

peustr commented 6 years ago

Hi,

I am trying to train an NER model for which I want to use the entire dataset in my possession. Right now, in the train method, it looks like the x_valid and y_valid arguments are optional. However, if I leave them as None and don't pass them at all, I get the following error during training:

TypeError: object of type 'NoneType' has no len()

Which comes from the batch_iter method:

    113 
    114 def batch_iter(data, labels, batch_size, shuffle=True, preprocessor=None):
--> 115     num_batches_per_epoch = int((len(data) - 1) / batch_size) + 1
    116 
    117     def data_generator():

Using a validation set is useful when tuning the hyperparameters of the model. Once this is done, however, how can I train the final model on the entire dataset without having to split it?

dterg commented 6 years ago

You can provide anything as the validation set. It won't affect the training, the weights, or the predictions. Alternatively (to save computation time), in the train method of the Trainer class (in trainer.py), you can wrap the batch_iter call for the validation set in an if statement, as follows:

train_steps, train_batches = batch_iter(x_train, y_train, self.training_config.batch_size, preprocessor=self.preprocessor)

if x_valid is not None and y_valid is not None:
    valid_steps, valid_batches = batch_iter(x_valid, y_valid, self.training_config.batch_size,
                                            preprocessor=self.preprocessor)
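
To make the idea concrete, here is a minimal, self-contained sketch of the guard. It is not anaGo's actual code: `batch_iter` below is a simplified stand-in (one pass over the data, no shuffling, no preprocessor), and `prepare_batches` is a hypothetical helper name used only for illustration. The check uses `is not None` instead of plain truthiness, so it also behaves if the inputs are NumPy arrays, where `if x_valid:` can raise an error.

```python
from typing import Iterator, List, Sequence, Tuple

Batch = Tuple[List, List]


def batch_iter(data: Sequence, labels: Sequence, batch_size: int) -> Tuple[int, Iterator[Batch]]:
    """Simplified stand-in for anaGo's batch_iter (assumption: single pass, no shuffle)."""
    # Same step-count formula as the line shown in the traceback.
    num_batches_per_epoch = int((len(data) - 1) / batch_size) + 1

    def data_generator() -> Iterator[Batch]:
        for i in range(num_batches_per_epoch):
            start = i * batch_size
            end = min(start + batch_size, len(data))
            yield list(data[start:end]), list(labels[start:end])

    return num_batches_per_epoch, data_generator()


def prepare_batches(x_train, y_train, x_valid=None, y_valid=None, batch_size=32):
    """Hypothetical helper: build validation batches only when both sets are given."""
    train_steps, train_batches = batch_iter(x_train, y_train, batch_size)

    # Default to "no validation" so len(None) is never evaluated.
    valid_steps, valid_batches = None, None
    if x_valid is not None and y_valid is not None:
        valid_steps, valid_batches = batch_iter(x_valid, y_valid, batch_size)

    return train_steps, train_batches, valid_steps, valid_batches
```

With this shape, the rest of the training code would also need to handle `valid_batches` being None, for example by skipping evaluation callbacks that depend on it.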

Hironsan commented 6 years ago

In anaGo 1.0.0, this problem is solved. Thanks!

Example:

model = anago.Sequence()
model.fit(x_train, y_train)

peustr commented 6 years ago

Thank you!