Hironsan / anago

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
https://anago.herokuapp.com/
MIT License

ValueError: max() arg is an empty sequence #62

Closed Rowing0914 closed 6 years ago

Rowing0914 commented 6 years ago

When I run the code below, I get stuck at the error in the title. Why?

Using TensorFlow backend.
2018-05-22 11:47:25.286883: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
Epoch 1/15
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    model.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/wrapper.py", line 50, in train
    trainer.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/trainer.py", line 51, in train
    callbacks=callbacks)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/engine/training.py", line 2145, in fit_generator
    generator_output = next(output_generator)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/utils/data_utils.py", line 770, in get
    six.reraise(value.__class__, value, value.__traceback__)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/utils/data_utils.py", line 635, in _data_generator_task
    generator_output = next(self._generator)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/reader.py", line 137, in data_generator
    yield preprocessor.transform(X, y)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in transform
    sents, y = self.pad_sequence(words, chars, y)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 148, in pad_sequence
    word_ids, sequence_lengths = pad_sequences(word_ids, 0)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 197, in pad_sequences
    max_length = len(max(sequences, key=len))
ValueError: max() arg is an empty sequence

import anago
from anago.reader import load_data_and_labels

x_train, y_train = load_data_and_labels('./data/train.txt')
x_valid, y_valid = load_data_and_labels('./data/valid.txt')
x_test, y_test = load_data_and_labels('./data/test.txt')

model = anago.Sequence()
model.train(x_train, y_train, x_valid, y_valid)
model.eval(x_test, y_test)
words = 'President Obama is speaking at the White House.'.split()
model.analyze(words)
Rowing0914 commented 6 years ago

When I tried with a small dataset, I got the error above; but when I fed in a large dataset, like the one you pushed to this repo, it works. So, did you set any lower limit on the dataset size?

Rowing0914 commented 6 years ago

Does anyone know what is causing this?

Hironsan commented 6 years ago

Probably, sequences is an empty list:

>>> max([], key=len)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: max() arg is an empty sequence
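
A quick way to confirm this is to look for empty sentences right after loading the data; a minimal sketch, reusing the x_train/y_train variables from the script above:

# Assumes x_train/y_train were loaded with load_data_and_labels as above.
# Zero sentences, or sentences of length 0, lead to max() being called
# on an empty sequence inside pad_sequences.
print('number of training sentences:', len(x_train))
empty = [i for i, sent in enumerate(x_train) if len(sent) == 0]
print('indices of empty sentences:', empty)
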
jannikbertram commented 6 years ago

Maybe you cut off your data at the wrong point? Check the last rows of train.txt and valid.txt and make sure there is an empty line at the end and that the last sentences are complete (a sentence is terminated by an empty line after it).
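
A minimal sketch of that check (not part of anago; the file paths are just the ones from the original script), which appends a blank line if one is missing:

def ensure_trailing_blank_line(path):
    # Make sure a CoNLL-style file ends with a blank line so that
    # the last sentence is properly terminated.
    with open(path, 'r+') as f:
        text = f.read()
        if not text.endswith('\n\n'):
            f.write('\n' if text.endswith('\n') else '\n\n')

for p in ('./data/train.txt', './data/valid.txt'):
    ensure_trailing_blank_line(p)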

Rowing0914 commented 6 years ago

Hi Hironsan and bode94, thank you for your comments. I knew that much; the thing is, I didn't know why it caused this error. Anyway, bode94 is right: I didn't put an empty line at the end of the training data. That's why the distributed dataset works while mine didn't... Thank you both!

Hope you are doing well!

Best, Rowing0914

Rowing0914 commented 6 years ago

Hmm, even though I put an empty line at the end of the training data, the issue is not solved. I think I imitated the original training data given in the directory exactly.

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    model.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/wrapper.py", line 50, in train
    trainer.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/trainer.py", line 51, in train
    callbacks=callbacks)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/engine/training.py", line 2213, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/callbacks.py", line 76, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/metrics.py", line 124, in on_epoch_end
    for i, (data, label) in enumerate(self.valid_batches):
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/reader.py", line 150, in data_generator
    yield preprocessor.transform(X, y)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in transform
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in <listcomp>
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in <listcomp>
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
KeyError: 'I-MISC'
import anago
from anago.reader import load_data_and_labels

x_train, y_train = load_data_and_labels('../data/conll2003/en/ner/train_1.txt')
x_valid, y_valid = load_data_and_labels('../data/conll2003/en/ner/valid_1.txt')
x_test, y_test = load_data_and_labels('../data/conll2003/en/ner/test_1.txt')

model = anago.Sequence()
model.train(x_train, y_train, x_valid, y_valid)
model.eval(x_test, y_test)
words = 'President Obama is speaking at the White House.'.split()
model.analyze(words)

train.txt:

EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

Peter B-PER
Blackburn I-PER

BRUSSELS B-LOC
1996-08-22 O

The O
European B-ORG
Commission I-ORG
said O
on O
Thursday O
it O
disagreed O
with O
German B-MISC
advice O
to O
consumers O
to O
shun O
British B-MISC
lamb O
until O
scientists O
determine O
whether O
mad O
cow O
disease O
can O
be O
transmitted O
to O
sheep O
. O

Germany B-LOC
's O
representative O
to O
the O
European B-ORG
Union I-ORG
's O
veterinary O
committee O
Werner B-PER
Zwingmann I-PER
said O
on O
Wednesday O
consumers O
should O
buy O
sheepmeat O
from O
countries O
other O
than O
Britain B-LOC
until O
the O
scientific O
advice O
was O
clearer O
. O

" O
We O
do O
n't O
support O
any O
such O
recommendation O
because O
we O
do O
n't O
see O
any O
grounds O
for O
it O
, O
" O
the O
Commission B-ORG
. O

So could you tell me the proper format for the dataset? There is no description of it anywhere.

jannikbertram commented 6 years ago

This error happens when your validation set contains tags that are not present in your training set.

As this can also happen in other kinds of machine learning problems, I built a workaround for it:

I defined a new Preprocessor class that adds the tags from the validation set to the self.vocab_tag dictionary.

# WordPreprocessor lives in anago's preprocess module (the one shown in the
# tracebacks above); adjust the import if your installed version differs.
from anago.preprocess import WordPreprocessor


class Preprocessor(WordPreprocessor):

    def fit(self, x_train, y_train, y_valid):
        # Build the word/char/tag vocabularies from the training data as usual.
        super().fit(x_train, y_train)

        # Collect every tag that appears in the validation set.
        entities = set()
        for sent in y_valid:
            entities.update(sent)

        # Add validation-only tags to the tag vocabulary so transform() can map them.
        for t in entities:
            if t not in self.vocab_tag:
                self.vocab_tag[t] = len(self.vocab_tag)

        return self

You also need a new wrapper class that is almost equivalent to Sequence, but uses your new preprocessor:

# These imports mirror the ones used by anago's own Sequence wrapper; the exact
# module paths may differ slightly between anago versions.
from anago.wrapper import Sequence
from anago.models import SeqLabeling
from anago.preprocess import filter_embeddings
from anago.trainer import Trainer


class AnagoWrapper(Sequence):

    def train(self, x_train, y_train, x_valid=None, y_valid=None, vocab_init=None):
        # Fit the custom preprocessor on the training *and* validation tags.
        self.p = Preprocessor(vocab_init=vocab_init).fit(x_train, y_train, y_valid)
        embeddings = filter_embeddings(self.embeddings, self.p.vocab_word,
                                       self.model_config.word_embedding_size)
        self.model_config.vocab_size = len(self.p.vocab_word)
        self.model_config.char_vocab_size = len(self.p.vocab_char)

        # Build the model with a tag vocabulary that now covers the validation set.
        self.model = SeqLabeling(self.model_config, embeddings, len(self.p.vocab_tag))

        trainer = Trainer(self.model,
                          self.training_config,
                          checkpoint_path=self.log_dir,
                          preprocessor=self.p)
        trainer.train(x_train, y_train, x_valid, y_valid)
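
It can then be used in place of anago.Sequence; for example, with the variables from the scripts above:

model = AnagoWrapper()
model.train(x_train, y_train, x_valid, y_valid)
model.eval(x_test, y_test)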

Anyway, I am thinking about changing my preprocessor to take a predefined list of tags into self.vocab_tag, since the same error can still occur once you test your model and your test set contains tags that are not present in the training or validation set.
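
A minimal sketch of that idea (assuming the same WordPreprocessor import as above; the tags argument is my own and not part of anago's API):

class PredefinedTagPreprocessor(WordPreprocessor):

    def fit(self, x_train, y_train, tags=None):
        super().fit(x_train, y_train)
        # Register every expected tag up front (e.g. the full CoNLL 2003 tag set)
        # so tags unseen in training never raise a KeyError in transform().
        for t in tags or []:
            if t not in self.vocab_tag:
                self.vocab_tag[t] = len(self.vocab_tag)
        return self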

Rowing0914 commented 6 years ago

Hi bode94

Thank you for your prompt response! Oh yeah, it's probably because I created the datasets using head -n 100 train/test/valid.txt > train/test/valid_1.txt

So the first 100 lines of each file may cut a sentence in half and miss some of the tags... Now I get it!! Thank you so much for your contribution as well! Let me check!
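
For reference, a small sketch (file names are just the ones mentioned above) that copies the first N complete sentences instead of the first N lines, so a sentence is never cut in half:

def head_sentences(src, dst, n=100):
    # Copy whole, blank-line-terminated sentences from src to dst.
    with open(src) as f, open(dst, 'w') as out:
        count = 0
        for line in f:
            out.write(line)
            if line.strip() == '':
                count += 1
                if count >= n:
                    break

head_sentences('train.txt', 'train_1.txt', n=100)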