Closed Rowing0914 closed 6 years ago
When I tried a small dataset, I got this error, but when I fed in a huge dataset, like the one you pushed to this repo, it works. So did you set any limit on the data size?
Does anyone know about this??
Probably, sentences is an empty list:
>>> max([], key=len)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: max() arg is an empty sequence
Maybe you cut off your data at the wrong point?
Check the last rows of train.txt and valid.txt, and make sure there is an empty line at the end and that the last sentences are complete (a sentence is terminated by an empty line after it).
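To illustrate why the trailing empty line matters, here is a minimal sketch of a reader that only emits a sentence when it sees the blank line after it. This is a simplified stand-in, not anago's actual load_data_and_labels; the name read_conll and the whitespace-separated "token tag" layout are assumptions for illustration.

```python
def read_conll(lines):
    """Group 'token tag' lines into sentences; a blank line ends a sentence."""
    sents, labels = [], []
    words, tags = [], []
    for line in lines:
        line = line.strip()
        if line:
            # Non-empty line: a token and its tag, separated by whitespace.
            word, tag = line.rsplit(None, 1)
            words.append(word)
            tags.append(tag)
        elif words:
            # Blank line: close the sentence collected so far.
            sents.append(words)
            labels.append(tags)
            words, tags = [], []
    # Tokens after the last blank line are never emitted (hypothetical reader
    # behaviour, chosen to show the failure mode discussed in this thread).
    return sents, labels


# Toy input: the second sentence has no terminating blank line.
sample = ["EU B-ORG", "rejects O", "", "Peter B-PER", "Blackburn I-PER"]
sents, labels = read_conll(sample)
print(sents)  # [['EU', 'rejects']]
```

With a reader like this, a file that contains no blank line at all yields an empty list of sentences, which is exactly the input that makes max() fail.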
Hi Hironsan and bode94, thank you for your comments. I knew what the error meant; what I didn't know was why it occurred. Anyway, bode94 is right: I hadn't put an empty line at the end of the training data. That's why it works with the distributed dataset but didn't work with mine. Thank you both!
Hope you are doing well!
Best, Rowing0914
Hmm, even though I put an empty line at the end of the training data, the issue was not solved. I think I imitated the original training data given in the directory exactly.
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    model.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/wrapper.py", line 50, in train
    trainer.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/trainer.py", line 51, in train
    callbacks=callbacks)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/engine/training.py", line 2213, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/callbacks.py", line 76, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/metrics.py", line 124, in on_epoch_end
    for i, (data, label) in enumerate(self.valid_batches):
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/reader.py", line 150, in data_generator
    yield preprocessor.transform(X, y)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in transform
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in <listcomp>
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in <listcomp>
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
KeyError: 'I-MISC'
import anago
from anago.reader import load_data_and_labels
x_train, y_train = load_data_and_labels('../data/conll2003/en/ner/train_1.txt')
x_valid, y_valid = load_data_and_labels('../data/conll2003/en/ner/valid_1.txt')
x_test, y_test = load_data_and_labels('../data/conll2003/en/ner/test_1.txt')
model = anago.Sequence()
model.train(x_train, y_train, x_valid, y_valid)
model.eval(x_test, y_test)
words = 'President Obama is speaking at the White House.'.split()
model.analyze(words)
train.txt:
EU B-ORG rejects O German B-MISC call O to O boycott O British B-MISC lamb O . O
Peter B-PER Blackburn I-PER
BRUSSELS B-LOC 1996-08-22 O
The O European B-ORG Commission I-ORG said O on O Thursday O it O disagreed O with O German B-MISC advice O to O consumers O to O shun O British B-MISC lamb O until O scientists O determine O whether O mad O cow O disease O can O be O transmitted O to O sheep O . O
Germany B-LOC 's O representative O to O the O European B-ORG Union I-ORG 's O veterinary O committee O Werner B-PER Zwingmann I-PER said O on O Wednesday O consumers O should O buy O sheepmeat O from O countries O other O than O Britain B-LOC until O the O scientific O advice O was O clearer O . O
" O We O do O n't O support O any O such O recommendation O because O we O do O n't O see O any O grounds O for O it O , O " O the O Commission B-ORG . O
So please tell me the proper format for the dataset. There is no description of it.
This error happens when your validation set contains tags that do not exist in your training set.
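A quick way to confirm this on your own data is to compare the two tag sets. The following is only a minimal sketch: the helper name missing_tags is hypothetical, and the toy labels stand in for the y_train and y_valid lists returned by load_data_and_labels.

```python
def missing_tags(y_train, y_valid):
    """Return the tags that occur in y_valid but never in y_train."""
    train_tags = {tag for sent in y_train for tag in sent}
    valid_tags = {tag for sent in y_valid for tag in sent}
    return valid_tags - train_tags


# Toy labels (made up for illustration): I-MISC appears only in validation.
y_train = [["B-ORG", "O", "B-MISC", "O"], ["B-PER", "I-PER"]]
y_valid = [["B-MISC", "I-MISC", "O"]]
print(missing_tags(y_train, y_valid))  # {'I-MISC'}
```

If the returned set is non-empty, any of those tags will trigger the KeyError as soon as the validation batches are transformed.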
Since this can also happen in other kinds of machine learning problems, I built a workaround for it:
I defined a new Preprocessor class that adds the tags from the validation set to self.vocab_tag.
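from anago.preprocess import WordPreprocessor  # module path assumed from the traceback above

class Preprocessor(WordPreprocessor):
    def fit(self, x_train, y_train, y_valid=None):
        # Build the vocabularies from the training data as usual.
        super().fit(x_train, y_train)
        # Collect every tag that occurs in the validation labels (if any).
        entities = set()
        for sent in (y_valid if y_valid is not None else []):
            entities.update(sent)
        # Give each unseen tag the next free id; sorted so ids are reproducible.
        for t in sorted(entities):
            if t not in self.vocab_tag:
                self.vocab_tag[t] = len(self.vocab_tag)
        return self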
You also need a new wrapper class that is almost equivalent to Sequence, but uses your new preprocessor:
# Import paths are assumed for the anago version in the traceback; adjust to your install.
from anago import Sequence
from anago.models import SeqLabeling
from anago.preprocess import filter_embeddings
from anago.trainer import Trainer

class AnagoWrapper(Sequence):
    def train(self, x_train, y_train, x_valid=None, y_valid=None, vocab_init=None):
        # Fit the custom preprocessor so validation-only tags get an id too.
        self.p = Preprocessor(vocab_init=vocab_init).fit(x_train, y_train, y_valid)
        embeddings = filter_embeddings(self.embeddings, self.p.vocab_word,
                                       self.model_config.word_embedding_size)
        self.model_config.vocab_size = len(self.p.vocab_word)
        self.model_config.char_vocab_size = len(self.p.vocab_char)
        self.model = SeqLabeling(self.model_config, embeddings, len(self.p.vocab_tag))
        trainer = Trainer(self.model,
                          self.training_config,
                          checkpoint_path=self.log_dir,
                          preprocessor=self.p)
        trainer.train(x_train, y_train, x_valid, y_valid)
Anyway, I am thinking about changing my preprocessor to take a predefined list of tags for self.vocab_tag, as the current version may still fail once you test your model and your test set contains tags that do not exist in the training or validation set.
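As a rough sketch of that idea, the tag vocabulary could be built from a fixed list instead of from the data. The helper build_tag_vocab is hypothetical, and reserving id 0 for a "<PAD>" entry is an assumption about how the preprocessor lays out its vocabulary; the tag list is the standard CoNLL-2003 BIO set seen in this thread.

```python
def build_tag_vocab(tags, pad="<PAD>"):
    """Map a fixed list of tags to integer ids, reserving id 0 for padding."""
    vocab = {pad: 0}  # assumption: id 0 is the padding entry
    for tag in tags:
        if tag not in vocab:
            vocab[tag] = len(vocab)
    return vocab


# Predefined tag list, independent of which tags a data subset happens to contain.
CONLL_TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
vocab_tag = build_tag_vocab(CONLL_TAGS)
print(vocab_tag["I-MISC"])  # 9
```

Because the ids no longer depend on the data, a tag that first shows up in the test set still has an entry, at the cost of having to know the tag set up front.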
Hi bode94,
Thank you for your prompt action! Oh, yeah, it's probably because I created the datasets using head -n 100 train/test/valid.txt > train/test/valid_1.txt
So the first 100 lines of each file probably contain different sets of tags... Now I get it! Thank you so much for your contribution as well! Let me check!
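A line-based head can also cut a file in the middle of a sentence. As a hedged alternative, here is a minimal sketch that keeps the first n complete sentences instead of the first n lines; the function name head_sentences is hypothetical, and the input is assumed to be a list of lines without trailing newlines (for example from f.read().splitlines()).

```python
def head_sentences(lines, n):
    """Return the lines of the first n sentences, each with its closing blank line."""
    out = []
    count = 0
    in_sentence = False
    for line in lines:
        if count >= n:
            break
        out.append(line)
        if line.strip():
            in_sentence = True
        elif in_sentence:
            # A blank line after some tokens closes one sentence.
            count += 1
            in_sentence = False
    return out


# Toy input with two complete sentences and the start of a third.
lines = ["EU B-ORG", "rejects O", "", "Peter B-PER", "Blackburn I-PER", "", "BRUSSELS B-LOC"]
subset = head_sentences(lines, 2)
print(len(subset))  # 6
```

Note that a small subset can still lack some tags, so the tag mismatch between the training and validation subsets has to be handled separately.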
When I run the code, I get stuck at the error in the title. Why?
Using TensorFlow backend.
2018-05-22 11:47:25.286883: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
Epoch 1/15
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    model.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/wrapper.py", line 50, in train
    trainer.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/trainer.py", line 51, in train
    callbacks=callbacks)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/engine/training.py", line 2145, in fit_generator
    generator_output = next(output_generator)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/utils/data_utils.py", line 770, in get
    six.reraise(value.__class__, value, value.__traceback__)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/utils/data_utils.py", line 635, in _data_generator_task
    generator_output = next(self._generator)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/reader.py", line 137, in data_generator
    yield preprocessor.transform(X, y)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in transform
    sents, y = self.pad_sequence(words, chars, y)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 148, in pad_sequence
    word_ids, sequence_lengths = pad_sequences(word_ids, 0)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 197, in pad_sequences
    max_length = len(max(sequences, key=len))
ValueError: max() arg is an empty sequence