explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

Error parsing text using SpaCy 2.0 categorizer #1676

Closed: adam612 closed this issue 6 years ago

adam612 commented 6 years ago

Following spaCy's text classification training example here:

https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py

in order to use Keras for multiclass classification of texts, I received a strange error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-26-fcb595f3a0ec> in <module>()
          9         for batch in batches:
         10             texts, annotations = zip(*batch)
    ---> 11             nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
         12         with textcat.model.use_params(optimizer.averages):
         13             scores = evaluate(nlp.tokenizer, textcat, dev_texts_i, l_test)

    /home/gpu1/anaconda3/lib/python3.6/site-packages/spacy/language.py in update(self, docs, golds, drop, sgd, losses)
        405                 continue
        406             grads = {}
    --> 407             proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
        408             for key, (W, dW) in grads.items():
        409                 sgd(W, dW, key=key)

    pipeline.pyx in spacy.pipeline.TextCategorizer.update()

    /home/gpu1/anaconda3/lib/python3.6/site-packages/thinc/api.py in begin_update(self, X, drop)
         59         callbacks = []
         60         for layer in self._layers:
    ---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
         62             callbacks.append(inc_layer_grad)
         63         def continue_update(gradient, sgd=None):

    /home/gpu1/anaconda3/lib/python3.6/site-packages/thinc/api.py in begin_update(X, *a, **k)
        174     def begin_update(X, *a, **k):
        175         forward, backward = split_backward(layers)
    --> 176         values = [fwd(X, *a, **k) for fwd in forward]
        177 
        178         output = ops.xp.hstack(values)

    /home/gpu1/anaconda3/lib/python3.6/site-packages/thinc/api.py in <listcomp>(.0)
        174     def begin_update(X, *a, **k):
        175         forward, backward = split_backward(layers)
    --> 176         values = [fwd(X, *a, **k) for fwd in forward]
        177 
        178         output = ops.xp.hstack(values)

    /home/gpu1/anaconda3/lib/python3.6/site-packages/thinc/api.py in wrap(*args, **kwargs)
        256     '''
        257     def wrap(*args, **kwargs):
    --> 258         output = func(*args, **kwargs)
        259         if splitter is None:
        260             to_keep, to_sink = output

    /home/gpu1/anaconda3/lib/python3.6/site-packages/thinc/api.py in begin_update(self, X, drop)
         59         callbacks = []
         60         for layer in self._layers:
    ---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
         62             callbacks.append(inc_layer_grad)
         63         def continue_update(gradient, sgd=None):

    /home/gpu1/anaconda3/lib/python3.6/site-packages/thinc/neural/_classes/attention.py in begin_update(self, Xs_lengths, drop)
         23     def begin_update(self, Xs_lengths, drop=0.):
         24         Xs, lengths = Xs_lengths
    ---> 25         attention, bp_attention = self._get_attention(self.Q, Xs, lengths)
         26         output, bp_output = self._apply_attention(attention, Xs, lengths)
         27 

    /home/gpu1/anaconda3/lib/python3.6/site-packages/thinc/neural/_classes/attention.py in _get_attention(self, Q, Xs, lengths)
         45                 attention[start+argmax] = 1.
         46             else:
    ---> 47                 self.ops.softmax(attention[start : start+length], inplace=True)
         48             start += length
         49         def get_attention_bwd(d_attention):

    ops.pyx in thinc.neural.ops.Ops.softmax()

    /home/gpu1/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims)
       2270 
       2271     return _methods._amax(a, axis=axis,
    -> 2272                           out=out, **kwargs)
       2273 
       2274 

    /home/gpu1/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims)
         24 # small reductions
         25 def _amax(a, axis=None, out=None, keepdims=False):
    ---> 26     return umr_maximum(a, axis, None, out, keepdims)
         27 
         28 def _amin(a, axis=None, out=None, keepdims=False):

    ValueError: zero-size array to reduction operation maximum which has no identity
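
For reference, this ValueError is what NumPy raises whenever a reduction such as max is applied to a zero-size array, which hints that something in the pipeline received empty input. A minimal reproduction, independent of spaCy:

    import numpy as np

    # max() over an array with no elements has no identity value to fall back on,
    # so NumPy raises the same ValueError seen in the traceback above.
    np.max(np.zeros((0,), dtype='f'))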

Debugging with the data and labels from spaCy's example, I found that my categories (y_train) are encoded properly; I verified this by matching my y_train multiclass labels to the textual data from the example. Hence the problem must be within the textual data (X_train), probably in the parsing process.

Any idea where the problem could be?

Here is a snippet of my code; X_train is a list of strings (just like the data in spaCy's example):


    import numpy as np
    from spacy.util import minibatch, compounding

    # nlp, textcat, evaluate(), X_train, y_train, y_test and dev_texts_i are
    # defined earlier in the script (following spaCy's train_textcat.py example).

    def dictionerize_cats(y_train):
        # Turn a list of single labels into a list of {label: bool} dicts.
        l = []
        all_labels = np.unique(y_train)
        for curr_label in y_train:
            l.append({cat: bool(cat == curr_label) for cat in all_labels})
        return l

    dic_cats = dictionerize_cats(y_train)
    dic_y_test = dictionerize_cats(y_test)

    for cat in np.unique(y_train):
        textcat.add_label(cat)

    train_data = list(zip(X_train,
                          [{'cats': cats} for cats in dic_cats]))

    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
    with nlp.disable_pipes(*other_pipes):  # only train the text categorizer
        print("Training the model...")
        print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
        optimizer = nlp.begin_training()
        for i in range(5):
            losses = {}
            batches = minibatch(train_data, size=compounding(4., 32., 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            with textcat.model.use_params(optimizer.averages):
                scores = evaluate(nlp.tokenizer, textcat, dev_texts_i, dic_y_test)
            print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
                  .format(losses['textcat'], scores['textcat_p'],
                          scores['textcat_r'], scores['textcat_f']))
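
For reference, the example script expects train_data to be a list of (text, {'cats': {label: bool}}) tuples. A quick sanity check on the inputs built above (a sketch using the names from this snippet) would be:

    # Sketch: sanity-check the training data built in the snippet above.
    assert len(X_train) == len(dic_cats), "texts and label dicts must line up"
    for i, text in enumerate(X_train):
        assert isinstance(text, str), "example %d is not a string: %r" % (i, text)
    print("first training example:", train_data[0])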

honnibal commented 6 years ago

Best guess: Do you have any empty documents? These should work, but the error looks like an array is empty.
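
A quick way to test this guess against the snippet above (a sketch; X_train and dic_cats are the names from that snippet) is to filter out empty or whitespace-only texts before building train_data:

    # Sketch: drop empty / whitespace-only texts (and their label dicts) before training.
    # X_train and dic_cats are the names used in the snippet above.
    pairs = [(text, cats) for text, cats in zip(X_train, dic_cats)
             if text and text.strip()]
    print("dropped %d empty documents" % (len(X_train) - len(pairs)))

    train_data = [(text, {'cats': cats}) for text, cats in pairs]

If training succeeds after this filter, the empty documents were the culprit.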

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.