Hironsan / anago

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
https://anago.herokuapp.com/
MIT License
1.48k stars, 371 forks

Bug related to y dimension #103

Open valentas-kurauskas opened 5 years ago

valentas-kurauskas commented 5 years ago

I tried

model = anago.Sequence()
model.fit(x_train, y_train)

on my own data, which has only two distinct labels, but it raised a Keras exception about mismatched shapes of X and y. It turned out that my data contained blocks of more than 32 sentences in which every sentence was a single word and every label was the same.

The following fix worked for me:

--- a/anago/preprocessing.py
+++ b/anago/preprocessing.py
@@ -107,7 +107,8 @@ class IndexTransformer(BaseEstimator, TransformerMixin):
             # >>> to_categorical([[1]], num_classes=4).shape
             # (1, 4)
             # So, I expand dimensions when len(y.shape) == 2.
-            y = y if len(y.shape) == 3 else np.expand_dims(y, axis=0)
+            #y = y if len(y.shape) == 3 else np.expand_dims(y, axis=0)
+            y = y if len(y.shape) == 3 else y.reshape(y.shape+(1,)).transpose([0,2,1])
             return features, y
         else:
             return features
@@ -237,7 +238,8 @@ class ELMoTransformer(IndexTransformer):
             # >>> to_categorical([[1]], num_classes=4).shape
             # (1, 4)
             # So, I expand dimensions when len(y.shape) == 2.
-            y = y if len(y.shape) == 3 else np.expand_dims(y, axis=0)
+            #y = y if len(y.shape) == 3 else np.expand_dims(y, axis=0)
+            y = y if len(y.shape) == 3 else y.reshape(y.shape+(1,)).transpose([0,2,1])
             return features, y
         else:
             return features

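This is caused by an inconsistency in Keras's to_categorical: when the last axis of its input has size 1, that axis is dropped, so a batch of single-word sentences comes back as a 2-D array of shape (batch, num_classes) instead of (batch, 1, num_classes). The np.expand_dims(y, axis=0) call in the current code does not solve this, because it adds the missing axis at the front, giving (1, batch, num_classes), rather than in the sequence-length position. Compare the output shapes of: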

import keras
import numpy
keras.utils.np_utils.to_categorical(numpy.array([[1],[1],[1]]))  # shape (3, 2): the size-1 axis is dropped
keras.utils.np_utils.to_categorical(numpy.array([[1,0],[1,0],[1,0]]))  # shape (3, 2, 2)
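
For anyone who wants to check the shapes without installing Keras, here is a minimal NumPy-only sketch. The one_hot helper is hypothetical: it is not part of anago or Keras and only imitates the behavior described above (dropping a trailing axis of size 1). The batch of 40 one-word sentences is made-up illustration data. It compares the current expand_dims(axis=0) call with the reshape/transpose from the patch, and with np.expand_dims(y, axis=1), which appears to give the same result in a single call.

```python
import numpy as np


def one_hot(y, num_classes):
    """Hypothetical stand-in for to_categorical (assumption, not the Keras code).

    Like the behavior described in this issue, it drops a trailing axis
    of size 1 before appending the class axis.
    """
    y = np.asarray(y, dtype=int)
    shape = y.shape
    if len(shape) > 1 and shape[-1] == 1:
        shape = shape[:-1]
    out = np.zeros((y.size, num_classes), dtype="float32")
    out[np.arange(y.size), y.ravel()] = 1
    return out.reshape(shape + (num_classes,))


# Made-up batch: 40 sentences, one word each, every label equal to 1.
y = one_hot(np.ones((40, 1), dtype=int), num_classes=2)
print(y.shape)  # (40, 2): the sequence-length axis is gone

# Current code: the new axis lands in the batch position.
current = np.expand_dims(y, axis=0)
print(current.shape)  # (1, 40, 2)

# Patch from this issue: the new axis lands in the sequence-length position.
patched = y.reshape(y.shape + (1,)).transpose([0, 2, 1])
print(patched.shape)  # (40, 1, 2)

# Possible one-call alternative with the same result.
simpler = np.expand_dims(y, axis=1)
print(np.array_equal(patched, simpler))  # True
```

If the one-call form holds up against the real to_categorical, it might be a more readable way to write the same fix, but I have only checked it against this stand-in.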