Hironsan / anago

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
https://anago.herokuapp.com/
MIT License

OOM with IndexedSlices Conversion #74

Open WenYanger opened 6 years ago

WenYanger commented 6 years ago

System information

Describe the problem

An OOM (out of memory) error occurred with this warning:

UserWarning: Converting sparse IndexedSlices to a dense Tensor with 577296600 elements. This may consume a large amount of memory.

However, a highly similar dataset runs fine with the same code. The size and format of the data are the same; it looks like this:

Text:   [['AAA', 'BBB', 'CCC'], ['AAA', 'BBB', 'CCC', 'DDD']]
Label: [['1', '0', '1'], ['1', '0', '1', '0']]

I wonder which step in my code (or data) leads to this warning, since another similar dataset hasn't raised it ~ T.T

An article on Stack Overflow says it is caused by the TensorFlow function tf.gather(). Maybe that is the issue?

https://stackoverflow.com/questions/45882401/how-to-deal-with-userwarning-converting-sparse-indexedslices-to-a-dense-tensor
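For reference, here is a minimal standalone sketch (assuming TensorFlow 1.x, which anago used at the time; all sizes are made up) of how that warning appears when the sparse gradient of tf.gather is forced into a dense tensor. The 577296600 elements in the warning would correspond to, for example, a roughly 1.92M-word vocabulary with 300-dim embeddings (1924322 * 300 = 577296600).

import tensorflow as tf  # assuming TensorFlow 1.x

# Made-up sizes chosen only so the element count matches the warning above.
vocab_size, embedding_dim = 1924322, 300
embeddings = tf.Variable(tf.random_normal([vocab_size, embedding_dim]))
ids = tf.placeholder(tf.int32, shape=[None])

looked_up = tf.gather(embeddings, ids)      # embedding lookup
loss = tf.reduce_sum(looked_up)
grad = tf.gradients(loss, [embeddings])[0]  # a tf.IndexedSlices, not a dense Tensor

# Anything that converts the sparse gradient to a dense Tensor (e.g. an op
# without IndexedSlices support) triggers the UserWarning quoted above:
dense_grad = tf.convert_to_tensor(grad)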

Source code / logs

import os
import pickle

import numpy as np
import gensim

print('Loading Data')
corpus_k = pickle.load(open('../data/keywords_cleaned_100.pkl', 'rb'))  # keyword lists per document
corpus_c = pickle.load(open('../data/corpus_cleaned_100.pkl', 'rb'))    # tokenized documents

# Build (or load cached) labels: '1' if a token appears in the document's
# keyword set, '0' otherwise.
if os.path.exists('../data/y_keyword_retrival.pkl'):
    y = pickle.load(open('../data/y_keyword_retrival.pkl', 'rb'))
else:
    y = []
    for i in range(corpus_c.shape[0]):
        if i % 1000 == 0:
            print(i)
        keywords = set(corpus_k[i])
        labels = ['1' if word in keywords else '0' for word in corpus_c[i]]
        y.append(labels)
    y = np.array(y)
    pickle.dump(y, open('../data/y_keyword_retrival.pkl', 'wb+'))

# Pre-trained word2vec embeddings
vec = gensim.models.word2vec.Word2Vec.load('../data/w2v_0428')

weights_file = './model_weights.h5'
params_file = './params.json'
preprocessor_file = './preprocessor.json'

print('Train Test Split ... ')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(corpus_c, y, test_size=0.1, random_state=42)

print('Training ... ')
import anago
model = anago.Sequence(
    word_lstm_size=300,
    word_embedding_dim=300,
    embeddings=vec,
    use_char=False
)
model.fit(X_train, y_train, batch_size=256, epochs=5)
s = model.score(X_test, y_test)
model.save(weights_file, params_file, preprocessor_file)
psinger commented 5 years ago

Same issue, any hints?

WenYanger commented 5 years ago

> Same issue, any hints?

No idea, bro.

psinger commented 5 years ago

> > Same issue, any hints?
>
> No idea, bro.

If this is still relevant to you: for my problem at hand I found an issue with very long documents and tokens. The current code pads sequences and tokens to the longest sequence and token within the current batch. So if, for example, the batch contains a token of length 1000, every token gets padded to that length, which can blow up memory allocation heavily.
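A quick illustration of that per-batch padding effect (just a sketch with Keras' pad_sequences, not anago's actual code): the padded width of each batch is set by its longest sequence, so one outlier inflates the whole batch.

from keras.preprocessing.sequence import pad_sequences

# Two batches with the same number of sequences; the second has one outlier.
batch_a = [[1, 2, 3]] * 255 + [[1, 2, 3, 4]]
batch_b = [[1, 2, 3]] * 255 + [list(range(10000))]

print(pad_sequences(batch_a).shape)  # (256, 4)
print(pad_sequences(batch_b).shape)  # (256, 10000) -- one long sequence pads the whole batch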

One solution is to change the padding code in the package, or, more simply, pre-process your data so that sequences are at most length X and tokens at most length Y.
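For the second option, a rough pre-processing sketch (the caps and helper name are made up, not part of anago), applied to the data from the original post before calling fit():

MAX_SEQ_LEN = 200    # "length X": assumed cap on tokens per sequence
MAX_TOKEN_LEN = 30   # "length Y": assumed cap on characters per token (only matters with use_char=True)

def truncate(sentences, labels):
    X_out, y_out = [], []
    for words, tags in zip(sentences, labels):
        words = [w[:MAX_TOKEN_LEN] for w in words[:MAX_SEQ_LEN]]
        X_out.append(words)
        y_out.append(tags[:MAX_SEQ_LEN])
    return X_out, y_out

X_train_cut, y_train_cut = truncate(X_train, y_train)
model.fit(X_train_cut, y_train_cut, batch_size=256, epochs=5)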