ageron / handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

[QUESTION] CH16, Sentiment Analysis NumPy version issue / memory leak? #419

Open Barleysack opened 3 years ago

Barleysack commented 3 years ago

I ran into a couple of issues while executing this code in my local environment. It is the code from the textbook and from the Chapter 16 notebook, in the Sentiment Analysis section.

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow import keras
import tensorboard
import os
import numpy as np

(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

X_train[0][:10]

word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

datasets.keys()

train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

for X_batch, y_batch in datasets["train"].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

preprocess(X_batch, y_batch)

from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]

words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=5)

First, NumPy 1.20.1 does not work with TensorFlow 2.4.1 in this code. Running it raises

Cannot convert a symbolic Tensor (gru/strided_slice:0) to a numpy array.

as the exception message. I'm sorry I could not capture the full traceback, since I fixed it by downgrading NumPy to 1.19.2.

pip install numpy==1.19.2

handles this problem.
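For anyone who hits the same error, here is a quick sanity check you can run before the notebook (just a sketch based on my setup; the check below assumes the incompatibility is specific to the TF 2.4.x / NumPy 1.20.x combination):

import numpy as np
import tensorflow as tf

print("TensorFlow:", tf.__version__)  # 2.4.1 in my case
print("NumPy:", np.__version__)       # 1.19.2 works; 1.20.1 raised the error

# Warn about the combination that failed for me
if tf.__version__.startswith("2.4.") and np.__version__.startswith("1.20."):
    print("Warning: this TF/NumPy combination raised the symbolic-tensor "
          "error for me; try: pip install numpy==1.19.2")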

Second, even after solving the version issue, there is still a memory issue:

[_Derived_]RecvAsync is cancelled.
     [[{{node Adam/Adam/update/AssignSubVariableOp/_41}}]]
     [[gradient_tape/sequential/embedding/embedding_lookup/Reshape/_38]] [Op:__inference_train_function_115627]

To work around this exception, I added

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Memory growth must be set before the GPU has been initialized
        tf.config.experimental.set_memory_growth(gpus[0], True)
    except RuntimeError as e:
        print(e)

which solved the problem. However, I noticed that the local GPU memory is not fully utilized. I have attached screenshots of the performance graphs from the Windows 10 Task Manager.

This is what happens when I run other code from the textbook:

[Screenshot: Task Manager GPU performance graph, "Usual case"]

And this is what happens when I run the code in question :/

[Screenshot: Task Manager GPU performance graph, "Ch16Sentiment"]

Does anyone know why these are happening? By the way, sorry that the screenshots include some Korean; I hope the context makes them clear. :)

Versions: TensorFlow 2.4.1; NumPy 1.20.1 (downgraded to 1.19.2); Windows 10, local GPU.

ageron commented 3 years ago

Hi @Barleysack ,

Thanks for your feedback.

Regarding the first issue: indeed, the latest version of TensorFlow is not compatible with NumPy 1.20 (as of May 2021; this will change in the future), so installing NumPy 1.19.2 is the correct solution.

Regarding the second issue, I've never run into it, but it looks like TensorFlow issue #33721.

By default, TensorFlow reserves (almost) all the available RAM on the GPU, and then uses it. This is why you would usually see GPU RAM usage near 6 GB. But once you set the memory growth to True, TensorFlow behaves differently: it only reserves RAM as it needs it (and it never releases it). This is why the GPU RAM usage is quite low in the second image: your program simply does not use more RAM than that.

Setting the memory growth to True can be useful when you want to run two or more TensorFlow programs on the same machine at the same time, so they can share the GPU RAM. But if you only have a single TensorFlow program running on the machine, then I recommend just using the default (unless setting the memory growth to True fixes the memory bug for some reason).
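If you explicitly want to bound how much GPU RAM TensorFlow can take (e.g., to share the GPU between programs), another option is to create a virtual GPU device with a hard memory limit. Here is a minimal sketch (the 4096 MB limit is just an arbitrary example value; pick one that fits your GPU):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Cap TensorFlow at 4 GB on the first GPU (must run before GPU init)
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    except RuntimeError as e:
        print(e)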

Note that if your model sometimes uses too much GPU RAM, then setting the memory growth to True will not help. In this case, the solution is to reduce the batch size or the model size, and basically try to use less RAM.
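For example, something along these lines, reusing the variables from the code above (16 and 64 are arbitrary example values, not tuned recommendations):

batch_size = 16  # halved from 32
train_set = datasets["train"].repeat().batch(batch_size).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, 64,  # halved embedding size
                           mask_zero=True, input_shape=[None]),
    keras.layers.GRU(64, return_sequences=True),  # halved layer size
    keras.layers.GRU(64),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, steps_per_epoch=train_size // batch_size,
                    epochs=5)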

I hope this helps.