amaiya / ktrain

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply

Text classification: OOM error, even for batch size=1, while running on multiple GPUs #303

Closed · Liranbz closed this issue 3 years ago

Liranbz commented 3 years ago

Hi, I'm trying to run my script on EC2 (a p3.16xlarge) like this:

import os
import tensorflow as tf
import ktrain
from ktrain import text

# (data loading omitted; x_train, x_test_new, y_train, y_test_new come from earlier in the script)
x_train, x_test = list(x_train), list(x_test_new)
y_train, y_test = list(y_train), list(y_test_new)
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                                                          x_test=x_test, y_test=y_test,
                                                                          #class_names=y_train,
                                                                          preprocess_mode="bert",
                                                                          maxlen=25,
                                                                          ngram_range=3,
                                                                          max_features=400)

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test),
                    batch_size=1, eval_batch_size=1)

and I got this error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[30522,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add]

The same script runs end to end on a p3.2xlarge with batch_size=6 and eval_batch_size=32. Any ideas? Thanks

amaiya commented 3 years ago

Hello,

This sounds more like a TensorFlow issue, as the ktrain calls are just wrapping calls to tf.keras. Maybe you can try moving the get_learner call outside of the mirrored_strategy scope. According to the tf.keras documentation, only model creation and model compilation need to be within the scope, and both are done by the text_classifier invocation.
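
For example, restructuring it roughly like this (an untested sketch that reuses the names from your snippet) keeps only the model building and compiling inside the scope:

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # text_classifier builds and compiles the tf.keras model, so it stays inside the scope
    model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

# get_learner only wraps the model and data, so it can sit outside the scope
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=1, eval_batch_size=1)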

Also, if you're using multiple GPUs to speed up BERT training, I would strongly recommend using DistilBERT instead of BERT. The performance is nearly the same, but the size is cut roughly in half, which means faster training and less memory. You can use DistilBERT either with text_classifier('distilbert', ...) or through the Transformer API. Note that the distilbert preprocessing returns instances of TransformerDataset, not arrays (e.g., x_train, y_train), so you'll need to adjust how you capture the return values of texts_from_array, as shown in this example notebook.
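
A minimal sketch of the distilbert route (untested; maxlen and batch sizes are just illustrative, and trn/val are passed to get_learner directly rather than as (x, y) tuples):

# texts_from_array with preprocess_mode='distilbert' returns TransformerDataset objects
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          preprocess_mode='distilbert',
                                          maxlen=25)
with mirrored_strategy.scope():
    model = text.text_classifier('distilbert', train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                             batch_size=6, eval_batch_size=32)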

EDIT: Also, there are two implementations of BERT in ktrain: one based on Hugging Face transformers and one based on keras_bert. It looks like you tried the keras_bert one, so if you have trouble with one, you can also try the other.
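
If you want to try the Hugging Face implementation of BERT itself, the Transformer API looks roughly like this (again an untested sketch; the model name and the class_names handling are illustrative):

t = text.Transformer('bert-base-uncased', maxlen=25, class_names=list(set(y_train)))
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
with mirrored_strategy.scope():
    model = t.get_classifier()   # builds and compiles the model, so it goes inside the scope
learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                             batch_size=6, eval_batch_size=32)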

Liranbz commented 3 years ago

Hi, thank you for your answer. I moved the get_learner call outside of the mirrored_strategy scope, but I still get the same error. I'm trying to understand why I get an OOM error only with multiple GPUs and not with a single GPU.

amaiya commented 3 years ago

Hmm - yeah, that's strange. I have noticed some weird things with multi-GPU training in TF 2, so it seems a little flaky. See the example in this FAQ entry; it worked the last time I tried it with a previous version of TensorFlow.

Liranbz commented 3 years ago

Solved: I reinstalled my venv and now it works well, thank you!