Hello,
This sounds like more of a TensorFlow issue, as the ktrain calls are just wrapping calls to `tf.keras`. Maybe you can try moving the `get_learner` call outside of the `mirrored_strategy` scope. According to the `tf.keras` documentation, only the model creation and model compilation need to be within the scope, both of which are done by the `text_classifier` invocation.
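Something like this minimal sketch is what I have in mind (the data-loading call, folder path, and hyperparameters are just placeholders; adjust to your setup):

```python
import tensorflow as tf
import ktrain
from ktrain import text

# placeholder data loading -- swap in your own folder/preprocessing
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    'data/', maxlen=500, preprocess_mode='bert')

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # model creation and compilation both happen inside text_classifier,
    # so this is the only call that needs to be inside the scope
    model = text.text_classifier('bert', train_data=(x_train, y_train),
                                 preproc=preproc)

# get_learner moved outside the scope
learner = ktrain.get_learner(model,
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=6)
learner.fit_onecycle(2e-5, 1)
```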
Also, if you're using multiple GPUs to speed up BERT training, I would strongly recommend using `distilbert` instead of BERT. The performance is nearly the same, but the size is cut in half, which means faster training and less memory. You can use DistilBERT either with `text_classifier('distilbert', ...)` or using the Transformer API. Note that DistilBERT returns instances of `TransformerDataset`, not arrays (e.g., `x_train`, `y_train`), so you'll need to replace the return values of `text_classifier` as shown in this example notebook.
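A rough sketch of the Transformer API route (here `x_train`/`y_train` and `x_test`/`y_test` are assumed to be plain lists of texts and labels, and the model name and class names are placeholders):

```python
import ktrain
from ktrain import text

# model name and class names below are placeholders -- adjust to your task
t = text.Transformer('distilbert-base-uncased', maxlen=500,
                     class_names=['neg', 'pos'])
trn = t.preprocess_train(x_train, y_train)  # TransformerDataset, not arrays
val = t.preprocess_test(x_test, y_test)     # TransformerDataset, not arrays
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(3e-5, 1)
```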
EDIT: Also, there are two implementations of BERT in ktrain: Hugging Face `transformers` and `keras_bert`. It looks like you tried `keras_bert`, so if you have trouble with one, you can also try the other.
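In code, selecting between the two looks roughly like this (same placeholder data and model names as above):

```python
from ktrain import text

# keras_bert implementation: preprocess_mode='bert' + text_classifier('bert', ...)
trn, val, preproc = text.texts_from_folder('data/', maxlen=500,
                                           preprocess_mode='bert')
model = text.text_classifier('bert', train_data=trn, preproc=preproc)

# Hugging Face transformers implementation: same Transformer API flow as the
# DistilBERT sketch above, just with a BERT model name
t = text.Transformer('bert-base-uncased', maxlen=500, class_names=['neg', 'pos'])
```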
Hi, thank you for your answer. I moved the `get_learner` call outside of the `mirrored_strategy` scope, but I still get the same error. I'm trying to understand why I get an OOM error only with multiple GPUs and not with a single GPU.
Hmm - yeah, that's strange. I have noticed some weird things with multi-GPU training in TF 2, so it seems a little flaky. See the example in this FAQ entry. It worked the last time I tried it with a previous version of TensorFlow.
Solved: I reinstalled my venv and now it works well. Thank you!
Hi, I'm trying to run my script on EC2 (p3.16xl) in this way:

and I got this error:

The same script runs end to end on a p3.2xl with batch_size=6 and eval_batch_size=32. Any ideas? Thanks