Hi! I'm trying to increase the batch size for the training of the model, but each time I execute it, in the training phase, it gives me the following error:
(0) Resource exhausted: OOM when allocating tensor with shape[16,112,40,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Which normally points to a lack of RAM. Because of this, I tried different approaches to solving the problem:

- Increasing the memory and CPU of the notebook instance, from `ml.t3.medium` [2 vCPUs, 4 GB RAM] to `ml.m5.2xlarge` [8 vCPUs, 32 GB RAM]
- Increasing the memory and CPU of the training job, i.e. changing the `instance_type` input parameter inside `CustomFramework` in the `2_train_model.ipynb` notebook, from `ml.p3.2xlarge` [8 vCPUs, 61 GB RAM] to `ml.g4dn.8xlarge` [32 vCPUs, 128 GB RAM]
However, with both instances I got the same error. Is there a way to increase the batch size without getting a `ResourceExhaustedError`?
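A common workaround (not specific to this repo, just a general technique, sketched here with plain NumPy on a toy linear model rather than the actual training code) is gradient accumulation: compute gradients on small micro-batches that fit in GPU memory and average them before applying the update, which reproduces the gradient of the large batch:

```python
# Sketch: gradient accumulation simulates a large effective batch size
# while only holding one small micro-batch in memory at a time.
# Toy linear-regression example; in TensorFlow you would accumulate
# tape.gradient results the same way before one optimizer.apply_gradients.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))          # "large" batch of 16 samples
y = rng.normal(size=16)
w = np.zeros(3)

def grad(w, Xb, yb):
    # Gradient of mean squared error 0.5 * mean((Xb @ w - yb) ** 2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Gradient computed on the full batch of 16 at once
g_full = grad(w, X, y)

# Same gradient computed as 4 micro-batches of 4, then averaged
micro = 4
g_accum = np.zeros_like(w)
for i in range(0, len(y), micro):
    g_accum += grad(w, X[i:i + micro], y[i:i + micro])
g_accum /= len(y) // micro

# Identical update, roughly 1/4 of the peak activation memory
assert np.allclose(g_full, g_accum)
```

Whether this is practical here depends on whether the TF ODI training loop exposes the gradient step; some pipelines only let you set the batch size in the config.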
@nfbalbontin I think this is a lack of GPU memory, so it's more a question of how TF ODI manages it across GPUs.
I would check the issues on the TF ODI GitHub repo for help with this.
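To see why changing the instance's CPU/RAM didn't help: the allocator named in the error (`GPU_0_bfc`) lives on the GPU, and both `ml.p3.2xlarge` (one V100) and `ml.g4dn.8xlarge` (one T4) have 16 GB of GPU memory, regardless of instance RAM. A quick back-of-envelope calculation on the tensor shape from the error message shows each such activation is small, but one is allocated per layer per step, and together they exhaust the fixed GPU memory:

```python
# Size of the single tensor named in the OOM error:
# "OOM when allocating tensor with shape[16,112,40,40] and type float"
shape = (16, 112, 40, 40)
bytes_per_float32 = 4

n_elems = 1
for d in shape:
    n_elems *= d

mib = n_elems * bytes_per_float32 / 2**20
print(f"{mib:.1f} MiB")  # ~10.9 MiB for this one activation tensor
```

One such tensor is only ~11 MiB, but the framework keeps activations for every layer alive for backprop, so peak usage scales with batch size until the 16 GB GPU limit is hit, which is why only a smaller batch size (or multiple GPUs) avoids the error.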