Open bschifferer opened 2 years ago
The HugeCTR team suggested that this could be related to having multiple GPUs without using MirroredStrategy. They shared an example:
import os
import tensorflow as tf
import sparse_operation_kit as sok
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"  # use the CUDA async allocator
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader, KerasSequenceValidater
from nvtabular.framework_utils.tensorflow import layers
BATCH_SIZE = 64000
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, range(1)))
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    sok.Init(global_batch_size=BATCH_SIZE)
@bschifferer, please triage this bug.
I don't know if this bug is still valid; it is from April 6th. I haven't worked on SOK + dataloader since then. But if we want to provide both, then this is important.
@bschifferer, is this a P0 or P1?
I am not able to import the Merlin Dataloader and SOK together in TensorFlow. Either order (SOK then Dataloader, or Dataloader then SOK) throws an error (see thread).
Importing SOK and then Dataloader
Importing Dataloader and then SOK
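Both import orders can be checked in isolation by running each one in a fresh interpreter, so that state left behind by one attempt cannot affect the other. The following is a minimal sketch, not part of the original report: the helper names `try_import_order` and `report` are hypothetical, and the module paths are taken from the example shared above, on the assumption that `nvtabular.loader.tensorflow` is the dataloader in question.

```python
import subprocess
import sys

# Module names copied from the example in this thread (assumed to be the
# two packages that conflict).
SOK = "sparse_operation_kit"
DATALOADER = "nvtabular.loader.tensorflow"


def try_import_order(modules, timeout=600):
    """Import `modules` in the given order inside a fresh Python process.

    Returns (ok, stderr): ok is True only if the process exited cleanly,
    so a crash inside a native extension also counts as a failure.
    """
    code = "\n".join(f"import {name}" for name in modules)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stderr


def report():
    """Try both orders and print the tail of the traceback on failure."""
    for order in ([SOK, DATALOADER], [DATALOADER, SOK]):
        ok, stderr = try_import_order(order)
        print(" -> ".join(order), "OK" if ok else "FAILED")
        if not ok:
            print(stderr[-2000:])
```

Calling `report()` in the affected environment prints one result per order, which makes it easy to attach both tracebacks to the issue.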