[Question] Having a hard time running demo with TensorFlow 2 MirroredStrategy #385

Closed Nov11 closed 1 year ago

Nov11 commented 1 year ago

I'm new to Sparse Operation Kit (SOK) and want to use it in my project. I tried to write a small demo after reading 'sparse_operation_kit_demo' under the notebook folder. Basically, it calls the embedding layer inside a MirroredStrategy scope. Unfortunately, it doesn't work and I can't figure out what's missing. Please help me fix this.

Code :

import os
import sparse_operation_kit as sok
import tensorflow as tf

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

class SOKDemo(tf.keras.models.Model):
    def __init__(self):
        super(SOKDemo, self).__init__()

        self.embedding_layer = sok.DistributedEmbedding(combiner='sum',

    def call(self, inputs, training=True):
        embedding_vector = self.embedding_layer(inputs, training=training)
        embedding_vector = tf.reshape(embedding_vector, shape=[-1, 4])
        return embedding_vector

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    result = sok.Init(global_batch_size=10)

    plugin_demo = SOKDemo()

p = tf.sparse.SparseTensor(indices=tf.constant([[0, 0], [1, 0], [1, 1]], dtype=tf.int64),
                           values=tf.constant([1, 1, 1], dtype=tf.int64),
                           dense_shape=tf.constant([2, 3], dtype=tf.int64))

def work(_p):
    return plugin_demo(_p)

strategy.run(work, args=(p,))

I think it reports that something is not ready after a timeout. I just cannot find out what is missing.

kanghui0204 commented 1 year ago

Hi @Nov11, thank you for using SOK. In your example, I think you need to use graph mode to run the TF MirroredStrategy, like this:

# Wrapping the replica function in tf.function runs it in graph mode.
@tf.function
def work(_p):
    return plugin_demo(_p)

I have already tried your example locally, and it shows that not using eager mode solves the problem. You can try it on your own machine.

Why does the problem happen? SOK is a model-parallel embedding, so every GPU must launch the op together. Between the per-GPU threads (MirroredStrategy is single-process and multi-threaded) we synchronize using std::condition_variable. In eager mode, TensorFlow does not launch all the threads concurrently; it launches them serially, which creates a deadlock. If the deadlock lasts long enough, std::condition_variable::wait_for gives up and SOK raises an error (BlockingCallOnce time out.). In TF graph mode all the threads are launched concurrently, so the problem is solved.
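The mechanism above can be imitated in plain Python, without SOK or TensorFlow. This is only an illustrative sketch: `run_replicas` and `replica_op` are hypothetical names, and `threading.Condition.wait_for` stands in for the `std::condition_variable` wait with a timeout that SOK uses internally. Each simulated replica announces its arrival and then waits for all peers. When replicas are launched one after another, the first one waits for peers that have not started yet and times out; when they are launched concurrently, all of them pass the sync point.

```python
import threading


def run_replicas(num_replicas, concurrent, timeout=1.0):
    """Simulate per-replica ops that must all reach a shared sync point.

    Hypothetical stand-in for the SOK behaviour described above; it does not
    call SOK or TensorFlow.
    """
    cond = threading.Condition()
    arrived = 0
    results = []

    def replica_op(replica_id):
        nonlocal arrived
        with cond:
            arrived += 1
            cond.notify_all()
            # Wait until every replica has arrived, or give up after `timeout`.
            ok = cond.wait_for(lambda: arrived == num_replicas, timeout=timeout)
            results.append((replica_id, "ok" if ok else "timed out"))

    if concurrent:
        # Graph-mode analogue: all replica threads are launched together.
        threads = [
            threading.Thread(target=replica_op, args=(i,))
            for i in range(num_replicas)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    else:
        # Eager-mode analogue: each replica runs only after the previous one
        # has returned, so the first replica can never see its peers arrive.
        for i in range(num_replicas):
            replica_op(i)

    return sorted(results)


if __name__ == "__main__":
    print("concurrent:", run_replicas(2, concurrent=True))
    print("serial:    ", run_replicas(2, concurrent=False, timeout=0.3))
```

In the serial case the first replica reports a timeout, which plays the role of the "BlockingCallOnce time out." error; in the concurrent case every replica gets through.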

Recommendation: we are planning a new SOK implementation, and some of its features are already available. The new SOK is called sok.experiment; you can find how to use it in SOK experiment. We recommend that users use the new SOK. In May 2023 we will promote SOK experiment to the official SOK and abandon the old SOK implementation.

Nov11 commented 1 year ago

Yes, graph mode works! Great explanation for the mechanism. I'll move on to sok.experiment. Thank you!