Closed by Nov11 1 year ago
Hi @Nov11, thank you for using SOK. In your example, I think you need to use graph mode to run the TF mirrored strategy, like this:
@tf.function
def work(_p):
    return plugin_demo(_p)
I already tried your example locally, and not using eager mode solves your problem. You can try it on your own machine.
Why does the problem happen?
Because SOK is a model-parallel embedding, every card has to launch the op together. Mirrored strategy runs as a single process with multiple threads, so we synchronize the per-card threads with a std::condition_variable. In eager mode, TensorFlow does not launch all the threads concurrently; it launches them serially, which creates a deadlock. If the deadlock lasts long enough, std::condition_variable.wait_for gives up and SOK raises an error (BlockingCallOnce time out.). TF graph mode can launch all the threads concurrently, so the problem is solved.
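To make the mechanism above concrete, here is a minimal sketch in plain Python threads. It is not SOK's code: `Rendezvous`, `replica_op`, `launch_serially` and `launch_concurrently` are hypothetical names, and `threading.Condition.wait_for` only stands in for `std::condition_variable::wait_for`. It assumes a sync point that every replica thread must reach before any of them can continue.

```python
import threading


class Rendezvous:
    """Hypothetical sync point: every party must arrive before any proceeds."""

    def __init__(self, parties, timeout):
        self.parties = parties
        self.timeout = timeout
        self.arrived = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            self.arrived += 1
            self.cond.notify_all()
            # Block until all parties have arrived or the timeout expires.
            ok = self.cond.wait_for(
                lambda: self.arrived >= self.parties, timeout=self.timeout
            )
            if not ok:
                raise TimeoutError("sync point timed out")


def replica_op(sync, results, idx):
    """Stand-in for the per-card op: it only waits at the sync point."""
    try:
        sync.wait()
        results[idx] = "ok"
    except TimeoutError:
        results[idx] = "timeout"


def launch_serially(n, timeout=0.2):
    """Eager-like launch: each replica runs to completion before the next starts."""
    sync = Rendezvous(n, timeout)
    results = [None] * n
    for i in range(n):
        replica_op(sync, results, i)
    return results


def launch_concurrently(n, timeout=5.0):
    """Graph-like launch: all replica threads are started before any is joined."""
    sync = Rendezvous(n, timeout)
    results = [None] * n
    threads = [
        threading.Thread(target=replica_op, args=(sync, results, i))
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("serial:", launch_serially(2))
    print("concurrent:", launch_concurrently(2))
```

In the serial case the first replica waits alone at the sync point and hits the timeout, because the second replica is not started until the first one returns. In the concurrent case both replicas reach the sync point and continue.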
Recommendation: we are planning a new SOK implementation, and some of its features are already usable. The new SOK is called sok.experiment; you can find how to use it in this SOK experiment. We recommend that users use the new SOK. In May 2023 we will move SOK experiment to SOK official and abandon the old SOK implementation.
Yes, graph mode works! Great explanation for the mechanism. I'll move on to sok.experiment. Thank you!
I'm new to Sparse Operation Kit and want to use it in my project. I tried to write a little demo after reading 'sparse_operation_kit_demo' under the notebook folder. Basically it calls the embedding layer inside a mirrored strategy scope. Unfortunately it doesn't work and I can't figure out what's missing. Please help me fix this.
Code:
Output from merlin-tensorflow container:
I think it says something is not ready after a timeout. I just cannot find out what is missing.