【grpc++】env_->rendezvous_mgr->RecvLocalAsync failed, error msg is: [_Derived_]End of sequence

kpsc commented 2 years ago

System information

Have I written custom code :
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
TensorFlow installed from (source or binary): DeepRec
TensorFlow version : tf1.15
Python version: python3.6

When I use grpc++ with Estimator, I get the following error, but training still continues. I don't know whether this is OK.

    config = tf.estimator.RunConfig(
        save_checkpoints_secs=10 * 60,
        keep_checkpoint_max=2,
        protocol='grpc++'
    )
    model = tf.estimator.Estimator(
        model_fn=model_fn,
        params=model_params,
        model_dir=checkpoint,
        config=config
    )
    eval_spec = tf.estimator.EvalSpec(...)
    train_spec = tf.estimator.TrainSpec(...)
    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

In the DeepRec docs, I found that there seems to be a problem with the original estimator, but my bazel build failed and I don't know what the Estimator check looks like when using grpc++. In the latest DeepRec version, do we need to install estimator separately?

shanshanpt commented 2 years ago

"End of sequence" means the input data has been exhausted. In general, estimator handles this exception naturally. If you use the 'MonitoredTrainingSession' API, you may also encounter this log. Which estimator did you install? We offer a version on GitHub: https://github.com/AlibabaPAI/estimator/tree/deeprec
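
To illustrate where the "End of sequence" text comes from, here is a minimal sketch using plain TensorFlow rather than DeepRec or grpc++. It assumes a TF 1.15-compatible install with the `tf.compat.v1` API; the helper name `drain_dataset` is hypothetical and only for illustration. Reading a finite dataset past its last element raises `tf.errors.OutOfRangeError`, which is the condition that estimator treats as the normal end of input.

```python
import tensorflow as tf


def drain_dataset(num_elements):
    """Read a finite dataset until exhausted; return (values, error message)."""
    values = []
    message = None
    # Build everything in an explicit graph so the sketch runs in graph mode.
    with tf.Graph().as_default():
        dataset = tf.data.Dataset.range(num_elements)
        iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
        next_element = iterator.get_next()
        with tf.compat.v1.Session() as sess:
            try:
                while True:
                    values.append(int(sess.run(next_element)))
            except tf.errors.OutOfRangeError as err:
                # Raised once the iterator has no more elements.
                message = err.message
    return values, message


values, message = drain_dataset(3)
print(values)   # the elements read before the data ran out
print(message)  # the text attached to the OutOfRangeError
```

If the loop is written by hand (for example around a `MonitoredTrainingSession`), catching `tf.errors.OutOfRangeError` in this way is one option for ending training cleanly when the data runs out.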

kpsc commented 2 years ago

Thanks for your reply. I have another question: when I use grpc++ in distributed training, it is slower than grpc. Is there anything else I need to configure for training? In the network, I only use ordinary TensorFlow embeddings.

liutongxuan commented 2 years ago

There's a list of tips to help you tune grpc++; see https://deeprec.readthedocs.io/zh/latest/GRPC%2B%2B.html

DeepRec-AI / DeepRec

【grpc++】env_->rendezvous_mgr->RecvLocalAsync failed, error msg is: [_Derived_]End of sequence #326