DeepRec-AI / DeepRec

DeepRec is a high-performance deep learning framework for recommendation, based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.
Apache License 2.0

【grpc++】env_->rendezvous_mgr->RecvLocalAsync failed, error msg is: [_Derived_]End of sequence #326

Open kpsc opened 2 years ago

kpsc commented 2 years ago

System information

When I use grpc++ with the estimator, I get the following error, but training still continues. I don't know whether this is OK.


config = tf.estimator.RunConfig(
    save_checkpoints_secs=10 * 60,
    keep_checkpoint_max=2,
    protocol='grpc++'
)
model = tf.estimator.Estimator(
    model_fn=model_fn,
    params=model_params,
    model_dir=checkpoint,
    config=config
)
eval_spec = tf.estimator.EvalSpec(...)
train_spec = tf.estimator.TrainSpec(...)
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

In the DeepRec docs, I found that there seems to be a problem with the original estimator, but my bazel build failed, and I don't know what the Estimator check looks like when using grpc++. In the latest DeepRec version, do we need to install a special estimator?

shanshanpt commented 2 years ago

"End of sequence" means the input data has been exhausted. In general, the estimator handles this exception naturally. If you use the 'MonitoredTrainingSession' API, you may encounter this log. Which estimator did you install? We offer a version on GitHub: https://github.com/AlibabaPAI/estimator/tree/deeprec

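To illustrate why the log above is harmless, here is a minimal sketch of the general pattern: the input pipeline signals exhaustion by raising an exception, and the training loop treats that as normal completion. This is plain Python, not DeepRec or estimator code. `EndOfSequence`, `make_batch_reader`, and `train` are hypothetical names used only for illustration; `EndOfSequence` stands in for `tf.errors.OutOfRangeError`, which TensorFlow raises with the message "End of sequence" when an iterator runs out of data.

```python
class EndOfSequence(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""


def make_batch_reader(batches):
    """Return a get_next()-style callable that raises EndOfSequence when the data runs out."""
    iterator = iter(batches)

    def get_next():
        try:
            return next(iterator)
        except StopIteration:
            # Mirror the message seen in the log from this issue.
            raise EndOfSequence("End of sequence") from None

    return get_next


def train(get_next, train_step):
    """Run train_step on each batch until the input is exhausted.

    Running out of data is treated as normal completion, not as a failure.
    Returns the number of steps that were run.
    """
    steps = 0
    while True:
        try:
            batch = get_next()
        except EndOfSequence:
            # Data finished: stop the loop cleanly instead of propagating the error.
            break
        train_step(batch)
        steps += 1
    return steps


if __name__ == "__main__":
    seen = []
    steps = train(make_batch_reader([[1, 2], [3, 4], [5]]), seen.append)
    print(f"finished after {steps} steps")
```

In this sketch the exception never reaches the caller, which is the sense in which the error is "handled naturally"; a log line may still be printed at the point where the error is raised.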
kpsc commented 2 years ago

Thanks for your reply. I have another question: when I use grpc++ in distributed training, it is slower than grpc. Is there anything else I need to configure for training? In the network, I only use ordinary TensorFlow embeddings.

liutongxuan commented 2 years ago

There is a list of tips to help you tune grpc++; see https://deeprec.readthedocs.io/zh/latest/GRPC%2B%2B.html