drawbridge / keras-mmoe

A TensorFlow Keras implementation of "Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts" (KDD 2018)
MIT License

Running example got topological sort failed with message: The graph couldn't be sorted in topological order. #5

Closed hazhang-wish closed 4 years ago

hazhang-wish commented 4 years ago
Train on 199523 samples, validate on 49881 samples
Epoch 1/100
2019-12-16 19:35:43.742576: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally
199328/199523 [============================>.] - ETA: 0s - loss: 0.5737 - income_loss: 0.3506 - marital_loss: 0.2231 - income_acc: 0.9344 - marital_acc: 0.9256
2019-12-16 19:36:23.028460: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-12-16 19:36:23.029628: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-12-16 19:36:23.036058: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2019-12-16 19:36:23.037032: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:704] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.

Running either example produces the warning above during the first few epochs, and it causes some errors when I run model inference. How can I solve it? It looks like a bug in TF.
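For context on the message itself: a topological sort can only fail when the graph contains a cycle. The sketch below illustrates this with Kahn's algorithm in plain Python. It is not Grappler's implementation, and the node names (`input`, `expert`, `gate`, `output`, `a`, `b`, `c`) are hypothetical, chosen only for illustration.

```python
from collections import deque


def topological_sort(graph):
    """Return nodes in topological order using Kahn's algorithm.

    `graph` maps each node to the list of nodes it points to.
    Raises ValueError if the graph contains a cycle.
    """
    # Count incoming edges for every node, including nodes that
    # only ever appear as edge targets.
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1

    # Start from the nodes that have no incoming edges.
    ready = deque(node for node, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in graph.get(node, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    # Nodes on a cycle never reach indegree 0, so they are never emitted.
    if len(order) != len(indegree):
        raise ValueError("The graph couldn't be sorted in topological order.")
    return order


# Hypothetical graphs, for illustration only.
acyclic = {"input": ["expert"], "expert": ["gate"], "gate": ["output"], "output": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(topological_sort(acyclic))
try:
    topological_sort(cyclic)
except ValueError as err:
    print("failed:", err)
```

The acyclic graph sorts cleanly, while the three-node cycle raises the error. Whether the cycle in the TF graph here is harmless or a real problem is not something this sketch can tell.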

alvin319 commented 4 years ago

Hi! I'm not sure this is a bug in this code; it may be a bug in TF instead. Have you tried different versions of Keras and TensorFlow?

hazhang-wish commented 4 years ago

I tried Keras 2.2.4/2.2.5 with TF 1.13 and 1.14. Both TF versions have this issue. See: https://github.com/tensorflow/tensorflow/issues/24816

@alvin319 When you train the model, do you also have this warning?

alvin319 commented 4 years ago

I just tested the code and I am encountering the same warning, but I didn't have any issues during the model inference. Can you share some error logs?

hazhang-wish commented 4 years ago

I froze the session and exported the model to TensorFlow's *.pb model format, then loaded it back into TensorFlow. Inference from TF shows the above warning, and the inference results are also different.

alvin319 commented 4 years ago

I haven't experimented with that yet, and unfortunately this might be a TensorFlow issue. From the issue you've linked, I suspect the culprit might be https://github.com/drawbridge/keras-mmoe/blob/master/mmoe.py#L190. I don't have much time to work on this right now, but if you want to dig into it and submit a PR, please feel free!