bobroute opened this issue 4 years ago
**Describe the bug** When I set `export BYTEPS_ENABLE_ASYNC=1` and run the demo MNIST training, it produces an incorrect result like this:
```
BytePS: enable asynchronous training
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Create CheckpointSaverHook.
[12:06:44] src/customer.cc:368: Do not use thread pool for receiving.
[12:06:44] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[12:06:44] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[12:06:44] src/van.cc:357: Bind to role=worker, ip=10.0.0.1, port=21127, is_recovery=0
[12:06:44] src/./zmq_van.h:286: Start ZMQ recv thread
INFO:tensorflow:Graph was finalized.
2019-11-05 12:06:45.071764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:04:00.0
totalMemory: 22.38GiB freeMemory: 22.13GiB
2019-11-05 12:06:45.071808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-05 12:06:45.071844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 12:06:45.071858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-11-05 12:06:45.071870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-11-05 12:06:45.072599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21532 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/van.cc:306: W[11] is connected to others
INFO:tensorflow:loss = 2.3320873, step = 0
INFO:tensorflow:loss = 2.3230207, step = 0
INFO:tensorflow:loss = 2.3025851, step = 10 (0.428 sec)
INFO:tensorflow:loss = 2.3025851, step = 10 (0.430 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.311 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.312 sec)
```
It looks like the loss is incorrect and does not change after the first step. Can someone explain it?
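For context, here is roughly how the run is wired up, a minimal sketch in the spirit of the BytePS TensorFlow MNIST demo. The model and data below are stand-in placeholders rather than the actual demo code, and it assumes BytePS's Horovod-style TensorFlow API (`bps.init()`, `bps.DistributedOptimizer`, `bps.BroadcastGlobalVariablesHook`):

```python
# Minimal sketch of a BytePS + TF1 MonitoredTrainingSession loop.
# BYTEPS_ENABLE_ASYNC=1 is exported in the environment before launch.
import numpy as np
import tensorflow as tf
import byteps.tensorflow as bps

bps.init()  # initialize BytePS; the async flag is picked up by the core here

# Pin this worker to its assigned GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(bps.local_rank())

# Placeholder model/data, standing in for the real MNIST network.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

opt = tf.train.GradientDescentOptimizer(0.01)
opt = bps.DistributedOptimizer(opt)  # gradients pushed/pulled via BytePS servers
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    bps.BroadcastGlobalVariablesHook(0),  # start all workers from rank 0's weights
    tf.train.StopAtStepHook(last_step=100),
    tf.train.LoggingTensorHook({"loss": loss}, every_n_iter=10),
]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        xs = np.random.rand(32, 10).astype(np.float32)
        ys = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: ys})
```

With `BYTEPS_ENABLE_ASYNC=0` this same setup trains normally; only with the async flag set does the logged loss freeze at 2.3025851 (which is ln(10), i.e. an untrained 10-class model).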
Thank you for reporting this. We will try to reproduce and figure it out.