bytedance / byteps

A high performance and generic framework for distributed DNN training

BYTEPS_ENABLE_ASYNC=1 produces an incorrect result #142

Open · bobroute opened this issue 4 years ago

bobroute commented 4 years ago

Describe the bug
When I set export BYTEPS_ENABLE_ASYNC=1 and run the demo MNIST training, it produces an incorrect result like this:

```
BytePS: enable asynchronous training
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Create CheckpointSaverHook.
[12:06:44] src/customer.cc:368: Do not use thread pool for receiving.
[12:06:44] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[12:06:44] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[12:06:44] src/van.cc:357: Bind to role=worker, ip=10.0.0.1, port=21127, is_recovery=0
[12:06:44] src/./zmq_van.h:286: Start ZMQ recv thread
INFO:tensorflow:Graph was finalized.
2019-11-05 12:06:45.071764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531 pciBusID: 0000:04:00.0 totalMemory: 22.38GiB freeMemory: 22.13GiB
2019-11-05 12:06:45.071808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-05 12:06:45.071844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 12:06:45.071858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-11-05 12:06:45.071870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-11-05 12:06:45.072599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21532 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/van.cc:306: W[11] is connected to others
INFO:tensorflow:loss = 2.3320873, step = 0
INFO:tensorflow:loss = 2.3230207, step = 0
INFO:tensorflow:loss = 2.3025851, step = 10 (0.428 sec)
INFO:tensorflow:loss = 2.3025851, step = 10 (0.430 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.311 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.312 sec)
```
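For reference, this is roughly how the worker process was launched. It is only a minimal sketch assuming the stock launcher/launch.py and the example/tensorflow/tensorflow_mnist.py script from this repository; the worker count, server count, scheduler port, and worker ID below are placeholders, not the exact values from my setup:

```bash
# Worker side (one GPU per worker in this test).
# Assumption: the scheduler and server processes are launched the same way,
# with DMLC_ROLE=scheduler / DMLC_ROLE=server and BYTEPS_ENABLE_ASYNC=1 exported there as well.
export BYTEPS_ENABLE_ASYNC=1           # the setting that triggers the problem
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ROLE=worker
export DMLC_WORKER_ID=0                # placeholder: 0..N-1 across workers
export DMLC_NUM_WORKER=2               # placeholder worker count
export DMLC_NUM_SERVER=1               # placeholder server count
export DMLC_PS_ROOT_URI=10.0.0.1       # scheduler IP (matches the log above)
export DMLC_PS_ROOT_PORT=1234          # placeholder scheduler port

python launcher/launch.py \
    python example/tensorflow/tensorflow_mnist.py
```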

It looks like the loss is incorrect: after the first step it stays at 2.3025851 and never changes. Can someone explain this?

ymjiang commented 4 years ago

Thank you for reporting this. We will try to reproduce and figure it out.