[Open] xingwangsfu opened this issue 5 years ago
Carefully check data_format in all the Keras convolutions; the code actually doesn't support channels_first.
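A minimal sketch (not the repo's code; conv_bn is a hypothetical helper) of the point above: the data_format argument of every Conv2D must agree with the input layout and with BatchNormalization's axis, and the code as written only behaves as expected with channels_last (NHWC) inputs.

```python
import tensorflow as tf

def conv_bn(inputs, filters, kernel_size, data_format='channels_last'):
    """Conv + BN where both layers must agree on data_format."""
    axis = 1 if data_format == 'channels_first' else -1
    x = tf.keras.layers.Conv2D(
        filters, kernel_size, padding='same',
        data_format=data_format, use_bias=False)(inputs)
    return tf.keras.layers.BatchNormalization(axis=axis)(x)

# channels_last: NHWC input, e.g. (batch, 224, 224, 3)
images = tf.keras.Input(shape=(224, 224, 3))
out = conv_bn(images, filters=32, kernel_size=3)
```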
Step 6255 is where warmup ends, and after that the dropout rate becomes 100. My loss also diverges after step 6255. Have you solved this problem?
Any updates?
I'm also interested to know whether anyone has reproduced the results at all (on multi-GPU or on TPU).
From what I understand, it is normal for the total loss to become very high, because the runtime loss is only included after the warmup stage: code
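A minimal sketch of that behavior, assuming the total loss is the cross-entropy plus a lambda-weighted runtime term (the names ce_loss, runtime_term, warmup_steps and runtime_lambda_val are illustrative, not the repo's exact code):

```python
import tensorflow as tf

def total_loss(ce_loss, runtime_term, global_step,
               warmup_steps=6255, runtime_lambda_val=0.02):
    # Mask that switches on only once warmup has finished.
    past_warmup = tf.cast(global_step >= warmup_steps, tf.float32)
    # Before warmup the total equals the cross-entropy; afterwards the
    # lambda-weighted runtime term is added, so the reported loss jumps.
    return ce_loss + past_warmup * runtime_lambda_val * runtime_term
```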
@xingwangsfu Hi, I replaced 'tf.contrib.tpu.TPUEstimatorSpec' with 'tf.estimator.EstimatorSpec', but I found that the latter does not have a 'host_call' argument. How should I handle this? Many thanks.
Hello, have you solved this problem?
@marsggbo "host_call" can be ignored by set "skip_host_call" to be True.
@xingwangsfu, did you fix the issue? In the paper, Fig. 4 (left) shows the CE loss going from about 7.0 down to about 3.0, but it doesn't mention the total loss. I get the same result as you, so such a large loss (>75) means convergence fails. I reduced the loss from about 75 to about 15 by changing "runtime_lambda_val" from 0.02 to 0.002, but it is still large. Besides, what about evaluation accuracy during the training process? I get less than 1%.
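A rough sanity check, under the assumption that the total loss is CE plus lambda times a runtime term (the numbers below are just the values reported in this thread): the drop from ~75 to ~15 when lambda goes from 0.02 to 0.002 is roughly what that decomposition predicts, so the "large" total loss may mostly be the runtime term rather than a failure of the CE loss itself.

```python
# Back-of-envelope check (assumes total = CE + lambda * runtime_term):
ce = 7.0                    # roughly where CE sits right after warmup
total_at_002 = 75.0         # observed total with runtime_lambda_val = 0.02
runtime_term = (total_at_002 - ce) / 0.02    # implied runtime term ~ 3400
total_at_0002 = ce + 0.002 * runtime_term    # ~ 13.8, close to the ~15 seen
print(runtime_term, total_at_0002)
```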
Hello, I hit some exceptions when I run the code. Could you give me some advice about this dropout issue? Many thanks! Question link: https://github.com/dstamoulis/single-path-nas/issues/11#issuecomment-549108585
I did nothing to the code except replace the TPU setup with multi-GPU, and my training gets stuck at a very high loss. I assume this is caused by the large learning rate after warm-start, since my loss is normal before iteration 6255, which is the number of warm-start iterations.
Here is my training log:
I0618 17:29:12.284162 140720170149632 tf_logging.py:115] global_step/sec: 1.26958
I0618 17:29:12.284616 140720170149632 tf_logging.py:115] loss = 6.7994366, step = 5760 (50.410 sec)
I0618 17:30:03.805228 140720170149632 tf_logging.py:115] global_step/sec: 1.24221
I0618 17:30:03.805763 140720170149632 tf_logging.py:115] loss = 6.8188353, step = 5824 (51.521 sec)
I0618 17:30:54.878273 140720170149632 tf_logging.py:115] global_step/sec: 1.25311
I0618 17:30:54.878723 140720170149632 tf_logging.py:115] loss = 6.8027916, step = 5888 (51.073 sec)
I0618 17:31:46.115418 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:31:46.144254 140720170149632 tf_logging.py:115] loss = 6.8010216, step = 5952 (51.266 sec)
I0618 17:32:37.585529 140720170149632 tf_logging.py:115] global_step/sec: 1.24344
I0618 17:32:37.585925 140720170149632 tf_logging.py:115] loss = 6.789137, step = 6016 (51.442 sec)
I0618 17:33:28.797896 140720170149632 tf_logging.py:115] global_step/sec: 1.2497
I0618 17:33:28.798456 140720170149632 tf_logging.py:115] loss = 6.799903, step = 6080 (51.213 sec)
I0618 17:34:19.681088 140720170149632 tf_logging.py:115] global_step/sec: 1.25778
I0618 17:34:19.681564 140720170149632 tf_logging.py:115] loss = 6.803883, step = 6144 (50.883 sec)
I0618 17:35:09.831330 140720170149632 tf_logging.py:115] global_step/sec: 1.27617
I0618 17:35:09.831943 140720170149632 tf_logging.py:115] loss = 6.7922, step = 6208 (50.150 sec)
I0618 17:35:46.901006 140720170149632 tf_logging.py:115] Saving checkpoints for 6255 into /mnt/cephfs_wj/cv/wangxing/tmp/model-single-path-search/lambda-val-0.020/model.ckpt.
I0618 17:36:07.512706 140720170149632 tf_logging.py:115] global_step/sec: 1.10954
I0618 17:36:07.513106 140720170149632 tf_logging.py:115] loss = 94.23678, step = 6272 (57.681 sec)
I0618 17:36:57.975293 140720170149632 tf_logging.py:115] global_step/sec: 1.26827
I0618 17:36:57.975636 140720170149632 tf_logging.py:115] loss = 84.60893, step = 6336 (50.463 sec)
I0618 17:37:49.209366 140720170149632 tf_logging.py:115] global_step/sec: 1.24917
I0618 17:37:49.210039 140720170149632 tf_logging.py:115] loss = 83.81077, step = 6400 (51.234 sec)
I0618 17:38:40.446595 140720170149632 tf_logging.py:115] global_step/sec: 1.24909
I0618 17:38:40.447212 140720170149632 tf_logging.py:115] loss = 83.7096, step = 6464 (51.237 sec)
I0618 17:39:31.800470 140720170149632 tf_logging.py:115] global_step/sec: 1.24625
I0618 17:39:31.800811 140720170149632 tf_logging.py:115] loss = 75.41687, step = 6528 (51.354 sec)
I0618 17:40:22.979326 140720170149632 tf_logging.py:115] global_step/sec: 1.25052
I0618 17:40:22.979668 140720170149632 tf_logging.py:115] loss = 75.42241, step = 6592 (51.179 sec)
I0618 17:41:14.112971 140720170149632 tf_logging.py:115] global_step/sec: 1.25162
I0618 17:41:14.137188 140720170149632 tf_logging.py:115] loss = 75.344826, step = 6656 (51.157 sec)
I0618 17:42:05.177355 140720170149632 tf_logging.py:115] global_step/sec: 1.25332
I0618 17:42:05.177694 140720170149632 tf_logging.py:115] loss = 75.358315, step = 6720 (51.041 sec)
I0618 17:42:56.014090 140720170149632 tf_logging.py:115] global_step/sec: 1.25893
I0618 17:42:56.014433 140720170149632 tf_logging.py:115] loss = 75.37303, step = 6784 (50.837 sec)
I0618 17:43:47.115759 140720170149632 tf_logging.py:115] global_step/sec: 1.25241
I0618 17:43:47.116162 140720170149632 tf_logging.py:115] loss = 75.35231, step = 6848 (51.102 sec)
I0618 17:44:38.047000 140720170149632 tf_logging.py:115] global_step/sec: 1.2566
I0618 17:44:38.047545 140720170149632 tf_logging.py:115] loss = 75.34932, step = 6912 (50.931 sec)
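If the learning-rate hypothesis above is what you want to test, a minimal sketch of the usual linear-scaling heuristic follows (the function name and the batch-size numbers are illustrative, not the repo's configuration): learning rates tuned for a large TPU global batch are usually shrunk in proportion when the multi-GPU global batch is smaller.

```python
def scaled_learning_rate(base_lr, tpu_batch_size, gpu_global_batch_size):
    """Linear scaling rule: learning rate proportional to global batch size."""
    return base_lr * gpu_global_batch_size / tpu_batch_size

# e.g. a base LR tuned for a TPU batch of 1024, run on 8 GPUs x 32 images each:
lr = scaled_learning_rate(base_lr=0.016, tpu_batch_size=1024,
                          gpu_global_batch_size=8 * 32)
```

That said, given the comments above, the jump at step 6255 may simply be the runtime loss being switched on after warmup rather than a learning-rate problem.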