diux-dev / cluster

train on AWS

ImageNet: spikes in data_time for 16-machine version #42

Open yaroslavvb opened 6 years ago

yaroslavvb commented 6 years ago

The data_time statistic logged here spikes to 20 ms on the 16-machine run (cc @bearpelican): https://github.com/diux-dev/cluster/blob/13fb813d4a3a96a6c27bda1d5bee979d0d3d125e/pytorch/training/train_imagenet_nv.py#L380

http://35.174.18.63:6006/#scalars&tagFilter=times%2Fdata&runSelectionState=eyJlaWdodC1tYWNoaW5lcy12YSI6ZmFsc2UsIm9uZS1tYWNoaW5lcy12YSI6ZmFsc2UsImVpZ2h0LW1hY2hpbmVzLXZhMDIiOmZhbHNlLCJuY2NsLWVpZ2h0LW1hY2hpbmVzLWFyIjpmYWxzZSwibmNjbC1laWdodC1tYWNoaW5lcy1hci4yMDE4LTA3LTMwXzIzLTI0LTE0IjpmYWxzZSwib25lLW1hY2hpbmVzMTI4IjpmYWxzZSwiZGVsZXRlbWUiOmZhbHNlLCJ3ZWQtdHdvLW1hY2hpbmVzIjpmYWxzZSwiZGVsZXRlbWUuMDEiOmZhbHNlLCJ3ZWQtdHdvLW1hY2hpbmVzLjAxIjpmYWxzZSwicGlsbG93LW9uZSI6dHJ1ZSwicGlsbG93LXR3byI6ZmFsc2UsInBpbGxvdy1zaXh0ZWVuIjpmYWxzZSwicGlsbG93LXNpeHRlZW4uMDEiOmZhbHNlLCJwaWxsb3ctc2l4dGVlbi4wMiI6dHJ1ZX0%3D

[screenshot, 2018-08-01 21:32:30]
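
For anyone trying to reproduce this outside the full training script: a data-time statistic of this kind is typically the time the training loop spends blocked waiting for the next batch. Below is a minimal sketch of that measurement, not the code at the linked line. The names `timed_batches`, `find_spikes`, and `slow_loader` are hypothetical, the loader is a stand-in generator rather than a real PyTorch DataLoader, and the 20 ms threshold is simply the figure quoted above.

```python
import time


def timed_batches(loader):
    """Yield (batch, wait_seconds) for each item of an iterable loader.

    wait_seconds is how long the loop was blocked waiting for the batch,
    which is what a "data time" metric usually reports.
    """
    it = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def find_spikes(wait_times, threshold_s=0.020):
    """Return the iteration indices whose wait met or exceeded the threshold."""
    return [i for i, t in enumerate(wait_times) if t >= threshold_s]


def slow_loader(n=5, slow_at=3, delay_s=0.05):
    """Hypothetical stand-in for a data loader that stalls on one batch."""
    for i in range(n):
        if i == slow_at:
            time.sleep(delay_s)  # simulate a slow read or a starved worker
        yield i


# Collect per-iteration wait times and report which iterations stalled.
waits = [wait for _batch, wait in timed_batches(slow_loader())]
print("iterations over 20 ms:", find_spikes(waits))
```

Logging the per-iteration wait separately on each machine, rather than only an average, should make it easier to tell whether the spikes come from a single slow worker or appear on all 16 machines at once.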