Closed Nqabz closed 7 years ago
Hi, thanks for your question! I will be able to address this question in a couple of days, and will do so at my first opportunity. Sorry for the delay!
Adam thanks for note!
Here is an update on my question: I installed NCCL 1.3.4 and the code seems to compile, but now I am getting the following error from synkhronos (still running the lasagne_mnist.py example):
Starting training...
Traceback (most recent call last):
File "/home/model_params/synkhronos/demos/lasagne_mnist.py", line 396, in <module>
main(**kwargs)
File "/home/model_params/synkhronos/demos/lasagne_mnist.py", line 326, in main
train_err += train_fn(X_train_synk, y_train_synk, batch=batch)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/function_module.py", line 487, in __call__
self._share_input_data(ordered_inputs, batch, batch_s)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/function_module.py", line 367, in _share_input_data
scatterer.assign_inputs(synk_inputs, batch, self._n_scat)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/scatterer.py", line 90, in assign_inputs
batch = check_batch_types(batch)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/scatterer.py", line 145, in check_batch_types
if not np.issubdtype(batch, int):
File "/opt/conda/lib/python3.5/site-packages/numpy/core/numerictypes.py", line 761, in issubdtype
return issubclass(dtype(arg1).type, val)
TypeError: data type not understood
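For context, the TypeError comes from np.issubdtype receiving a first argument that numpy cannot interpret as a dtype. A minimal numpy-only reproduction (independent of synkhronos; the slice here is just a stand-in for whatever batch object was passed):

```python
import numpy as np

# np.issubdtype expects a dtype-like first argument (a scalar type,
# a dtype object, or a dtype string) -- not an arbitrary Python object.
assert np.issubdtype(np.int64, np.integer)
assert np.issubdtype(np.dtype('int32'), np.integer)

# Passing something numpy cannot interpret as a dtype -- e.g. a slice,
# as a `batch` argument might be -- raises TypeError (reported as
# "data type not understood" in the traceback above).
try:
    np.issubdtype(slice(0, 5), np.integer)
    raised = False
except TypeError:
    raised = True
assert raised
```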
Thanks,
Nqabz
This error should be fixed with the latest push.
I will review the lasagne-mnist demo and might update it soon. :)
Great. It will be appreciated if you can update the lasagne-mnist demo too. I will check out the latest push for the other demos and let you know.
OK I just updated the lasagne_mnist demos and they should all be working. While in your synkhronos directory, use "demos/run_lasagne_demos.py" to get a quick view of the speedup. You might need to use a large batch size for good speedup.
Then go into the demo files and see how to set it up. :)
Adam, do you mind if I ask how you built your Python 3 and Theano to work with the latest NCCL toolkit? Are you using NCCL v2?
Are you able to share your DGX container's dockerfile, minus any proprietary packages that you may be using?
Thanks so much for updating lasagne-mnist; I will test it tomorrow.
The exact Theano version I'm using is: '0.9.0.dev-e79c4e4c83c5a4907ea7fddf073fd2d659df7486' I have maybe one or two things I've tweaked for convenience but nothing for functionality.
I'm using NCCL 1.3.2 on a 2-GPU workstation. I also have v1 on a DGX-1, no v2 yet.
I haven't pulled Theano in a long time, but I'll do that soon, as I think some fixes in 0.10 will help with bugs found through here. :)
Also, I haven't set up a DGX docker file; I've just been running directly on the machine.
OK, it now makes sense when you mention that you are using NCCL 1.3.2. I went for v2, and the API has changed quite a lot. With the API changes in v2, the pygpu package is acting out :(.
I will write back to this thread once I have some answers. I have reached out to NVIDIA Enterprise support for this.
Thanks so much for the great work!
I might look at contributing the docker-file if it can be of use to someone.
I did a quick test of the new code (pushed on August 23) using a docker container.
Building and compiling is quick. Distribution of functions completes in ~43 s for the cnn model. Thereafter the training does not start; it seems to wait indefinitely and the code hangs. Is it supposed to wait for more than 10 minutes?
Comparing with the old code on the very same 'cnn' model: distribution of functions takes 33 s, and thereafter (in less than a second) training begins on all 8 GPUs.
I am not sure what to make of these discrepancies.
That sounds like a problem! I've been testing on my 2-GPU machine; I should be able to get back on the DGX in a day or so, and will look at this right away. You could try running with synk.fork(n_gpus=2) and see what happens.
In the meantime, a quick hack to speed up the distribution is to change Theano/theano/gpuarray/dnn.py, maybe somewhere around line 275, where it says version.v = None, and change it to say version.v = 6020 or whatever version of cudnn you have. This is how I run. The long synk.distribute() time for 8 GPUs comes from them all fighting for the compile lock while figuring out the cudnn version.
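A sketch of the edit just described (shown as a non-runnable fragment; the line number and surrounding code come from the comment above and are not verified against any particular Theano commit):

```python
# Theano/theano/gpuarray/dnn.py, near line 275 (per the comment above).

# Before -- Theano probes the installed cuDNN at compile time, and
# 8 workers doing this simultaneously contend on the compile lock:
#     version.v = None

# After -- hard-code the cuDNN version actually installed,
# e.g. 6020 for cuDNN 6.0.20 (substitute your own version number):
#     version.v = 6020
```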
Adam thanks for the tips.
I can confirm that the "lasagne_mnist_gpu_data.py" "cnn" model gives a clean run with synk.fork(2), synk.fork(3), ..., synk.fork(7). With synk.fork(8) it hangs after:
:Mapped name None to device cuda7: Tesla P100-SXM2-16GB (0000:8A:00.0)
Synkhronos: 8 GPUs initialized, master rank: 0
Building model and compiling functions...
Synkhronos distributing functions...
...distribution complete (42 s).
Could this be because the code needs to reserve one GPU to fork the other subprocesses on the remaining GPUs? I thought forking was happening from the CPU. Very strangely, with synk.fork(8) the code does train to completion for the "mlp" model.
I am able to recreate this problem, and I've pinned it to a faulty socket connection inside the cpu-comm unit (which uses ZeroMQ), used in scatter(). Mysterious behavior, as it doesn't happen every time, but it is always the last GPU.
An equally valid alternative that avoids the ZeroMQ-based cpu-comm unit is to first build a synk data object and pass that to the scatter command. For example, replace synk.scatter(x_var, some_numpy_array) with data = synk.data(some_numpy_array); synk.scatter(x_var, data).
I'll keep you posted as I figure out what's going on.
OK found it. There was a tiny typo breaking the ZeroMQ socket connection of the last worker. Pushed the fix already, should be working now.
Thanks for finding that!
Adam, thanks for getting back to this. I did a quick test of the new push. My run still gets stuck in ZeroMQ for the 'cnn' model. Did you check with the 'cnn' model as default? I see your push still has the 'mlp' as default.
Using cuDNN version 6021 on context None
Mapped name None to device cuda5: Tesla P100-SXM2-16GB (0000:86:00.0)
Synkhronos: 8 GPUs initialized, master rank: 0
Building model and compiling functions...
Synkhronos distributing functions...
...distribution complete (41 s).
Scattering data to GPUs.
Strangely, the script 'lasagne_mnist_cpu_data.py' runs on all 8 GPUs while 'lasagne_mnist_gpu_data.py' runs on 7 GPUs. Internally the two scripts look identical.
Earlier you suggested the test "demos/run_lasagne_demos.py". How do you differentiate between the two scripts, 'lasagne_mnist_cpu_data.py' and 'lasagne_mnist_gpu_data.py', given that they both fork and bind to GPUs?
Yes, mine runs with the cnn model, all 8 GPUs. Can you check that in your synkhronos/comm.py, line 64, it now says:
pub_port = pub_socket.bind_to_random_port(
Previously it was:
pub_port = socket.bind_to_random_port(
This should be all that is needed to fix. If not, please let me know and we'll reopen!
The lasagne_mnist_cpu_data.py script does not use the ZeroMQ-based communication.
The difference between the scripts is that in the cpu_data one, all the data is held on the CPU and sent to the GPU at each function call. The gpu_data script puts all the data on the GPUs ahead of time and simply sends the GPUs a set of random indexes at each function call. For a large enough batch size, you should see some speedup in gpu_data.
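The two feeding styles can be caricatured with plain numpy (illustration only; the real scripts use synkhronos functions with Theano shared variables, and the arrays here merely stand in for GPU memory):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 4).astype('float32')
mean_fn = lambda batch: batch.mean()   # stand-in for a compiled function

# cpu_data pattern: the minibatch itself is shipped at every call.
def call_cpu_style(fn, X, idxs):
    return fn(X[idxs])            # data crosses the CPU->GPU boundary here

# gpu_data pattern: data is staged once; each call ships only indexes.
staged_X = X.copy()               # stand-in for data already on the GPUs
def call_gpu_style(fn, idxs):
    return fn(staged_X[idxs])     # only idxs crosses the boundary

idxs = np.arange(10)
assert np.isclose(call_cpu_style(mean_fn, X, idxs),
                  call_gpu_style(mean_fn, idxs))
```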
Indeed it is set to pub_port = pub_socket.bind_to_random_port(
What version of ZeroMQ are you using? Perhaps I can check to match. Thanks for clarifying the difference between ..._cpu_data.py and ..._gpu_data.py.
You're welcome!
Hmm, this is interesting then... pyzmq 16.0.2 (the latest version in conda for Python 3.5). Let me switch over to another computer and I'll push a test I was using...
OK, there is now a test at tests/zeromq_test.py, which does the same thing as starting up ZeroMQ in synkhronos (but the test does not call anything in synkhronos). Give that a try and let me know the result...it should run through 7 workers, have them all receive a test string, and then exit.
Edit: you can also try tests/cpu_comm_test.py, which does use synkhronos but is a much simpler test than the lasagne demo.
I just ran tests/cpu_comm_test.py this morning. It appears to have completed successfully. Is the following in line with what you expected?
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
master had pair port 1 : 49836
master had pair port 2 : 50986
master had pair port 3 : 28296
master had pair port 4 : 63531
master had pair port 5 : 20008
master had pair port 6 : 57752
master had pair port 7 : 18118
skipping port idx 0
sending test string in loop, 1
1 connecting to port 49836
1 polling for test string
sending test string in loop, 2
1 result of poll: 1
1 attempting to receive string
1 passed recv test
2 connecting to port 50986
2 polling for test string
sending test string in loop, 3
2 result of poll: 1
2 attempting to receive string
2 passed recv test
3 connecting to port 28296
3 polling for test string
sending test string in loop, 4
3 result of poll: 1
3 attempting to receive string
3 passed recv test
4 connecting to port 63531
4 polling for test string
sending test string in loop, 5
4 result of poll: 1
4 attempting to receive string
4 passed recv test
5 connecting to port 20008
5 polling for test string
sending test string in loop, 6
5 result of poll: 1
5 attempting to receive string
5 passed recv test
6 connecting to port 57752
6 polling for test string
sending test string in loop, 7
6 result of poll: 1
6 attempting to receive string
6 passed recv test
7 connecting to port 18118
7 polling for test string
done with test string loop
7 result of poll: 1
7 attempting to receive string
7 passed recv test
Yes, that looks correct from zeromq_test.py. And cpu_comm_test.py should run and close without any output, but not hang.
If both of these work for you but lasagne_mnist_gpu_data.py still hangs when scattering data...then this is mysterious. Do you have your synkhronos pip installed as editable?
I just checked: cpu_comm_test.py hangs as well??
I do have synkhronos installed in a docker container; I'm not sure yet if it's editable. I will check. Is there something you would suggest that I change?
OK, I'm not sure how it works with docker, but my guess is that the container has not incorporated the updates to synkhronos?
When I run in the native OS (including in a conda env), you can do pip install -e . (with the period) from your local folder of the git repo. Then when you git pull, all the changes are applied without having to reinstall with pip.
Ummm... that might be the problem. In a day or two I will rebuild the container and retest.
ok sounds good, let me know the result.
also beware I just renamed the repository from "synkhronos" to "Synkhronos", in keeping with Theano, Lasagne, etc.
@astooke Both tests work after rebuilding the container. However, I have a follow-up. Not sure if I should open another issue.
It seems the script stalls at "Building model and compiling functions..." after a few seconds when running the current cnn model with my data:
Using cuDNN version 6021 on context None
Mapped name None to device cuda5: Tesla P100-SXM2-16GB (0000:86:00.0)
Synkhronos: 6 GPUs initialized, master rank: 0
Building model and compiling functions...
stops after compiling. My training data is of this size:
X_train is : (16000, 1, 144, 144)
y_train is : (16000,)
Seems the code does not get past the writing of data into shared memory:
# Write data into input shared memory
X_train_synk, y_train_synk = train_fn.build_inputs(X_train, y_train)
Never mind...found the culprit. I had to increase --shm-size to --shm-size="2048m".
I am aiming to test my large models on a cluster with more than 2000 nodes (each node has 4 P100 GPU cards). What are your experiences with scaling Synkhronos beyond one DGX-1 box?
There is a bug in your latest push for train_mnist_gpu_data.py; I think it's due to the change of folder structure and some typos.
grad_updates, param_updates, grad_shared = updates.nesterov_momentum(
loss, params, learning_rate=0.01, momentum=0.9)
learning_rate is not a keyword argument anymore in the imported instance of updates. The latest pushed code works based on a positional argument for the learning rate.
I have resorted to fixing my directory structure based on your previous push, and it's working fine.
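The breakage pattern here is the usual one for a keyword rename; a toy illustration (a stand-in function, not Lasagne's actual updates module):

```python
# Toy stand-in for the renamed signature: when a keyword changes (e.g.
# from lr to learning_rate), callers using the old name get a TypeError,
# while positional callers keep working across the rename.
def nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.9):
    return learning_rate, momentum    # stand-in body: echo the hyperparams

assert nesterov_momentum('loss', 'params', 0.05) == (0.05, 0.9)       # positional: OK
assert nesterov_momentum('loss', 'params', learning_rate=0.05)[0] == 0.05

try:
    nesterov_momentum('loss', 'params', lr=0.05)    # old keyword name
    failed = False
except TypeError:
    failed = True
assert failed
```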
Hmm, it is running for me, so maybe that was a temporary problem in the middle of a bunch of updates?
I had previously programmed the kwarg lr but changed it to learning_rate in keeping with Lasagne.
Thanks for being patient as things are changing rapidly...I hope this has settled out and I'm only doing documentation and typo fixes now.
Thanks for your tips and the great package.
What are your experiences with scaling Synkhronos beyond one DGX-1 box? I am looking at 2000 nodes (each with 4 P100 GPU cards) and training with more than 40 GB of image data.
Wow that is a lot of distributed compute! Starting a new issue (#12) for that discussion.
Is the issue with the demo fixed so we can close this?
Correct - let's close this issue. The demo works and has helped shape my current models. Thanks for opening the issue to discuss multinode support.
Looks like something is going wrong. The code terminates after printing "...distribution complete". I am running this on a DGX-1 container. Is this something expected when training within a DGX-1 container?