bytedance / byteps

A high performance and generic framework for distributed DNN training

BytePS w/ MXNet doesn't work w/o docker container #222

Closed access2rohit closed 4 years ago

access2rohit commented 4 years ago

**Describe the bug**
If I try to run on a bare machine (without the Docker container), I cannot run the MXNet example: https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md#mxnet

But if I use the container provided then I am able to run the example.

**To Reproduce**
Steps to reproduce the behavior:

  1. Ubuntu 16.04 DLAMI on an EC2 p2.8xlarge instance (8 K80 GPUs)
  2. pip install mxnet-cu100mkl
  3. pip install byteps==0.2.0
  4. git clone --recursive https://github.com/bytedance/byteps.git ~/byteps
  5. Run the following commands in a shell:

export NVIDIA_VISIBLE_DEVICES=0,1,2,3  # gpus list
export DMLC_WORKER_ID=0  # your worker id
export DMLC_NUM_WORKER=1  # one worker
export DMLC_ROLE=worker

export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1
export DMLC_PS_ROOT_PORT=1234

bpslaunch python3 ~/byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32

  6. See the error shown in the logs below.
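As a quick sanity check before launching, the variables exported in step 5 can be verified from Python. This is only a minimal sketch: `REQUIRED_VARS` and `missing_vars` are hypothetical names made up for illustration, not part of BytePS, and the list simply mirrors the exports in the steps above rather than any official list of required settings.

```python
import os

# Names taken from the export commands in the reproduction steps.
# This tuple is an illustration, not an official BytePS requirement list.
REQUIRED_VARS = (
    "NVIDIA_VISIBLE_DEVICES",
    "DMLC_WORKER_ID",
    "DMLC_NUM_WORKER",
    "DMLC_ROLE",
    "DMLC_NUM_SERVER",
    "DMLC_PS_ROOT_URI",
    "DMLC_PS_ROOT_PORT",
)


def missing_vars(env=None):
    """Return the names in REQUIRED_VARS that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All variables from the reproduction steps are set.")
```

Running it in the same shell that will call `bpslaunch` shows whether every export actually took effect.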

**Expected behavior**
The MXNet example should run successfully.

**Logs**

(mx_byteps) ubuntu@ip-172-31-85-4:~$ bpslaunch python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
BytePS launching worker
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:Launch BytePS process on GPU-2
learning rate from lr_scheduler has been overwritten by learning_rate in optimizer.
INFO:root:Launch BytePS process on GPU-0
INFO:root:Launch BytePS process on GPU-1
learning rate from lr_scheduler has been overwritten by learning_rate in optimizer.
learning rate from lr_scheduler has been overwritten by learning_rate in optimizer. environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '2', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOTPORT': '1234', '': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': 
'/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============2 INFO:root:Launch BytePS process on GPU-3 environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '1', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': 
'/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOTPORT': '1234', '': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': '/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': 
'/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============1 learning rate from lr_scheduler has been overwritten by learning_rate in optimizer. environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '0', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOTPORT': '1234', '': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 
'DMLC_NUM_WORKER': '1', 'PATH': '/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:', 'CONDA_PYTHON_EXE': 
'/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============0 environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '3', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '76.126.245.87 59732 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': 
'/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/1', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOTPORT': '1234', '': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '3', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': '/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '76.126.245.87 59732 172.31.85.4 22', 'PYTHONPATH': 
'/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============3

Segmentation fault: 11

Segmentation fault: 11

Stack trace: [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fe075c25100] [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fe1021c34b0] [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7fe102561d44] [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fe07531a737] [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fe07531d863] [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fe075313551] [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fe075279a67] [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fdff9762970] [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fe1012dfec0] Stack trace: [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f3556934100] [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f35e2ed24b0] [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f35e3270d44] [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f3556029737] [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f355602c863] [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f3556022551] [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f3555f88a67] [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f34e1762970] [bt] (8) 
/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f35e1feeec0] [2020-03-16 21:41:59*** Error in `.956268: F byteps/common/core_loops.cc:299] Check failed: r == ncclSuccess NCCL error: unhandled cuda error

Segmentation fault: 11

Stack trace: [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fefa3442100] [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7ff02f9e04b0] [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7ff02fd7ed44] [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fefa2b37737] [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fefa2b3a863] [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fefa2b30551] [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fefa2a96a67] [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fef2d762970] [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7ff02eafcec0]

Segmentation fault: 11

Segmentation fault: 11

Stack trace: [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100] [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0] [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f15832b5d44] [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f14f606e737] [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f14f6071863] [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551] [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f14f5fcda67] [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f1481762970] [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f1582033ec0]

Segmentation fault: 11

Segmentation fault: 11

Stack trace: [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100] [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0] [bt] (2) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389f261) [0x7f14f6070261] [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0611) [0x7f14f6071611] [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551] [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38974a4) [0x7f14f60684a4] [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f14f629056a] [bt] (7) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6d860a) [0x7f14f2ea960a] [bt] (8) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ab7101) [0x7f14f6288101] Stack trace: [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f14f6979100] [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f1582f174b0] [bt] (2) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389f261) [0x7f14f6070261] [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0611) [0x7f14f6071611] [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f14f6067551] [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38974a4) [0x7f14f60684a4] [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f14f629056a] [bt] (7) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6d860a) [0x7f14f2ea960a] [bt] (8) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ab7101) [0x7f14f6288101] Aborted (core dumped) Exception in thread Thread-4: Traceback (most recent call last): File 
"/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner self.run() File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run self._target(*self._args, **self._kwargs) File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True) File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 134.

Segmentation fault (core dumped)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.

Segmentation fault (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.

Segmentation fault (core dumped)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.
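
A note on the exit statuses in the tracebacks above: `bpslaunch` runs the command with `shell=True`, and a shell conventionally reports a child killed by a signal as `128 + signal number`. The following is a minimal Python sketch of that decoding, assuming a POSIX system; the helper name `describe_exit_status` is made up for illustration and is not part of BytePS.

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status.

    Values above 128 are interpreted as "killed by signal (status - 128)",
    which is a shell convention rather than a guarantee.
    """
    if status > 128:
        try:
            return "terminated by " + signal.Signals(status - 128).name
        except ValueError:
            return "exit status %d (no matching signal)" % status
    return "exited with status %d" % status


# The two statuses that appear in the log above.
for status in (134, 139):
    print(status, "->", describe_exit_status(status))
```

Under that convention 134 maps to signal 6 (SIGABRT) and 139 to signal 11 (SIGSEGV), which matches the `Segmentation fault (core dumped)` messages.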



**Environment (please complete the following information):**
 - OS: Ubuntu 16.04
 - GCC version: 5.4.0
 - CUDA and NCCL version: CUDA 10.0 and NCCL 2.4.7
 - Framework (TF, PyTorch, MXNet): MXNet 
ymjiang commented 4 years ago

Can you try pip3 install byteps==0.2.2? Which MXNet version are you using?

access2rohit commented 4 years ago

MXNet 1.6.0 (cu100-mkl). python and pip are linked to Python 3.6 by default.
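
For anyone reproducing this, the installed versions can also be read from Python instead of being typed from memory. This is only a sketch: the helper name `installed_version` is hypothetical, and the package names are the ones from the `pip install` steps in the report.

```python
def installed_version(package):
    """Return the installed version of a pip package, or None if it is absent."""
    try:
        # Standard library on Python 3.8 and newer.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for older interpreters such as the Python 3.6 used here.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the reproduction steps; None means "not installed".
for name in ("byteps", "mxnet-cu100mkl"):
    print(name, installed_version(name))
```

Running it inside the same conda environment that `bpslaunch` uses shows whether the interpreter picks up the versions you expect.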

bobzhuyb commented 4 years ago

@access2rohit Can you try byteps==0.2.2? It has better compatibility with various versions of gcc.

You can compare your installation steps with those in the dockerfile. https://github.com/bytedance/byteps/blob/master/docker/Dockerfile#L46

access2rohit commented 4 years ago

With byteps==0.2.2 I get the following error:

*** Error in `python': double free or corruption (!prev): 0x0000558d421beda0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f14b56fc7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f14b570537a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f14b570953c]
/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(_ZN6byteps6common11NcclManagerD1Ev+0xd9)[0x7f13b1752f89]
/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x35996)[0x7f13b1735996]
/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(_ZN6byteps6common15RunSyncNcclOnceEv+0x4e)[0x7f13b173fa3e]
/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(_ZN6byteps6common12SyncNcclLoopEv+0x3d)[0x7f13b173fc5d]
/home/ubuntu/anaconda3/envs/mx_byteps/bin/../lib/libstdc++.so.6(+0xc8421)[0x7f14b0a6a421]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f14b5a566ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f14b578c41d]
======= Memory map: ========
200000000-200100000 rw-s 00000000 00:06 324                              /dev/nvidiactl
200100000-200300000 rw-s 00000000 00:06 324                              /dev/nvidiactl
200300000-202b00000 rw-s 00000000 00:06 324                              /dev/nvidiactl
202b00000-202c00000 rw-s 00000000 00:05 35291                            /dev/zero (deleted)
202c00000-202d00000 rw-s 00000000 00:06 324                              /dev/nvidiactl
202d00000-202e00000 rw-s 00000000 00:05 35292                            /dev/zero (deleted)
202e00000-202f00000 rw-s 00000000 00:06 324                              /dev/nvidiactl
202f00000-202f20000 ---p 00000000 00:00 0
202f20000-203020000 rw-s 00000000 00:06 324                              /dev/nvidiactl
203020000-2030a2000 rw-s 00000000 00:06 324                              /dev/nvidiactl
2030a2000-2030c0000 ---p 00000000 00:00 0
2030c0000-2031c0000 rw-s 00000000 00:05 35294                            /dev/zero (deleted)
2031c0000-2032c0000 rw-s 00000000 00:05 29045                            /dev/zero (deleted)
2032c0000-203c80000 ---p 00000000 00:00 0
203c80000-204ee0000 rw-s 00000000 00:05 29062                            /dev/zero (deleted)
204ee0000-204fe0000 rw-s 00000000 00:05 29063                            /dev/zero (deleted)
204fe0000-9400000000 ---p 00000000 00:00 0
558d3f833000-558d3f88a000 r--p 00000000 ca:01 13826841                   /home/ubuntu/anaconda3/envs/mx_byteps/bin/python3.6
558d3f88a000-558d3fa50000 r-xp 00057000 ca:01 13826841                   /home/ubuntu/anaconda3/envs/mx_byteps/bin/python3.6
558d3fa50000-558d3faed000 r--p 0021d000 ca:01 13826841                   /home/ubuntu/anaconda3/envs/mx_byteps/bin/python3.6
558d3faee000-558d3faf1000 r--p 002ba000 ca:01 13826841                   /home/ubuntu/anaconda3/envs/mx_byteps/bin/python3.6
558d3faf1000-558d3fb54000 rw-p 002bd000 ca:01 13826841                   /home/ubuntu/anaconda3/envs/mx_byteps/bin/python3.6
558d3fb54000-558d3fb85000 rw-p 00000000 00:00 0
558d40b7d000-558d7585d000 rw-p 00000000 00:00 0                          [heap]
558d7585d000-558d76abd000 rw-p 00000000 00:00 0                          [heap]
558d76abd000-558d7b7fe000 rw-p 00000000 00:00 0                          [heap]
7f1364000000-7f136c000000 ---p 00000000 00:00 0
7f136c000000-7f136c02b000 rw-p 00000000 00:00 0
7f136c02b000-7f1370000000 ---p 00000000 00:00 0
7f1370000000-7f137002b000 rw-p 00000000 00:00 0
7f137002b000-7f1374000000 ---p 00000000 00:00 0
7f1374000000-7f1376a0e000 rw-p 00000000 00:00 0
7f1376a0e000-7f1378000000 ---p 00000000 00:00 0
7f13797fb000-7f13797fc000 ---p 00000000 00:00 0
7f13797fc000-7f1379ffc000 rwxp 00000000 00:00 0
7f1379ffc000-7f1379ffd000 ---p 00000000 00:00 0
7f1379ffd000-7f137a7fd000 rwxp 00000000 00:00 0
7f137a7fd000-7f137a7fe000 ---p 00000000 00:00 0
7f137a7fe000-7f137affe000 rwxp 00000000 00:00 0
7f137affe000-7f137afff000 ---p 00000000 00:00 0
7f137afff000-7f137b7ff000 rwxp 00000000 00:00 0
7f137b7ff000-7f137b800000 ---p 00000000 00:00 0
7f137b800000-7f137c000000 rwxp 00000000 00:00 0
7f137c000000-7f137ff2a000 rw-p 00000000 00:00 0
7f137ff2a000-7f1380000000 ---p 00000000 00:00 0
7f13807f9000-7f13807fa000 ---p 00000000 00:00 0
7f13807fa000-7f1380ffa000 rwxp 00000000 00:00 0
7f1380ffa000-7f1380ffb000 ---p 00000000 00:00 0
7f1380ffb000-7f13817fb000 rwxp 00000000 00:00 0
7f13817fb000-7f13817fc000 ---p 00000000 00:00 0
7f13817fc000-7f1381ffc000 rwxp 00000000 00:00 0
7f1381ffc000-7f1381ffd000 ---p 00000000 00:00 0
7f1381ffd000-7f13827fd000 rwxp 00000000 00:00 0
7f13827fd000-7f13827fe000 ---p 00000000 00:00 0
7f13827fe000-7f1382ffe000 rwxp 00000000 00:00 0
7f1382ffe000-7f1382fff000 ---p 00000000 00:00 0
7f1382fff000-7f13837ff000 rwxp 00000000 00:00 0
7f13837ff000-7f1383800000 ---p 00000000 00:00 0
7f1383800000-7f1384000000 rwxp 00000000 00:00 0
7f1384000000-7f1384085000 rw-p 00000000 00:00 0
7f1384085000-7f1388000000 ---p 00000000 00:00 0
7f1388000000-7f138c000000 rw-p 00000000 00:00 0
7f138c000000-7f138c021000 rw-p 00000000 00:00 0
7f138c021000-7f1390000000 ---p 00000000 00:00 0
7f1390000000-7f1393c8c000 rw-p 00000000 00:00 0
7f1393c8c000-7f1394000000 ---p 00000000 00:00 0
7f1394000000-7f1394021000 rw-p 00000000 00:00 0
7f1394021000-7f1398000000 ---p 00000000 00:00 0
7f1398000000-7f1398021000 rw-p 00000000 00:00 0
7f1398021000-7f139c000000 ---p 00000000 00:00 0
7f139c76d000-7f139c76e000 ---p 00000000 00:00 0
7f139c76e000-7f139cf6e000 rwxp 00000000 00:00 0
7f139cf6e000-7f139cf6f000 ---p 00000000 00:00 0
7f139cf6f000-7f139d76f000 rwxp 00000000 00:00 0
7f139d76f000-7f139d770000 ---p 00000000 00:00 0
7f139d770000-7f139df70000 rwxp 00000000 00:00 0
7f139df70000-7f139df71000 ---p 00000000 00:00 0
7f139df71000-7f139e771000 rwxp 00000000 00:00 0
7f139e771000-7f13a0000000 rw-p 00000000 00:00 0
7f13a0000000-7f13a0001000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0001000-7f13a0002000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0002000-7f13a0003000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0003000-7f13a0004000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0004000-7f13a0005000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0005000-7f13a0006000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0006000-7f13a0007000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0007000-7f13a0008000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0008000-7f13a0009000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0009000-7f13a000a000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a000a000-7f13a000b000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a000b000-7f13a000c000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a000c000-7f13a000d000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a000d000-7f13a000e000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a000e000-7f13a000f000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a000f000-7f13a0010000 rw-s 00000000 00:06 330                        /dev/nvidia1
7f13a0010000-7f13b0000000 ---p 00000000 00:00 0
7f13b06fe000-7f13b06ff000 ---p 00000000 00:00 0
7f13b06ff000-7f13b0eff000 rwxp 00000000 00:00 0
7f13b0eff000-7f13b0f00000 ---p 00000000 00:00 0
7f13b0f00000-7f13b1700000 rwxp 00000000 00:00 0
7f13b1700000-7f13b1728000 r--p 00000000 ca:01 13827015                   /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so
7f13b1728000-7f13b1833000 r-xp 00028000 ca:01 13827015                   /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so
7f13b1833000-7f13b7fef000 r--p 00133000 ca:01 13827015                   /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so
7f13b7fef000-7f13b7ff6000 r--p 068ee000 ca:01 13827015                   /home/ubuntu/anaconda3/envs/mx_byteps/lib/pyth
Segmentation fault: 11

Segmentation fault: 11

on3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so
7f13b7ff6000-7f13b7ff9000 rw-p 068f5000 ca:01 13827015                   /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so
7f13b7ff9000-7f13b8000000 rw-p 00000000 00:00 0
7f13b8000000-7f13b802b000 rw-p 00000000 00:00 0
7f13b802b000-7f13bc000000 ---p 00000000 00:00 0
7f13bc000000-7f13bc02b000 rw-p 00000000 00:00 0
7f13bc02b000-7f13c0000000 ---p 00000000 00:00 0
7f13c0000000-7f13c002b000 rw-p 00000000 00:00 0
7f13c002b000-7f13c4000000 ---p 00000000 00:00 0
7f13c462d000-7f13c462e000 ---p 00000000 00:00 0
7f13c462e000-7f13c4e2e000 rwxp 00000000 00:00 0
7f13c4e2e000-7f13c4e2f000 ---p 00000000 00:00 0
7f13c4e2f000-7f13c562f000 rwxp 00000000 00:00 0
7f13c562f000-7f13c5630000 ---p 00000000 00:00 0
7f13c5630000-7f13c5e30000 rwxp 00000000 00:00 0
7f13c5e30000-7f13c5e31000 ---p 00000000 00:00 0
7f13c5e31000-7f13c6631000 rwxp 00000000 00:00 0
7f13c6631000-7f13c70dd000 rw-p 00000000 00:00 0
7f13c70dd000-7f13c70de000 ---p 00000000 00:00 0
7f13c70de000-7f13c78de000 rwxp 00000000 00:00 0
7f13c8000000-7f13c802b000 rw-p 00000000 00:00 0
7f13c802b000-7f13cc000000 ---p 00000000 00:00 0
7f13cc2d7000-7f13cc417000 rw-p 00000000 00:00 0
7f13cc417000-7f13cc560000 r-xp 00000000 ca:01 26288                      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.440.33.01
7f13cc560000-7f13cc760000 ---p 00149000 ca:01 26288                      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.440.33.01
7f13cc760000-7f13cc77d000 rw-p 00149000 ca:01 26288                      /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.440.33.01
7f13cc77d000-7f13cca3d000 rw-p 00000000 00:00 0
7f13cca3d000-7f13cca48000 r-xp 00000000 ca:01 51666                      /lib/x86_64-linux-gnu/libnss_files-2.23.so
7f13cca48000-7f13ccc47000 ---p 0000b000 ca:01 51666                      /lib/x86_64-linux-gnu/libnss_files-2.23.so
7f13ccc47000-7f13ccc48000 r--p 0000a000 ca:01 51666                      /lib/x86_64-linux-gnu/libnss_files-2.23.so
7f13ccc48000-7f13ccc49000 rw-p 0000b000 ca:01 51666                      /lib/x86_64-linux-gnu/libnss_files-2.23.so
7f13ccc49000-7f13ccc4f000 rw-p 00000000 00:00 0
7f13ccc4f000-7f13ccc5a000 r-xp 00000000 ca:01 51659                      /lib/x86_64-linux-gnu/libnss_nis-2.23.so
7f13ccc5a000-7f13cce59000 ---p 0000b000 ca:01 51659                      /lib/x86_64-linux-gnu/libnss_nis-2.23.so
7f13cce59000-7f13cce5a000 r--p 0000a000 ca:01 51659                      /lib/x86_64-linux-gnu/libnss_nis-2.23.so
7f13cce5a000-7f13cce5b000 rw-p 0000b000 ca:01 51659                      /lib/x86_64-linux-gnu/libnss_nis-2.23.so
7f13cce5b000-7f13cce71000 r-xp 00000000 ca:01 51664                      /lib/x86_64-linux-gnu/libnsl-2.23.so
7f13cce71000-7f13cd070000 ---p 00016000 ca:01 51664                      /lib/x86_64-linux-gnu/libnsl-2.23.so
7f13cd070000-7f13cd071000 r--p 00015000 ca:01 51664                      /lib/x86_64-linux-gnu/libnsl-2.23.so
7f13cd071000-7f13cd072000 rw-p 00016000 ca:01 51664        Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f142911c100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f14b56ba4b0]
  [bt] (2) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389f261) [0x7f1428813261]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0611) [0x7f1428814611]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f142880a551]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38974a4) [0x7f142880b4a4]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::NDArray::Chunk::~Chunk()+0x48a) [0x7f1428a3356a]
  [bt] (7) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x6d860a) [0x7f142564c60a]
  [bt] (8) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3ab7101) [0x7f1428a2b101]
              /lib/x86_64-linux-gnu/libnsl-2.23.so
7f13cd072000-7f13cd074000 rw-p 00000000 00:00 0
7f13cd074000-7f13cd07c000 r-xp 00000000 ca:01 51670                      /lib/x86_64-linux-gnu/libnss_compat-2.23.so
7f13cd07c000-7f13cd27b000 ---p 00008000 ca:01 51670                      /lib/x86_64-linux-gnu/libnss_compat-2.23.so
7f13cd27b000-7f13cd27c000 r--p 00007000 ca:01 51670                      /lib/x86_64-linux-gnu/libnss_compat-2.23.so
7f13cd27c000-7f13cd27d000 rw-p 00008000 ca:01 51670                      /lib/x86_64-linux-gnu/libnss_compat-2.23.so
7f13cd27d000-7f13cd27e000 ---p 00000000 00:00 0
7f13cd27e000-7f13cda7e000 rwxp 00000000 00:00 0
7f13cda7e000-7f13cda7f000 ---p 00000000 00:00 0
7f13cda7f000-7f13ce27f000 rwxp 00000000 00:00 0
7f13ce27f000-7f13ce2da000 r-xp 00000000 ca:01 69289                      /usr/lib/x86_64-linux-gnu/libnl-route-3.so.200.22.0
7f13ce2da000-7f13ce4d9000 ---p 0005b000 ca:01 69289                      /usr/lib/x86_64-linux-gnu/libnl-route-3.so.200.22.0
7f13ce4d90
Segmentation fault: 11
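
The mangled symbols in the backtrace above can be read without extra tools, because simple `_ZN...E` names are just length-prefixed components. Below is a rough Python sketch that handles only plain nested names plus constructor and destructor markers. It is not a full Itanium ABI demangler (`c++filt` does this properly), and `nested_name` is a name invented here.

```python
import re

# Matches "(<mangled name>+0x<offset>)" as printed in glibc backtrace lines.
FRAME = re.compile(r"\((_ZN\w+)\+0x[0-9a-f]+\)")


def nested_name(symbol):
    """Decode the length-prefixed components of a simple _ZN...E symbol."""
    pos, parts = 3, []  # position 3 skips the "_ZN" prefix
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    marker = symbol[pos:pos + 2]
    if parts and marker in ("D0", "D1", "D2"):    # destructor variants
        parts.append("~" + parts[-1])
    elif parts and marker in ("C1", "C2", "C3"):  # constructor variants
        parts.append(parts[-1])
    return "::".join(parts)


# Tail ends of the c_lib frames from the backtrace above.
backtrace = [
    "c_lib.cpython-36m-x86_64-linux-gnu.so(_ZN6byteps6common11NcclManagerD1Ev+0xd9)[0x7f13b1752f89]",
    "c_lib.cpython-36m-x86_64-linux-gnu.so(+0x35996)[0x7f13b1735996]",
    "c_lib.cpython-36m-x86_64-linux-gnu.so(_ZN6byteps6common15RunSyncNcclOnceEv+0x4e)[0x7f13b173fa3e]",
    "c_lib.cpython-36m-x86_64-linux-gnu.so(_ZN6byteps6common12SyncNcclLoopEv+0x3d)[0x7f13b173fc5d]",
]
decoded = [nested_name(m.group(1)) for line in backtrace for m in FRAME.finditer(line)]
for name in decoded:
    print(name)
```

The first frame decodes to `byteps::common::NcclManager::~NcclManager`, so glibc reports the double free inside the `NcclManager` destructor, with `RunSyncNcclOnce` and `SyncNcclLoop` further up the stack.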
ymjiang commented 4 years ago

Can you try running with export LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4? This preloads tcmalloc as the memory allocator, which provides better memory management.

You may need to install it first with apt update && apt install libtcmalloc-minimal4.
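
Before rerunning, it may be worth confirming that the preload path actually exists, since the dynamic loader only prints an error and carries on when a preloaded object cannot be opened. This is a small sketch, assuming the path suggested above; `preload_report` is a hypothetical helper, not something BytePS provides.

```python
import os
import re


def preload_report(env=None, exists=os.path.exists):
    """Return (path, is_present) for every entry in LD_PRELOAD.

    Entries may be separated by colons or whitespace.
    """
    env = os.environ if env is None else env
    entries = [p for p in re.split(r"[:\s]+", env.get("LD_PRELOAD", "")) if p]
    return [(path, exists(path)) for path in entries]


# Inspect the current shell environment; an empty list means LD_PRELOAD is unset.
for path, present in preload_report():
    print(path, "found" if present else "MISSING")
```

If the library shows up as missing, the install step above (or a different library directory on your distribution) is the first thing to check.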

access2rohit commented 4 years ago

@ymjiang

I still get an error similar to the first one (but without the NCCL error):

(mx_byteps) ubuntu@ip-172-31-85-4:~$ bpslaunch python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32
BytePS launching worker
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:start with arguments Namespace(batch_size=32, benchmark=1, cpu_train=False, data_nthreads=4, data_train=None, data_train_idx='', data_val=None, data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INFO:root:Launch BytePS process on GPU-0
INFO:root:Launch BytePS process on GPU-1
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
INFO:root:Launch BytePS process on GPU-2
INFO:root:Launch BytePS process on GPU-3
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '0', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '72.21.198.65 30043 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/0', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '8', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LD_PRELOAD': '/usr/lib/libtcmalloc_minimal.so.4', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36
:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '72.21.198.65 30043 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============0
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '3', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '72.21.198.65 30043 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/0', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '8', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LD_PRELOAD': '/usr/lib/libtcmalloc_minimal.so.4', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36
:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '72.21.198.65 30043 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============3
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '2', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '72.21.198.65 30043 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/0', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '8', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LD_PRELOAD': '/usr/lib/libtcmalloc_minimal.so.4', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36
:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '72.21.198.65 30043 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============2
environ({'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(mx_byteps) ', 'BYTEPS_LOCAL_RANK': '1', 'MAIL': '/var/mail/ubuntu', 'SSH_CLIENT': '72.21.198.65 30043 22', 'USER': 'ubuntu', 'LD_LIBRARY_PATH_WITH_DEFAULT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda-9.0/lib/:', 'LD_LIBRARY_PATH': '/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/home/ubuntu/src/cntk/bindings/python/cntk/libs:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/efa/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:', 'SHLVL': '1', 'CONDA_SHLVL': '1', 'HOME': '/home/ubuntu', 'SSH_TTY': '/dev/pts/0', 'DMLC_PS_ROOT_URI': '10.0.0.1', 'LC_TERMINAL_VERSION': '3.3.9', 'DMLC_NUM_SERVER': '1', 'LOGNAME': 'ubuntu', 'DMLC_PS_ROOT_PORT': '1234', '_': '/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch', 'BYTEPS_LOCAL_SIZE': '4', 'PKG_CONFIG_PATH': '/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:', 'XDG_SESSION_ID': '8', 'TERM': 'xterm-256color', 'DMLC_NUM_WORKER': '1', 'PATH': 
'/home/ubuntu/anaconda3/envs/mx_byteps/bin:/home/ubuntu/anaconda3/bin/:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ubuntu/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'DMLC_ROLE': 'worker', 'XDG_RUNTIME_DIR': '/run/user/1000', 'LD_PRELOAD': '/usr/lib/libtcmalloc_minimal.so.4', 'LANG': 'en_US.UTF-8', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36
:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_PYTHON_EXE': '/home/ubuntu/anaconda3/bin/python', 'SHELL': '/bin/bash', 'CONDA_DEFAULT_ENV': 'mx_byteps', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'MODULE_VERSION': '3.2.10', 'LD_LIBRARY_PATH_WITHOUT_CUDA': '/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:', 'LC_TERMINAL': 'iTerm2', 'MODULE_VERSION_STACK': '3.2.10', 'PWD': '/home/ubuntu', 'LOADEDMODULES': '', 'CONDA_EXE': '/home/ubuntu/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'SSH_CONNECTION': '72.21.198.65 30043 172.31.85.4 22', 'PYTHONPATH': '/home/ubuntu/src/cntk/bindings/python', 'DMLC_WORKER_ID': '0', 'NVIDIA_VISIBLE_DEVICES': '0,1,2,3', 'CONDA_PREFIX': '/home/ubuntu/anaconda3/envs/mx_byteps', 'MANPATH': '/opt/aws/neuron/share/man:', 'MODULEPATH': '/etc/environment-modules/modules:/usr/share/modules/versions:/usr/Modules/$MODULE_VERSION/modulefiles:/usr/share/modules/modulefiles', 'MODULESHOME': '/usr/share/modules'})=============1

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fea6d53e100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7feaf9bd04b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7feaf9f6ed44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fea6cc33737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fea6cc36863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fea6cc2c551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fea6cb92a67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fea0c31c970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7feaf8bc2ec0]

Segmentation fault: 11

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f36d9931100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f3765fc34b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f3766361d44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f36d9026737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f36d9029863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f36d901f551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f36d8f85a67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f367870f970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f3764fb5ec0]

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f36d9931100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f3765fc34b0]

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fe8b2d89100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fe93f41b4b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7fe93f7b9d44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7fe8b247e737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7fe8b2481863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7fe8b2477551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7fe8b23dda67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7fe851b6d970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fe93e40dec0]

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7fe8b2d89100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7fe93f41b4b0]

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f5f11425100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f5f9dab74b0]
  [bt] (2) /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f5f9de55d44]
  [bt] (3) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x389d737) [0x7f5f10b1a737]
  [bt] (4) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x38a0863) [0x7f5f10b1d863]
  [bt] (5) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3896551) [0x7f5f10b13551]
  [bt] (6) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(MXEnginePushAsync+0x2f7) [0x7f5f10a79a67]
  [bt] (7) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/site-packages/byteps/mxnet/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps_mxnet_push_pull_async+0x150) [0x7f5eb0203970]
  [bt] (8) /home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f5f9caa9ec0]

Segmentation fault: 11

Stack trace:
  [bt] (0) /home/ubuntu/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8100) [0x7f5f11425100]
  [bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f5f9dab74b0]
Segmentation fault (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 139.

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 255.

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 255.

Bus error (core dumped)
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/bin/bpslaunch", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/envs/mx_byteps/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python byteps/example/mxnet/train_imagenet_byteps.py --benchmark 1 --batch-size=32' returned non-zero exit status 135.
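
A note for reading the exit statuses in the tracebacks above: bpslaunch runs the command with shell=True, and a shell conventionally reports a child killed by signal N as status 128 + N. The sketch below decodes the three statuses from this log under that assumption. The helper name `decode_exit_status` is hypothetical and not part of BytePS.

```python
import signal


def decode_exit_status(status):
    """Decode a shell exit status that follows the 128 + N convention.

    Returns (signal_number, signal_name), or (None, None) when the status
    does not look like a signal death. Assumes the status came from a shell,
    as it does when subprocess is called with shell=True.
    """
    if 128 < status < 255:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown"
        return signum, name
    return None, None


# Exit statuses reported by the four bpslaunch worker threads in the log.
for status in (139, 255, 135):
    print(status, decode_exit_status(status))
```

Under this convention, 139 and 135 correspond to signals 11 and 7, which agrees with the "Segmentation fault (core dumped)" and "Bus error (core dumped)" lines in the log. Status 255 is not a signal death under this convention.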
jasperzhong commented 4 years ago

@access2rohit Can you try pip install -U mxnet-cu100==1.5.0? It worked when I tested it a few days ago.

access2rohit commented 4 years ago

@vycezhong It worked with MXNet-1.5.0. It's strange, though, that it doesn't work with MXNet-1.5.1 or MXNet-1.6.0.

jasperzhong commented 4 years ago

@access2rohit That's really strange. I have no idea what causes it right now.

ymjiang commented 4 years ago

We will try to reproduce the problem for MXNet-1.6.0.

ymjiang commented 4 years ago

It seems that MXNet-1.6.0 releases are not stable right now: https://github.com/apache/incubator-mxnet/issues/17715.

This one works for us: pip3 install mxnet-cu100==1.6.0b20190817 --pre

eric-haibin-lin commented 4 years ago

The list of nightly and stable mxnet pip wheels can be found here: https://dist.mxnet.io/python

jasperzhong commented 4 years ago

I found that I cannot submit new issues or PRs; the button is grayed out. What happened?

bobzhuyb commented 4 years ago

> I found that I cannot submit new issues or PRs; the button is grayed out. What happened?

I don't think we have changed any configuration. This is off topic, so let's talk about it offline.

jasperzhong commented 4 years ago

@access2rohit I do not encounter any problem using mxnet==1.6.0. Is your CUDA version right? It should be 10.2 for mxnet==1.6.0. Also, you should recompile BytePS after updating MXNet.
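
Before recompiling, it may help to confirm which installation Python actually imports. In the stack trace above, libmxnet.so is loaded from /home/ubuntu/.local/lib/python3.6/site-packages, while the BytePS c_lib comes from the conda env; whether that split contributed to the crash is not established in this thread. A minimal sketch using only the standard library (the helper name `module_origin` is hypothetical):

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None.

    This only locates the module; it does not import it, so it is safe to
    run even when importing the module itself would crash.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


# Show where mxnet and byteps would be loaded from in this interpreter.
for name in ("mxnet", "byteps"):
    print(name, "->", module_origin(name))
```

If the two paths point into different site-packages directories, reinstalling both into the same environment before rebuilding BytePS is one thing to try.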

eric-haibin-lin commented 4 years ago

I find that with byteps 0.2, https://repo.mxnet.io/dist/python/cu100/mxnet_cu100-1.6.0b20200212-py2.py3-none-manylinux1_x86_64.whl works, but

https://repo.mxnet.io/dist/python/cu100/mxnet_cu100-1.6.0b20200215-py2.py3-none-manylinux1_x86_64.whl leads to a segfault.

It's probably a regression in MXNet.
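
To narrow the regression window, the build stamps in the two wheel names can be compared. The sketch below assumes the 8-digit suffix after "b" is a YYYYMMDD build date, which the names suggest but this thread does not confirm. The helper name `nightly_build_date` is hypothetical.

```python
import re
from datetime import date

# Matches e.g. "-1.6.0b20200212-" inside a wheel file name.
NIGHTLY_RE = re.compile(r"-(?P<base>\d+\.\d+\.\d+)b(?P<stamp>\d{8})-")


def nightly_build_date(wheel_url):
    """Extract the build date from a nightly wheel URL or file name.

    Assumes the 8 digits after 'b' encode YYYYMMDD.
    """
    match = NIGHTLY_RE.search(wheel_url)
    if match is None:
        raise ValueError("no nightly build stamp found in: " + wheel_url)
    stamp = match.group("stamp")
    return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))


# The two wheels reported in the comment above.
LAST_GOOD = "https://repo.mxnet.io/dist/python/cu100/mxnet_cu100-1.6.0b20200212-py2.py3-none-manylinux1_x86_64.whl"
FIRST_BAD = "https://repo.mxnet.io/dist/python/cu100/mxnet_cu100-1.6.0b20200215-py2.py3-none-manylinux1_x86_64.whl"

good = nightly_build_date(LAST_GOOD)
bad = nightly_build_date(FIRST_BAD)
print("last good:", good, "first bad:", bad, "gap in days:", (bad - good).days)
```

If nightlies exist for the days in between, testing them would shrink the window further; this thread does not say whether they do.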

ymjiang commented 4 years ago

@access2rohit @eric-haibin-lin We will fix the problem with MXNet-MKL in https://github.com/bytedance/byteps/pull/244.

eric-haibin-lin commented 4 years ago

The MXNet community passed a vote to turn on MKL-DNN by default for future releases, so I do not think #244 will help then.

Let me give the patch a try.

bobzhuyb commented 4 years ago

This is fixed, as @ymjiang mentioned.