bytedance / byteps

A high performance and generic framework for distributed DNN training

Segmentation fault docker #201

Open ilmarkov opened 4 years ago

ilmarkov commented 4 years ago

Describe the bug

Running all instances (2 workers, a scheduler, and a server) on one node with multiple GPUs crashes when one of the workers is asked to run on several GPUs.

My setup is a machine with 8 GPUs. I launch all instances in a Docker container built from the Dockerfile.pytorch taken from the master branch.

To Reproduce

I am using the following scripts: start_worker.sh and start_serv.sh. The scripts use the following env files:

sched.env

DMLC_NUM_WORKER=2
DMLC_ROLE=scheduler
DMLC_NUM_SERVER=1
DMLC_PS_ROOT_URI=127.0.0.1
DMLC_PS_ROOT_PORT=1234

serv.env

DMLC_NUM_WORKER=2
DMLC_ROLE=server
DMLC_NUM_SERVER=1
DMLC_PS_ROOT_URI=127.0.0.1
DMLC_PS_ROOT_PORT=1234
MXNET_OMP_MAX_THREADS=4

worker.env

DMLC_NUM_WORKER=2
DMLC_ROLE=worker
DMLC_NUM_SERVER=1
DMLC_PS_ROOT_URI=127.0.0.1
DMLC_PS_ROOT_PORT=1234
BYTEPS_LOG_LEVEL=debug
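
For context, a minimal sketch of what a worker launch script of this kind might do (the real start_worker.sh is linked above; the launcher path, example path, and env-file handling below are assumptions, not taken from it):

```python
# Hypothetical stand-in for start_worker.sh / run_worker.sh: load worker.env,
# add the per-worker ID, and hand off to the BytePS launcher.
import os
import subprocess
import sys

def load_env_file(path: str) -> None:
    """Export KEY=VALUE lines from an env file into this process's environment."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                os.environ[key] = value

worker_id = sys.argv[1]                   # e.g. "0" or "1", as in ./run_worker.sh 0
load_env_file("worker.env")
os.environ["DMLC_WORKER_ID"] = worker_id  # distinguishes the two workers
# Launcher and example paths as in the official images; they may differ locally.
subprocess.run(
    ["python3", "/usr/local/byteps/launcher/launch.py",
     "python3", "/usr/local/byteps/example/pytorch/train_mnist_byteps.py"],
    check=True,
)
```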

I tried 2 scenarios:

a) Running a single worker instance in Docker, without the server and scheduler (setting DMLC_NUM_WORKER=1). Start worker 0: ./run_worker.sh 0. I got a segmentation fault and this error message:

F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered

b) Running 2 workers (each with 1 GPU), a scheduler, and a server (setting DMLC_NUM_WORKER=1 and adjusting run_worker.sh accordingly):

  1. start server: ./run_serv.sh serv
  2. start scheduler: ./run_serv.sh sched
  3. start worker 0: ./run_worker.sh 0
  4. start worker 1: ./run_worker.sh 1

Training started, but the model did not train properly and it crashed after the 2nd epoch with the following output: log

Expected behavior: I expected the workers to run the mnist example properly.

Environment (please complete the following information): Everything was run in Docker containers on an AWS EC2 instance with 8 Tesla K80 GPUs (Driver Version: 418.87.00, CUDA Version: 10.1). Docker version 18.09.7, build 2d0083d.

ymjiang commented 4 years ago

Does it also happen if you use the official bytepsimage/pytorch?

ilmarkov commented 4 years ago

@ymjiang, that image is built with an error: there is a conflict between torchvision and pillow, so when I run my script it fails with ImportError: cannot import name 'PILLOW_VERSION'. When I built a local image from source, I fixed the offending line by adding "pillow<7".
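
For reference, this is the conflict in a nutshell (an illustrative probe; torchvision 0.4.x does `from PIL import PILLOW_VERSION`, and that constant was removed in Pillow 7.0):

```python
# Illustrative probe for the torchvision/pillow conflict described above.
import PIL
print(PIL.__version__)  # with Pillow >= 7.0 there is no PIL.PILLOW_VERSION any more

# torchvision 0.4.x imports PILLOW_VERSION internally, so on Pillow >= 7 this
# raises: ImportError: cannot import name 'PILLOW_VERSION'
import torchvision  # noqa: F401
```

Pinning "pillow<7" in the Dockerfile (as done above) or upgrading torchvision avoids the mismatch.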

bobzhuyb commented 4 years ago

Maybe it's related to this issue. K80 GPUs are a bit old and we don't have a proper environment for testing:

https://github.com/bytedance/byteps/issues/165#issuecomment-560082086

ilmarkov commented 4 years ago

@bobzhuyb I tried the proposed solution of moving tensors to the GPU, on both K80 and V100 GPUs. The result is the same:

F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
Aborted (core dumped)
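
For reference, a minimal sketch of the kind of change tried here (keeping the model and every MNIST batch on the GPU), assuming a standard PyTorch training loop; the stand-in model and names below are illustrative, not the example's actual code:

```python
# Illustrative: keep the model and every batch on the GPU that BytePS was told
# to use (BYTEPS_LOCAL_RANK is set by the BytePS launcher).
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda", int(os.environ.get("BYTEPS_LOCAL_RANK", "0")))
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(data: torch.Tensor, target: torch.Tensor) -> float:
    # The relevant part: move the batch to the model's device before
    # forward/backward, so no tensor involved in the update lives on the host.
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    loss = F.nll_loss(F.log_softmax(model(data), dim=1), target)
    loss.backward()
    optimizer.step()
    return loss.item()
```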
bobzhuyb commented 4 years ago

Are you talking about your first or second scenario? From the output, I think it's the first one (1 worker x 8 GPUs). This error is weird, because with DMLC_NUM_WORKER=1 byteps is merely calling NCCL. Have you also tried PyTorch's DDP or Horovod? Do either of them work properly?
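
If it helps to isolate the problem, here is a minimal NCCL sanity check that leaves byteps out entirely (a sketch; it assumes a launcher such as torchrun or torch.distributed.launch --use_env that sets the rank environment variables, and the file name is hypothetical):

```python
# nccl_check.py -- plain torch.distributed + NCCL all-reduce, no byteps involved.
# Example launch: torchrun --nproc_per_node=8 nccl_check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

x = torch.ones(1 << 20, device="cuda") * dist.get_rank()
dist.all_reduce(x)  # default op is SUM across all ranks
# Every rank should print the same value: 0 + 1 + ... + (world_size - 1).
print(f"rank {dist.get_rank()}: {x[0].item()}")
```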

By the way, I found that you mount the same folder into multiple workers as /tmp: https://gist.github.com/ilmarkov/610165ddba3c602451b38f28d83575d8#file-start_worker-sh-L23 This may cause problems for byteps, since it creates its per-worker control sockets in /tmp: https://github.com/bytedance/byteps/blob/master/byteps/common/communicator.h#L35

You can try setting BYTEPS_SOCKET_PATH to a path other than /tmp. See here: https://github.com/bytedance/byteps/blob/master/byteps/common/communicator.cc#L97

I am not sure whether this is the problem, though.
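
If you want to try it, here is a sketch of the per-worker setting (the directory layout is just illustrative; BYTEPS_SOCKET_PATH is the variable read in communicator.cc linked above, and it must be set before byteps is initialized):

```python
# Illustrative: give each worker its own socket directory so two workers that
# share the mounted /tmp don't collide on byteps' per-worker control sockets.
import os

worker_id = os.environ.get("DMLC_WORKER_ID", "0")
socket_dir = f"/tmp/byteps_sockets_{worker_id}"   # hypothetical layout
os.makedirs(socket_dir, exist_ok=True)
os.environ["BYTEPS_SOCKET_PATH"] = socket_dir     # read by byteps at init time
```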

ilmarkov commented 4 years ago

I am talking about the first scenario, the one without the parameter server. The behaviour in the second scenario is a bit different.

I checked Horovod NCCL training in a Docker container; it works well. Another thing I tried: I took the official byteps image, ran a container, reinstalled pillow, and tried to run train_mnist_byteps.py there. It failed (even with the fix mentioned in the issue you referenced). All runs were made on a 4xV100 machine.

However, benchmark_byteps.py works fine, so the issue seems to be in the mnist example code. I will check the second scenario with the benchmark code tomorrow and let you know if it works.

ilmarkov commented 4 years ago

benchmark_byteps.py and benchmark_cross_barrier_byteps.py work fine in both scenarios. I also found out that in order to launch two consecutive benchmarks in the second scenario, I need to restart the server and the scheduler. Is that expected behaviour?

ymjiang commented 4 years ago

@ilmarkov We have fixed the pillow problem in bytepsimage/pytorch, but cannot reproduce your segmentation fault -- all pytorch examples run well using that image. Would you pull the latest tag and try again?

I also found out that in order to launch two consecutive benchmarks in the second scenario, I need to restart the server and the scheduler. Is that expected behaviour?

Yes, this is expected.

ilmarkov commented 4 years ago

@ymjiang Sorry for the late response. I tried it again: the mnist example still crashes in the 2-worker, 1-server mode, and it reports meaningless accuracy in any kind of training. The imagenet and synthetic benchmarks work fine, though.

However, I have problems with distributed training, the same as described in the issue. I have one node running 1 server, 1 scheduler, and 1 worker, and another node running 1 worker. Training can't start; it hangs with this output:

:287: Start ZMQ recv thread
[11:53:39] src/van.cc:478: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, 
node={ role=worker, ip=<ip_address>, port=<port>, is_recovery=0 } }. THIS IS NOT DATA MSG!

The nodes can reach each other; I checked with ping and passwordless ssh. Are there any specific requirements for the network interfaces or setup?

bobzhuyb commented 4 years ago

@ilmarkov Are you running on a public cloud or anywhere else that may have security rules? BytePS uses different TCP ports from ssh. For example, when you configure the scheduler to run on port 1234, can you confirm that port 1234 can be accessed from the other machines? You can use telnet to test connecting to it.
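
For reference, a quick reachability probe you could run from the second node (an alternative to telnet; the host and port below are placeholders for the scheduler's DMLC_PS_ROOT_URI and DMLC_PS_ROOT_PORT):

```python
# Plain TCP connect test: can this machine reach the scheduler's port?
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholders: substitute the scheduler's real IP and DMLC_PS_ROOT_PORT.
print(can_connect("10.0.0.1", 1234))
```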