ilmarkov opened this issue 4 years ago
Does it also happen if you use the official bytepsimage/pytorch image?
@ymjiang, that image has a build error: there is a conflict between torchvision and pillow.
So when I run my script it fails with an error:
ImportError: cannot import name 'PILLOW_VERSION'
When I built a local image from the sources, I fixed it by adding "pillow<7" to the corresponding line.
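As far as I can tell, the root of the conflict is that the torchvision build in the image still imports the PILLOW_VERSION constant, which Pillow 7.0 removed. A minimal sketch of the check (my assumption about the image contents, roughly torchvision <= 0.4.x):

```python
# Older torchvision (<= 0.4.x) does roughly:
#     from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
# Pillow 7.0 removed PILLOW_VERSION, so that import raises ImportError.
# Pinning "pillow<7" in the image (pip install "pillow<7") avoids it.
import PIL

if int(PIL.__version__.split(".")[0]) >= 7:
    print("Pillow %s has no PILLOW_VERSION; pin pillow<7 for this torchvision"
          % PIL.__version__)
else:
    from PIL import PILLOW_VERSION  # available on Pillow < 7
    print("Pillow", PILLOW_VERSION, "is compatible")
```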
Maybe it's related to this issue. K80 GPUs are a bit old and we don't have a proper environment for testing them.
https://github.com/bytedance/byteps/issues/165#issuecomment-560082086
@bobzhuyb I tried the proposed solution of moving tensors to the GPU, on both K80 and V100 GPUs. The result is the same:
F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
Aborted (core dumped)
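To be concrete, the workaround I tried looks roughly like this (a minimal sketch of the byteps.torch setup with a placeholder model and a random batch, not the actual mnist example): the model and every batch are moved to the local GPU before the optimizer is wrapped.

```python
import torch
import byteps.torch as bps

bps.init()
torch.cuda.set_device(bps.local_rank())

# Model lives on the local GPU *before* the optimizer is wrapped,
# as suggested in the referenced comment.
model = torch.nn.Linear(784, 10).cuda()   # placeholder model, not the mnist net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)

# Inputs are also moved to the GPU before the forward pass.
data = torch.randn(32, 784).cuda()
target = torch.randint(0, 10, (32,)).cuda()

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(data), target)
loss.backward()
optimizer.step()
```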
Are you talking about your first or second scenario? From the output, I think it's the first scenario (1 worker x 8 GPUs). This error is weird because byteps is merely calling NCCL with DMLC_NUM_WORKER=1. Have you also tried PyTorch's DDP or Horovod? Do any of them work properly?
By the way, I found that you mount the same folder into multiple workers as /tmp
https://gist.github.com/ilmarkov/610165ddba3c602451b38f28d83575d8#file-start_worker-sh-L23
This may cause problems for byteps since it creates its per-worker control sockets in /tmp:
https://github.com/bytedance/byteps/blob/master/byteps/common/communicator.h#L35
You can try setting BYTEPS_SOCKET_PATH to a path other than /tmp. See here: https://github.com/bytedance/byteps/blob/master/byteps/common/communicator.cc#L97
I am not sure whether this is the problem, though.
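If you want to keep the shared mount, something like this in each worker is what I have in mind (a rough sketch; the directory name is just an example, any path that is not shared between the containers should do):

```python
# Rough sketch (not byteps code): give each worker its own socket directory
# so the per-worker control sockets don't collide when /tmp is a shared mount.
# DMLC_WORKER_ID is the worker index you already set when launching.
import os

worker_id = os.environ.get("DMLC_WORKER_ID", "0")
socket_dir = "/var/run/byteps_%s" % worker_id   # example path, not shared across workers
os.makedirs(socket_dir, exist_ok=True)

# Must be in the environment before byteps initializes.
os.environ["BYTEPS_SOCKET_PATH"] = socket_dir
```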
I am talking about the first scenario, the one without a parameter server. The behaviour in the second scenario is a bit different.
I checked Horovod NCCL training in a docker container; it works well. Another thing I tried: I used the official byteps image, ran a container, reinstalled pillow, and tried to run train_mnist_byteps.py there. It failed (even with the fix mentioned in the issue you referenced). All runs were made on a 4xV100 machine.
However, benchmark_byteps.py works fine, so the issue seems to be with the example mnist code. I will check the second scenario with the benchmark code tomorrow and let you know if it works.
benchmark_byteps.py and benchmark_cross_barrier_byteps.py work fine in both scenarios.
Also, I found out that in order to launch two consecutive benchmarks in the second scenario, I need to restart the server and scheduler. Is that expected behaviour?
@ilmarkov We have fixed the pillow problem in bytepsimage/pytorch, but cannot reproduce your segmentation fault -- all pytorch examples run well using that image. Would you pull the latest tag and try again?
Also, I found out that in order to launch two consecutive benchmarks in the second scenario, I need to restart the server and scheduler. Is that expected behaviour?
Yes, this is expected.
@ymjiang Sorry for the late response. I tried it again: the mnist example still crashes in the 2-worker, 1-server mode, and it outputs meaningless accuracy in any kind of training. The imagenet and synthetic benchmarks work fine, though.
However, I have problems with distributed training, the same as described in this issue. I have one node where 1 server, 1 scheduler and 1 worker are running, and another node where 1 worker is running. Training can't start; it hangs with the following output:
:287: Start ZMQ recv thread
[11:53:39] src/van.cc:478: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=<ip_address>, port=<port>, is_recovery=0 } }. THIS IS NOT DATA MSG!
The nodes are accessible to each other; I checked with ping and passwordless ssh. Are there any specific requirements for the network interfaces or setup?
@ilmarkov Are you running on a public cloud or anywhere that may have security rules? BytePS uses different TCP ports from ssh. For example, when you configure the scheduler to run on port 1234, can you confirm that port 1234 can be accessed from other machines? You can use telnet to test connecting to it.
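If telnet is not installed in the container, a quick equivalent check in Python (the address here is a placeholder for your actual DMLC_PS_ROOT_URI / DMLC_PS_ROOT_PORT):

```python
# Minimal reachability check, equivalent to "telnet <scheduler_ip> 1234".
# Replace host/port with your actual DMLC_PS_ROOT_URI / DMLC_PS_ROOT_PORT.
import socket

host, port = "10.0.0.1", 1234   # placeholder scheduler address and port
try:
    sock = socket.create_connection((host, port), timeout=5)
    sock.close()
    print("scheduler port %d is reachable" % port)
except OSError as err:
    print("cannot reach %s:%d -> %s" % (host, port, err))
```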
Describe the bug
Running all the instances (2 workers, a scheduler, and a server) on one node with multiple GPUs crashes when one of the workers is asked to run on several GPUs.
As a setup, I have a machine with 8 GPUs. I launch all instances in a docker container built from Dockerfile.pytorch taken from the master branch.
To Reproduce
I am using the following scripts: start_serv.sh, start_worker.sh. The scripts use the following env files: sched.env, serv.env, worker.env.
I tried 2 scenarios:
a) Running a single worker instance in docker without a server and scheduler (setting DMLC_NUM_WORKER=1; an illustrative env sketch follows the scenario descriptions). Start worker 0: ./run_worker.sh 0
I got a segmentation fault with the error message: F byteps/common/nccl_manager.cc:37] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
b) Running 2 workers, each having 1 GPU, plus a scheduler and a server (setting DMLC_NUM_WORKER=1 and fixing run_worker.sh accordingly). Training started, but it didn't train the model properly and crashed after the 2nd epoch with the following output: log
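For clarity, the worker environment for scenario a boils down to roughly the following (an illustrative sketch from memory; the authoritative values are in the linked env files, and the scheduler address/port are placeholders):

```python
# Illustrative environment for scenario a: a single worker driving all 8 GPUs,
# no parameter server. Real values live in the gist's env files; the
# scheduler address/port below are placeholders.
worker_env = {
    "DMLC_ROLE": "worker",
    "DMLC_NUM_WORKER": "1",            # the single-worker mode the segfault occurs in
    "DMLC_NUM_SERVER": "0",
    "DMLC_WORKER_ID": "0",
    "DMLC_PS_ROOT_URI": "127.0.0.1",   # placeholder scheduler IP
    "DMLC_PS_ROOT_PORT": "1234",       # placeholder scheduler port
    "NVIDIA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",  # set on the docker container
}

# These would be passed to the container, e.g. as "docker run -e KEY=VALUE ...".
print("\n".join("%s=%s" % kv for kv in sorted(worker_env.items())))
```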
Expected behavior
I expected the workers to run the mnist example properly.
Environment
Everything was run in docker containers on an AWS EC2 instance with 8 Tesla K80 GPUs (Driver Version: 418.87.00, CUDA Version: 10.1). Docker version 18.09.7, build 2d0083d.