bytedance / byteps

A high performance and generic framework for distributed DNN training

broadcast_optimizer_state for pytorch needs to be able to handle NoneType params #363

Closed dbonner closed 3 years ago

dbonner commented 3 years ago

This is a proposed fix for:

https://github.com/bytedance/byteps/issues/362

It is needed because the latest PyTorch built from source can include NoneType values in optimizer parameters, so these need to be allowed in broadcast_optimizer_state.

The code in the above issue works after these proposed changes are made: with 8 GPUs on the same host, bpslaunch python byteps/example/pytorch/benchmark_byteps.py --fp16-pushpull works again.
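
As a rough illustration of the idea (not the code in this PR), option values in each optimizer param group can be split into those that can be turned into a tensor and those that are None. The sketch below works on plain dictionaries shaped like the param_groups list of a PyTorch optimizer state dict; the function name split_options and the option name hypothetical_option are made up for the example.

```python
def split_options(param_groups):
    """Split param-group options into tensor-convertible values and None-valued names.

    `param_groups` is a list of dicts shaped like optimizer.state_dict()["param_groups"].
    """
    tensorable = {}
    none_valued = []
    for index, group in enumerate(param_groups):
        for key, value in group.items():
            if key == "params":
                # Parameter ids are not options and are not broadcast as scalars.
                continue
            name = f"{index}.{key}"
            if value is None:
                # torch.Tensor([None]) raises a TypeError, so keep None out of that path.
                none_valued.append(name)
            else:
                tensorable[name] = value
    return tensorable, none_valued


# Hypothetical param group containing one None-valued option.
groups = [{"params": [0, 1], "lr": 0.01, "momentum": 0.9, "hypothetical_option": None}]
tensorable, none_valued = split_options(groups)
```

Only the entries in tensorable would then go through the existing tensor-based broadcast; how the None-valued entries are handled (skipped or sent some other way) is a design choice this sketch leaves open.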

pleasantrabbit commented 3 years ago

@dbonner thanks for the patch. Will review it shortly.

dbonner commented 3 years ago

Thanks :) They had to patch horovod for the same reason. Before the patch, the horovod pytorch benchmark failed on my 8 GPU single host. See: https://github.com/horovod/horovod/commit/6889773ea1f550042e37a219c63ee4f4200e983c

dbonner commented 3 years ago

Hi @pleasantrabbit, I just checked and this problem still exists; my pull request does fix it. Without the fix, byteps will not work with PyTorch 1.8, which is about to be released:

export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # 8 GPU machine
export DMLC_WORKER_ID=0 # your worker id
export DMLC_NUM_WORKER=1 # one worker
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=localhost
export DMLC_PS_ROOT_PORT=10000
bpslaunch python byteps/example/pytorch/benchmark_byteps.py --fp16-pushpull

BytePS launching worker
Traceback (most recent call last):
  File "/home/daniel/localgpu/byteps/example/pytorch/benchmark_byteps.py", line 72, in <module>
    bps.broadcast_optimizer_state(optimizer, root_rank=0)
  File "/home/daniel/py38/lib/python3.8/site-packages/byteps/torch/__init__.py", line 398, in broadcast_optimizer_state
    p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
(the same traceback is printed, interleaved, by each of the 8 worker processes)
Traceback (most recent call last):
  File "/home/daniel/py38/bin/bpslaunch", line 220, in <module>
    launch_bps()
  File "/home/daniel/py38/bin/bpslaunch", line 206, in launch_bps
    t[i].join()
  File "/home/daniel/py38/bin/bpslaunch", line 34, in join
    raise self.exc
  File "/home/daniel/py38/bin/bpslaunch", line 27, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/daniel/py38/bin/bpslaunch", line 176, in worker
    subprocess.check_call(command, env=my_env,
  File "/home/daniel/.pyenv/versions/3.8.7/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python /home/daniel/localgpu/byteps/example/pytorch/benchmark_byteps.py --fp16-pushpull' returned non-zero exit status 1.

pleasantrabbit commented 3 years ago

@dbonner Thanks for updating the patch. I have not forgotten this PR. I am doing some testing and hope to land it this weekend.

pleasantrabbit commented 3 years ago

@dbonner This patch is missing something: the scalars in optimizer.state_dict() are not broadcast. I am going to add more commits to your PR; would that be OK?
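
To make the point about scalars concrete, here is a minimal sketch, assuming the per-parameter state can hold plain Python numbers (such as a step counter) next to tensors. It uses plain dictionaries shaped like optimizer.state_dict() and no real communication; collect_scalars and apply_scalars are hypothetical helper names, not BytePS functions, and the strings stand in for tensors.

```python
def collect_scalars(state_dict):
    """Return {(param_id, name): value} for plain-number entries in the per-parameter state."""
    scalars = {}
    for param_id, param_state in state_dict["state"].items():
        for name, value in param_state.items():
            # Tensors (placeholders here) are not int/float, so they are skipped;
            # bool is excluded explicitly because it is a subclass of int.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                scalars[(param_id, name)] = value
    return scalars


def apply_scalars(state_dict, scalars):
    """Overwrite scalar state entries in place with the values received from the root."""
    for (param_id, name), value in scalars.items():
        state_dict["state"][param_id][name] = value


# Root rank has trained for 120 steps; the other worker starts from 0.
root_state = {"state": {0: {"step": 120, "exp_avg": "tensor placeholder"}}, "param_groups": []}
worker_state = {"state": {0: {"step": 0, "exp_avg": "tensor placeholder"}}, "param_groups": []}

# In a real setup the collected scalars would be sent from the root rank to every worker.
apply_scalars(worker_state, collect_scalars(root_state))
```

If only tensors are broadcast, entries like the step counter above would stay at their local values on non-root workers, which is the gap described in this comment.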

dbonner commented 3 years ago

Yes, that is OK. Thanks. :)

pleasantrabbit commented 3 years ago

@dbonner I reworked this patch and submitted it as https://github.com/bytedance/byteps/pull/410