InterDigitalInc / CompressAI

A PyTorch library and evaluation platform for end-to-end compression research
https://interdigitalinc.github.io/CompressAI/
BSD 3-Clause Clear License

Support DistributedDataParallel and DataParallel, and publish Python package #30

Closed yoshitomo-matsubara closed 3 years ago

yoshitomo-matsubara commented 3 years ago

First of all, thank you for the great package!

1. Support DistributedDataParallel and DataParallel

I'm working on large-scale experiments whose training takes quite a long time, and I'm wondering whether this framework can support DataParallel and DistributedDataParallel.

The current examples/train.py appears to support DataParallel via CustomDataParallel, but it returned the following error:

Traceback (most recent call last):
  File "examples/train.py", line 369, in <module>
    main(sys.argv[1:])
  File "examples/train.py", line 348, in main
    args.clip_max_norm,
  File "examples/train.py", line 159, in train_one_epoch
    out_net = model(d)
  File "/home/yoshitom/.local/share/virtualenvs/yoshitom-lJAkl1qx/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yoshitom/.local/share/virtualenvs/yoshitom-lJAkl1qx/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 160, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/yoshitom/.local/share/virtualenvs/yoshitom-lJAkl1qx/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/yoshitom/.local/share/virtualenvs/yoshitom-lJAkl1qx/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 140, in replicate
    param_idx = param_indices[param]
KeyError: Parameter containing:
tensor([[[-10.,   0.,  10.]],

        [[-10.,   0.,  10.]],

        ... (the same [[-10., 0., 10.]] row repeated for the remaining entries) ...

        [[-10.,   0.,  10.]]], device='cuda:0', requires_grad=True)

(pipenv run python examples/train.py --data ./dataset/ --batch-size 4 --cuda on a machine with 3 GPUs)

When I comment out these two lines https://github.com/InterDigitalInc/CompressAI/blob/master/examples/train.py#L333-L334 , it seems to work fine:

/home/yoshitom/.local/share/virtualenvs/yoshitom-lJAkl1qx/lib/python3.6/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
Train epoch 0: [0/5000 (0%)]    Loss: 183.278 | MSE loss: 0.278 |   Bpp loss: 2.70 |    Aux loss: 5276.71
Train epoch 0: [40/5000 (1%)]   Loss: 65.175 |  MSE loss: 0.096 |   Bpp loss: 2.70 |    Aux loss: 5273.95
Train epoch 0: [80/5000 (2%)]   Loss: 35.178 |  MSE loss: 0.050 |   Bpp loss: 2.69 |    Aux loss: 5271.21
Train epoch 0: [120/5000 (2%)]  Loss: 36.634 |  MSE loss: 0.052 |   Bpp loss: 2.68 |    Aux loss: 5268.45
Train epoch 0: [160/5000 (3%)]  Loss: 26.010 |  MSE loss: 0.036 |   Bpp loss: 2.68 |    Aux loss: 5265.67
...
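For context: as far as I can tell, CustomDataParallel in examples/train.py is essentially an nn.DataParallel subclass that forwards attribute lookups (e.g. aux_loss) to the wrapped module. A rough, untested sketch of that idea (the class name below is mine, not the repo's):

# Sketch of a DataParallel wrapper that forwards unknown attribute lookups
# (e.g. aux_loss) to the wrapped module; this is my reading of what
# CustomDataParallel does, not the library's actual implementation.
import torch.nn as nn

class AttrForwardingDataParallel(nn.DataParallel):
    def __getattr__(self, name):
        try:
            return super().__getattr__(name)
        except AttributeError:
            # Fall back to the underlying (non-replicated) module.
            return getattr(self.module, name)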

Could you please fix the issue and also support DistributedDataParallel? If you need more examples to help identify the components causing this issue, let me know; I have a few more error messages for both DataParallel and DistributedDataParallel with different network architectures (containing CompressionModel).
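For reference, the kind of DistributedDataParallel setup I have in mind is roughly the following. This is an untested sketch on my side: the zoo entry point, the LOCAL_RANK handling, and the find_unused_parameters flag are assumptions about how it could be wired up, not something CompressAI provides today.

# Untested sketch: wrapping a CompressAI model in DistributedDataParallel,
# launched e.g. with:
#   python -m torch.distributed.launch --use_env --nproc_per_node=3 this_script.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from compressai.zoo import bmshj2018_factorized  # any zoo model would do here

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher with --use_env
    torch.cuda.set_device(local_rank)

    net = bmshj2018_factorized(quality=3).to(local_rank)
    # find_unused_parameters may be needed because the entropy bottleneck keeps
    # parameters (e.g. quantiles) that are not touched by every forward pass
    # (my assumption, to be verified).
    ddp_net = DDP(net, device_ids=[local_rank], find_unused_parameters=True)

    # The aux loss still has to be read from the underlying module:
    aux_loss = ddp_net.module.aux_loss()
    print(f"rank {dist.get_rank()}: initial aux loss = {aux_loss.item():.2f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()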

2. Publish Python package

It would be much more convenient if you could publish this framework as a Python package so that we can install it with pip install compressai.

Thank you!

jbegaint commented 3 years ago

Hi, thanks for the report.

DistributedDataParallel might be supported later, but I don't have a lot of cycles to work on this right now.

I cannot reproduce the DataParallel crash; can you share your Python/PyTorch versions? (It looks like Python 3.6 and torch 1.7.0, but just to confirm.) If you comment out the two lines, you won't be using multiple GPUs.

We might release the compressai package later, but again, no ETA on this.

yoshitomo-matsubara commented 3 years ago

Hi @jbegaint ,

I was using Python 3.6.9 and torch==1.7.1 on a machine with 3 GPUs. Yes, I understand that commenting out the two lines means multiple GPUs are not used; I just wanted to show that training works on a single GPU but not on multiple ones. (I confirmed that torch.cuda.device_count() returns 3.)

Looking forward to the support for DistributedDataParallel and Python package release!

jbegaint commented 3 years ago

thanks for the information! I can't reproduce the DataParallel issue. Can you make sure you have the latest version of compressai installed?

yoshitomo-matsubara commented 3 years ago

You're right. I had checked out version b22bda1f4b9cf61e154ecabd019eb1935cf00822 locally (including examples/train.py), but somehow an old version of the package remained in my virtual environment. Reinstalling it resolved the DataParallel issue on multiple GPUs.

jbegaint commented 3 years ago

ok great! keeping this open for the DistributedDataParallel support.

yoshitomo-matsubara commented 3 years ago

Thank you! I'll be looking forward to the DistributedDataParallel support as it will help us save training time significantly.

jbegaint commented 3 years ago

@yoshitomo-matsubara I've uploaded some test wheels (for linux and macos) on pypi.org, let me know how it goes.

yoshitomo-matsubara commented 3 years ago

@jbegaint It works well on my machine (Ubuntu 20.04 LTS) and is very helpful for me, thank you for publishing it!
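For anyone else trying the wheels, here is a minimal post-install check along the lines of what I did (the specific zoo model below is just an example):

# Minimal check that the published wheel imports and can build a model.
import compressai
from compressai.zoo import bmshj2018_factorized

print(compressai.__version__)  # assuming the version is exposed at the top level
net = bmshj2018_factorized(quality=1, pretrained=True).eval()
print(sum(p.numel() for p in net.parameters()), "parameters")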

As you may know, since this repo is public, you can automate the publishing process (for example, triggered by a GitHub release) so that the Python package is built and uploaded automatically, which lowers the barrier to publishing and maintaining the package on PyPI.

jbegaint commented 3 years ago

Great :-). Yes, I've set up an automated GitHub Action to build and publish the wheels on pushed tags.

hongge831 commented 3 years ago

When will DistributedDataParallel be supported? Looking forward to it!

jbegaint commented 3 years ago

Hi, I don't have an ETA for this yet.

jbegaint commented 3 years ago

We'll revisit DDP support at a later date.

YannDubs commented 3 years ago

I agree that DDP support would be great!