alibaba / EasyCV

An all-in-one toolkit for computer vision
Apache License 2.0

Errors with DistributedDataParallel occurred when we trained the dino model with multiple cards #342

Open ThomasCai opened 5 months ago


Checklist

Describe the bug

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 ...
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
  File "tools/train.py", line 332, in <module>
    main()
  File "tools/train.py", line 320, in main
    train_model(
  File "/home/thomascai/codes/EasyCV/easycv/apis/train.py", line 328, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/miniconda3/envs/ev/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/thomascai/codes/EasyCV/easycv/runner/ev_runner.py", line 107, in train
    self.run_iter(data_batch, train_mode=True)
  File "/home/thomascai/codes/EasyCV/easycv/runner/ev_runner.py", line 72, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/miniconda3/envs/ev/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 42, in train_step
    and self.reducer._rebuild_buckets()):
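
The error above lists only parameter indices, not names. As a rough aid for locating the unused parameters, the sketch below parses those indices out of the message and looks them up in a list of parameter names. The helper names `parse_unused_param_indices` and `names_for_indices` and the placeholder name list are hypothetical. In a real run the list would probably be built with something like `[n for n, p in model.named_parameters() if p.requires_grad]`, on the assumption (not confirmed here) that DDP numbers parameters in that order.

```python
import re
from typing import List, Sequence


def parse_unused_param_indices(error_text: str) -> List[int]:
    """Extract the integers listed after 'did not receive grad for rank N:'."""
    match = re.search(r"did not receive grad for rank \d+:\s*([\d\s]+)", error_text)
    if match is None:
        return []
    return [int(token) for token in match.group(1).split()]


def names_for_indices(indices: Sequence[int], param_names: Sequence[str]) -> List[str]:
    """Map indices to names, skipping any index outside the name list."""
    return [param_names[i] for i in indices if 0 <= i < len(param_names)]


# Shortened copy of the message from the report above.
message = "Parameter indices which did not receive grad for rank 0: 157 158 159 ..."
indices = parse_unused_param_indices(message)

# Placeholder names for illustration only (hypothetical); replace with the
# names of the trainable parameters of the actual model.
param_names = [f"layer{i}.weight" for i in range(200)]

for name in names_for_indices(indices, param_names):
    print(name)
```

Once the names are known, it is usually easier to decide whether those parameters should contribute to the loss, be frozen, or be tolerated by passing `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, as the error message suggests. Setting `TORCH_DISTRIBUTED_DEBUG` to `INFO` or `DETAIL` before launching training, also as the message suggests, should make PyTorch report which parameters were affected.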

To Reproduce

Environment

TorchVision: 0.11.0+cu111
OpenCV: 4.9.0
MMCV: 1.4.4
EasyCV: 0.11.6


* You may add additional information that may be helpful for locating the problem, such as
    * How you installed PyTorch [e.g., pip, conda, source]
    * Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)