braun-steven / DAFNe

Code for our paper "DAFNe: A One-Stage Anchor-Free Deep Model for Oriented Object Detection".
MIT License

TypeError: div() got an unexpected keyword argument 'rounding_mode' #5

Closed Levaru closed 2 years ago

Levaru commented 2 years ago

When running the command ./tools/run.py --gpus 0 --config-file ./configs/dota-1.0/1024.yaml I receive the following error:

[02/03 12:13:18 d2.data.build]: Removed 5478 images with no usable annotations. 10271 images left.
[02/03 12:13:18 d2.data.build]: Distribution of instances among all 15 categories:
category       #instances   category        #instances   category        #instances
plane          16411        baseball-di..   823          bridge          3632
ground-trac..  838          small-vehicle   49810        large-vehicle   36233
ship           61019        tennis-court    4736         basketball-..   1118
storage-tank   9621         soccer-ball..   849          roundabout      800
harbor         13078        swimming-pool   3323         helicopter      1188
total          203479

[02/03 12:13:18 d2.data.build]: Using training sampler RepeatFactorTrainingSampler
[02/03 12:13:18 d2.data.common]: Serializing 10271 elements to byte tensors and concatenating them all ...
[02/03 12:13:19 d2.data.common]: Serialized dataset takes 15.84 MiB
[02/03 12:13:19 detectron2]: Starting training from iteration 0
/app/detectron2_repo/detectron2/structures/masks.py:368: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:983.)
  item = item.nonzero().squeeze(1).cpu().numpy().tolist()
/app/detectron2_repo/detectron2/structures/masks.py:368: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:983.)
  item = item.nonzero().squeeze(1).cpu().numpy().tolist()
/app/detectron2_repo/detectron2/structures/masks.py:368: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:983.)
  item = item.nonzero().squeeze(1).cpu().numpy().tolist()
/app/detectron2_repo/detectron2/structures/masks.py:368: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:983.)
  item = item.nonzero().squeeze(1).cpu().numpy().tolist()
Traceback:
  File "./tools/plain_train_net.py", line 597, in main
    do_train(cfg, model, resume=args.resume)
  File "./tools/plain_train_net.py", line 450, in do_train
    loss_dict = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/one_stage_detector.py", line 47, in forward
    return super().forward(batched_inputs)
  File "/app/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 301, in forward
    images = ImageList.from_tensors(images, self.backbone.size_divisibility)
  File "/app/detectron2_repo/detectron2/structures/image_list.py", line 88, in from_tensors
    max_size = (max_size + (stride - 1)).div(stride, rounding_mode="floor") * stride

Error: div() got an unexpected keyword argument 'rounding_mode'
Moving output directory from /app/results/dafne/22-02-03_13:12_default-1643890350 to /app/results/dafne/22-02-03_13:12_default-1643890350_error
Traceback (most recent call last):
  File "./tools/plain_train_net.py", line 660, in <module>
    launch(
  File "/app/detectron2_repo/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "./tools/plain_train_net.py", line 651, in main
    raise e
  File "./tools/plain_train_net.py", line 597, in main
    do_train(cfg, model, resume=args.resume)
  File "./tools/plain_train_net.py", line 450, in do_train
    loss_dict = model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/dafne/dafne/modeling/one_stage_detector.py", line 47, in forward
    return super().forward(batched_inputs)
  File "/app/detectron2_repo/detectron2/modeling/meta_arch/rcnn.py", line 301, in forward
    images = ImageList.from_tensors(images, self.backbone.size_divisibility)
  File "/app/detectron2_repo/detectron2/structures/image_list.py", line 88, in from_tensors
    max_size = (max_size + (stride - 1)).div(stride, rounding_mode="floor") * stride
TypeError: div() got an unexpected keyword argument 'rounding_mode'
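For readers wondering what the failing line in image_list.py is computing: it appears to round the maximum image size up to the next multiple of the backbone stride. That reading assumes the trailing "stride" in the traceback is the remains of a multiplication whose asterisk was lost when the log was pasted. The sketch below reproduces the same arithmetic on plain Python integers, without the rounding_mode argument that the installed PyTorch rejects. The function name round_up_to_multiple is hypothetical and is not part of detectron2.

```python
def round_up_to_multiple(size: int, stride: int) -> int:
    """Round `size` up to the nearest multiple of `stride`.

    Hypothetical helper for illustration only; it mirrors the arithmetic
    (size + (stride - 1)) floor-divided by stride, then multiplied by stride.
    """
    if stride <= 1:
        # Nothing to align to, so the size is returned unchanged.
        return size
    return (size + (stride - 1)) // stride * stride


if __name__ == "__main__":
    print(round_up_to_multiple(1000, 32))  # 1024
    print(round_up_to_multiple(1024, 32))  # 1024 (already a multiple)
```

This is only meant to show the intent of the line. It is not a patch for detectron2, where the operands are tensors and the real fix is a matching PyTorch/detectron2 version pair.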

According to this detectron2 issue, this seems to be a problem with the PyTorch version. I also ran into several problems when building the project with the command docker build -t dafne . For one, I had to add the --upgrade flag to the pip installation of tensorboard because I was getting this error:

ERROR: Exception:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pip/_internal/cli/base_command.py", line 165, in exc_logging_wrapper
    status = run_func(*args)
  File "/opt/conda/lib/python3.8/site-packages/pip/_internal/cli/req_command.py", line 205, in wrapper
    return func(self, options, args)
  File "/opt/conda/lib/python3.8/site-packages/pip/_internal/commands/install.py", line 389, in run
    to_install = resolver.get_installation_order(requirement_set)
  File "/opt/conda/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 188, in get_installation_order
    weights = get_topological_weights(
  File "/opt/conda/lib/python3.8/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 276, in get_topological_weights
    assert len(weights) == expected_node_count
AssertionError
The command '/bin/sh -c pip install tensorboard' returned a non-zero code: 2

I also tried a different PyTorch version by changing this line in the Dockerfile: FROM nvcr.io/nvidia/pytorch:21.02-py3 but this only led to other errors. Has the detectron2 project changed since this repository was last updated? Are there any packages without a specific version number involved in the build process that might have changed?

braun-steven commented 2 years ago

Hey, thanks for catching this!

Yes, you're right. Detectron2 seems to be incompatible with the PyTorch version used here, and I seem to have forgotten to specify a detectron2 (and tensorboard) version in the Dockerfile. I was able to reproduce the behavior you reported, and I've just fixed it in the latest commit, which I tested on a fresh Docker image.
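Missing version pins of this kind can be spotted mechanically. The following is a minimal sketch, not the check used in this repository: it scans Dockerfile text for pip install commands and lists packages that carry no == version specifier. The function name find_unpinned and the sample Dockerfile contents (including examplepkg and otherpkg) are hypothetical and chosen only for illustration.

```python
import re
from typing import List


def find_unpinned(dockerfile_text: str) -> List[str]:
    """Return packages from `pip install` commands that lack an `==` pin.

    Simplified on purpose: options that take a value (e.g. `-r file.txt`)
    and line continuations are not handled.
    """
    unpinned = []
    for line in dockerfile_text.splitlines():
        match = re.search(r"pip3?\s+install\s+(.*)", line)
        if not match:
            continue
        for token in match.group(1).split():
            if token == "&&":
                break  # the pip command ends here; a new shell command starts
            if token.startswith("-"):
                continue  # skip flags such as --upgrade
            if "==" not in token:
                unpinned.append(token)
    return unpinned


if __name__ == "__main__":
    # Hypothetical Dockerfile contents, not the ones from this repository.
    example = """
FROM some-base-image
RUN pip install tensorboard
RUN pip install --upgrade examplepkg==1.2.3 otherpkg
"""
    print(find_unpinned(example))  # ['tensorboard', 'otherpkg']
```

Any package reported by such a check resolves to whatever version is current at build time, which is how an image that once built cleanly can start failing later.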

Thanks for your report! Please close this issue once you can confirm that it works for you as well.

Levaru commented 2 years ago

Yes, thank you very much! It looks like the issue is fixed; the training is now running with an ETA of 1 day.