Errors occur at the beginning of training

NJUSTghw commented 1 year ago

/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback:
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/tools/plain_train_net.py", line 601, in main
    do_train(cfg, model, resume=args.resume)
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/tools/plain_train_net.py", line 468, in do_train
    optimizer.step()
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 151, in step
    sgd(params_with_grad,
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 202, in sgd
    func(params,
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 229, in _single_tensor_sgd
    d_p = d_p.add(param, alpha=weight_decay)

Error: add(): argument 'alpha' must be Number, not NoneType
Traceback:
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/tools/plain_train_net.py", line 601, in main
    do_train(cfg, model, resume=args.resume)
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/tools/plain_train_net.py", line 468, in do_train
    optimizer.step()
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 151, in step
    sgd(params_with_grad,
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 202, in sgd
    func(params,
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 229, in _single_tensor_sgd
    d_p = d_p.add(param, alpha=weight_decay)

Error: add(): argument 'alpha' must be Number, not NoneType
Traceback (most recent call last):
  File "./tools/plain_train_net.py", line 664, in <module>
    launch(
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/tools/plain_train_net.py", line 655, in main
    raise e
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/tools/plain_train_net.py", line 601, in main
    do_train(cfg, model, resume=args.resume)
  File "/media/XXW/98A0C693A0C676F2/ubuntu-temp/DAFNe-master/tools/plain_train_net.py", line 468, in do_train
    optimizer.step()
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 151, in step
    sgd(params_with_grad,
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 202, in sgd
    func(params,
  File "/home/XXW/anaconda3/envs/DAFNe/lib/python3.8/site-packages/torch/optim/sgd.py", line 229, in _single_tensor_sgd
    d_p = d_p.add(param, alpha=weight_decay)
TypeError: add(): argument 'alpha' must be Number, not NoneType

Could you suggest how to fix this? Thank you very much!
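The last frame of the traceback shows `weight_decay` reaching `add()` as `None`, which suggests that at least one optimizer parameter group carries a missing hyperparameter. As a rough diagnostic sketch (not part of DAFNe or detectron2), the helper below scans a list of parameter-group dictionaries, shaped like PyTorch's `optimizer.param_groups`, for numeric hyperparameters set to `None`. The function name `find_none_hyperparams` and the example data are hypothetical.

```python
# Hyperparameters that torch.optim.SGD expects to be numbers.
NUMERIC_KEYS = ("lr", "momentum", "dampening", "weight_decay")


def find_none_hyperparams(param_groups):
    """Return (group_index, key) pairs whose numeric hyperparameter is None.

    `param_groups` is a list of dicts shaped like `optimizer.param_groups`.
    Keys outside NUMERIC_KEYS are ignored, since some of them may
    legitimately be None.
    """
    problems = []
    for index, group in enumerate(param_groups):
        for key in NUMERIC_KEYS:
            if key in group and group[key] is None:
                problems.append((index, key))
    return problems


# Stand-in data for illustration only; in a real run one would pass
# `optimizer.param_groups` just before the first `optimizer.step()`.
example_groups = [
    {"params": [], "lr": 0.01, "momentum": 0.9, "weight_decay": 0.0001},
    {"params": [], "lr": 0.01, "momentum": 0.9, "weight_decay": None},
]

for group_index, key in find_none_hyperparams(example_groups):
    print(f"param group {group_index}: {key} is None")
```

Any group reported here would trigger the same `TypeError` once the optimizer reaches it.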

braun-steven commented 1 year ago

Which command did you run? Do you have the requirements correctly installed? Please provide more information.

NJUSTghw commented 1 year ago

I have installed the requirements correctly. I am trying to train the model without Docker, using the following command:

NVIDIA_VISIBLE_DEVICES=0,1 DAFNE_DATA_DIR=./data/dota-split/ ./tools/plain_train_net.py --num-gpus 2 --config-file ./configs/dota-1.5/1024.yaml MODEL.WEIGHTS ./dota-1.5-r101-ms.pth

braun-steven commented 1 year ago

https://github.com/facebookresearch/detectron2/issues/3964

The issue is that you're using a newer detectron2 version. The Dockerfile specifically installs version 0.5 (https://github.com/steven-lang/DAFNe/blob/b13912041a263904cf26ca5f3468c6bc64ce800c/Dockerfile#L21), which does not have this issue. I do not support newer versions of detectron2, as I'm no longer actively maintaining this repository.
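When training outside Docker, it can help to confirm which detectron2 version the environment actually has before launching. The following is a minimal sketch using only the Python standard library; the helper names `matches_pinned` and `check_version` are hypothetical, and the assumption that a local build suffix such as `+cu111` may follow the version number is mine, not something stated in this thread.

```python
from importlib import metadata

# Version pinned in the DAFNe Dockerfile, according to the comment above.
PINNED_DETECTRON2 = "0.5"


def matches_pinned(installed, pinned):
    """True if `installed` equals `pinned`, optionally followed by a
    patch component (".x") or a local build tag ("+...")."""
    return (
        installed == pinned
        or installed.startswith(pinned + ".")
        or installed.startswith(pinned + "+")
    )


def check_version(package, pinned):
    """Return (ok, installed_version); installed_version is None if the
    package is not installed in the current environment."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False, None
    return matches_pinned(installed, pinned), installed


ok, installed = check_version("detectron2", PINNED_DETECTRON2)
if installed is None:
    print("detectron2 is not installed in this environment")
elif ok:
    print(f"detectron2 {installed} matches the pinned {PINNED_DETECTRON2}")
else:
    print(f"detectron2 {installed} differs from the pinned {PINNED_DETECTRON2}")
```

If the reported version differs from 0.5, installing the version the Dockerfile uses is the route suggested above.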

braun-steven / DAFNe

Errors occur at the beginning of training #11