FacePerceiver / FaRL

FaRL for Facial Representation Learning [Official, CVPR 2022]
https://arxiv.org/abs/2112.03109
MIT License

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation!! #3

Closed Erfun76 closed 2 years ago

Erfun76 commented 2 years ago

I tried to train face parsing using the following command, but I got this error:

```
python -m blueprint.run \
    farl/experiments/face_parsing/train_lapa_farl-b-ep16_448_refinebb.yaml \
    --exp_name farl --blob_root ./blob
```

```
====== RUNNING farl/experiments/face_parsing/train_lapa_farl-b-ep16_448_refinebb.yaml ======
blueprint: Parsing farl/experiments/face_parsing/train_lapa_farl-b-ep16_448_refinebb.yaml
DistributedGPURun: init_process_group: 0/1
blueprint: Parsing farl/experiments/face_parsing/./trainers/lapa_farl.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../augmenters/lapa/train.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../augmenters/lapa/test.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../augmenters/lapa/test_post.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../networks/farl.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../scorers/lapa.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../optimizers/refine_backbone.yaml
Mon Apr 25 11:21:11 2022 - farl_0 - outputs_dir: ./blob/outputs/farl/face_parsing.train_lapa_farl-b-ep16_448_refinebb
Mon Apr 25 11:21:11 2022 - farl_0 - states_dir: ./blob/states/farl/face_parsing.train_lapa_farl-b-ep16_448_refinebb
Mon Apr 25 11:21:11 2022 - farl_0 - locating the latest loadable state ...
Mon Apr 25 11:21:11 2022 - farl_0 - no valid state files found in ./blob/states/farl/face_parsing.train_lapa_farl-b-ep16_448_refinebb
Mon Apr 25 11:21:11 2022 - farl_0 - There will be 6056 training steps in this epoch.
loss=2.4654557704925537
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/blueprint/run.py", line 69, in <module>
    _main()
  File "/usr/local/lib/python3.7/dist-packages/blueprint/run.py", line 65, in _main
    runnable()
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/distributed.py", line 123, in __call__
    _single_thread_run, args=(num_gpus, self), nprocs=num_gpus, join=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
```

```
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/distributed.py", line 68, in _single_thread_run
    local_run()
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/trainer.py", line 194, in __call__
    self._backward(loss)
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/trainer.py", line 120, in _backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 768, 28, 28]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
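The hint at the end refers to torch.autograd.set_detect_anomaly: with it enabled, a failing backward() additionally prints the forward-pass stack trace of the operation whose saved tensor was later modified in place, which points at the offending layer. A minimal way to switch it on while debugging (it slows training, so don't leave it enabled):

```python
import torch

# Enable autograd anomaly detection before training starts; the next failing
# backward() will also report where the modified tensor was produced.
torch.autograd.set_detect_anomaly(True)
```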

I'm using Colab to run your project, and I changed the batch size to 3.
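For context on the message itself: autograd saved the output of a ReLU for the backward pass, and a later operation wrote into that tensor in place, bumping its version counter from 0 to 1. A minimal PyTorch sketch (plain tensors, not FaRL code) that reproduces the same ReluBackward0 failure:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.relu(x)    # autograd saves the ReLU output for the backward pass
y.add_(1.0)          # in-place write bumps y's version counter (0 -> 1)
y.sum().backward()   # RuntimeError: ... output 0 of ReluBackward0, is at version 1
```

Writing the out-of-place form, y = y + 1.0, creates a new tensor and backpropagates fine; the fixes suggested below apply the same idea inside the model.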

btjhjeon commented 2 years ago

@Erfun76 How did you solve this problem? May I get your solution?

pcgreat commented 1 year ago

Some fixes are needed in farl/network/farl/model.py and in the mmseg package:

  1. In model.py, change every in-place ReLU to inplace=False.
  2. In /mnt/.conda/envs/py37/lib/python3.7/site-packages/mmseg/models/decode_heads/uper_head.py, ConvModule's inplace parameter defaults to True, which causes the problem. The fix is to explicitly pass inplace=False every time ConvModule is called.
  3. Still in uper_head.py, change += to its full form; more specifically, change laterals[i - 1] += ... to laterals[i - 1] = laterals[i - 1] + resize(...). A sketch of all three edits follows this list.
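Roughly what those three edits look like; the ConvModule call and the top-down loop below are paraphrased from mmseg's uper_head.py, so the surrounding variable names (l_conv, prev_shape, self.align_corners, ...) may differ slightly between mmseg versions:

```python
# 1) farl/network/farl/model.py: make every ReLU out-of-place
act = torch.nn.ReLU(inplace=False)          # was: torch.nn.ReLU(inplace=True)

# 2) mmseg/models/decode_heads/uper_head.py: ConvModule builds its activation
#    with inplace=True by default, so pass inplace=False at every call site
l_conv = ConvModule(
    in_channels,
    self.channels,
    1,
    conv_cfg=self.conv_cfg,
    norm_cfg=self.norm_cfg,
    act_cfg=self.act_cfg,
    inplace=False)                          # added

# 3) still in uper_head.py: replace the in-place += in the top-down path
#    was: laterals[i - 1] += resize(...)
laterals[i - 1] = laterals[i - 1] + resize(
    laterals[i],
    size=prev_shape,
    mode='bilinear',
    align_corners=self.align_corners)
```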
moimner commented 1 year ago

Thx, it works.