Closed. Erfun76 closed this issue 2 years ago.
@Erfun76 How did you solve this problem? Could you share your solution?
There need to be a few fixes in `farl/network/farl/model.py` and in the mmseg package:

1. In `model.py`, change all inplace ReLUs to `inplace=False`.
2. In `/mnt/.conda/envs/py37/lib/python3.7/site-packages/mmseg/models/decode_heads/uper_head.py`, `ConvModule`'s `inplace` parameter defaults to `True`, which causes the problem. So the fix is to explicitly set the `inplace` parameter to `False` each time `ConvModule` is called.
3. Still in `uper_head.py`, change `+=` to its full form. More specifically, change `laterals[i - 1] += ...` to `laterals[i - 1] = laterals[i - 1] + resize(...)` (a sketch follows below).
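For reference, a minimal, self-contained sketch of what changes 2 and 3 amount to, assuming mmcv is installed (the channel sizes are arbitrary and plain `F.interpolate` stands in for mmseg's `resize` wrapper; this is an illustration, not the actual `uper_head.py` code):

```python
import torch
import torch.nn.functional as F
from mmcv.cnn import ConvModule

# Fixes 1 and 2: pass inplace=False so the ReLU does not overwrite its input.
lateral_conv = ConvModule(8, 8, 3, padding=1, inplace=False)

# Two feature maps standing in for the FPN laterals (shapes are arbitrary).
laterals = [lateral_conv(torch.randn(1, 8, 56, 56)),
            lateral_conv(torch.randn(1, 8, 28, 28))]

# Fix 3: top-down fusion with an out-of-place addition.
for i in range(len(laterals) - 1, 0, -1):
    upsampled = F.interpolate(laterals[i], size=laterals[i - 1].shape[2:],
                              mode='bilinear', align_corners=False)
    # was: laterals[i - 1] += upsampled   (in-place, breaks backward)
    laterals[i - 1] = laterals[i - 1] + upsampled

laterals[0].sum().backward()  # backward now succeeds
```

With the original `+=`, the saved ReLU output is modified in place and the backward pass raises exactly the "output 0 of ReluBackward0 ... is at version 1" error from the traceback below.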
Thx, it works.
I tried to train face parsing using the following command, but I got an error:
```
python -m blueprint.run \
    farl/experiments/face_parsing/train_lapa_farl-b-ep16_448_refinebb.yaml \
    --exp_name farl --blob_root ./blob
```
```
====== RUNNING farl/experiments/face_parsing/train_lapa_farl-b-ep16_448_refinebb.yaml ======
blueprint: Parsing farl/experiments/face_parsing/train_lapa_farl-b-ep16_448_refinebb.yaml
DistributedGPURun: init_process_group: 0/1
blueprint: Parsing farl/experiments/face_parsing/./trainers/lapa_farl.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../augmenters/lapa/train.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../augmenters/lapa/test.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../augmenters/lapa/test_post.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../networks/farl.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../scorers/lapa.yaml
blueprint: Parsing farl/experiments/face_parsing/./trainers/../optimizers/refine_backbone.yaml
Mon Apr 25 11:21:11 2022 - farl_0 - outputs_dir: ./blob/outputs/farl/face_parsing.train_lapa_farl-b-ep16_448_refinebb
Mon Apr 25 11:21:11 2022 - farl_0 - states_dir: ./blob/states/farl/face_parsing.train_lapa_farl-b-ep16_448_refinebb
Mon Apr 25 11:21:11 2022 - farl_0 - locating the latest loadable state ...
Mon Apr 25 11:21:11 2022 - farl_0 - no valid state files found in ./blob/states/farl/face_parsing.train_lapa_farl-b-ep16_448_refinebb
Mon Apr 25 11:21:11 2022 - farl_0 - There will be 6056 training steps in this epoch.
loss=2.4654557704925537
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/blueprint/run.py", line 69, in <module>
    _main()
  File "/usr/local/lib/python3.7/dist-packages/blueprint/run.py", line 65, in _main
    runnable()
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/distributed.py", line 123, in __call__
    _single_thread_run, args=(num_gpus, self), nprocs=num_gpus, join=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/distributed.py", line 68, in _single_thread_run
    local_run()
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/trainer.py", line 194, in __call__
    self._backward(loss)
  File "/usr/local/lib/python3.7/dist-packages/blueprint/ml/trainer.py", line 120, in _backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 768, 28, 28]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
I'm using Colab to run your project and changed the batch size to 3.
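For anyone who wants to locate the offending in-place write before patching anything, the hint at the end of the traceback can be followed with `torch.autograd.set_detect_anomaly(True)`: anomaly mode prints the forward-pass traceback of the operation whose saved output was later modified, which narrows down where to look. A minimal, self-contained reproduction (the toy module below is a placeholder, not the FaRL network):

```python
import torch
import torch.nn as nn

# Toy stand-in that reproduces the same failure mode:
# the output saved by a ReLU is later modified in place.
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        y = torch.relu(self.conv(x))  # autograd saves this ReLU output for backward
        y += 1.0                      # in-place write bumps its version counter
        return y.mean()

net = ToyNet()
x = torch.randn(3, 3, 28, 28)

# This raises the same "modified by an inplace operation" RuntimeError, but with
# anomaly mode on it is preceded by the forward traceback of the ReLU above.
with torch.autograd.set_detect_anomaly(True):
    net(x).backward()
```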