DanielMing123 / OCCFusion

[IEEE TIV] OccFusion: Multi-Sensor Fusion Framework for 3D Semantic Occupancy Prediction
https://ieeexplore.ieee.org/document/10663967

Error during training #3

Open M0L4N opened 2 weeks ago

M0L4N commented 2 weeks ago

rank0: Traceback (most recent call last):
rank0:   File "tools/train.py", line 135, in
rank0:   File "tools/train.py", line 131, in main
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
rank0:     model = self.train_loop.run()  # type: ignore
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/mmengine/runner/loops.py", line 98, in run
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/mmengine/runner/loops.py", line 115, in run_epoch
rank0:     self.run_iter(idx, data_batch)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/mmengine/runner/loops.py", line 131, in run_iter
rank0:     outputs = self.runner.model.train_step(
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
rank0:     losses = self._run_forward(data, mode='loss')
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
rank0:     results = self(data, mode=mode)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
rank0:     else self._run_ddp_forward(*inputs, **kwargs)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
rank0:     return self.module(*inputs, **kwargs)  # type: ignore[index]
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/mmdet3d/models/segmentors/base.py", line 102, in forward
rank0:     return self.loss(inputs, data_samples)
rank0:   File "/home/zxt/OCCFusion/occfusion/main.py", line 144, in loss
rank0:     loss = dict(level0_loss = torch.nan_to_num(self.loss_fl(vox_fl_predict_lvl0,vox_fl_label_lvl0)) + \
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0:     return self._call_impl(*args, **kwargs)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0:     return forward_call(*args, **kwargs)
rank0:   File "/home/zxt/anaconda3/envs/OCCFusion/lib/python3.8/site-packages/focal_loss/focal_loss.py", line 77, in forward
rank0:     assert torch.all((x >= 0.0) & (x <= 1.0)), ValueError(
rank0: AssertionError: The predictions values should be between 0 and 1, make sure to pass the values to sigmoid for binary classification or softmax for multi-class classification

The only change I made was switching the backbone to ResNet-50. Could anyone help me, please?
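The assertion at the end of the traceback passes only if every prediction lies in [0, 1]. NaN fails it as well, because any comparison with NaN is false. A minimal diagnostic sketch in plain Python follows. The helper name `diagnose_predictions` is hypothetical and is not part of OccFusion or the `focal_loss` package. It assumes the predictions have first been flattened to a list for debugging, for example with `vox_fl_predict_lvl0.detach().float().flatten().tolist()`.

```python
import math
from typing import Dict, Iterable


def diagnose_predictions(values: Iterable[float]) -> Dict[str, int]:
    """Count how many values would violate a 0 <= x <= 1 check, and why.

    Hypothetical debugging helper; not part of any library in the traceback.
    """
    counts = {"nan": 0, "inf": 0, "below_zero": 0, "above_one": 0, "ok": 0}
    for v in values:
        if math.isnan(v):
            counts["nan"] += 1          # NaN fails both x >= 0 and x <= 1
        elif math.isinf(v):
            counts["inf"] += 1          # +inf or -inf
        elif v < 0.0:
            counts["below_zero"] += 1   # looks like a raw logit, not a probability
        elif v > 1.0:
            counts["above_one"] += 1
        else:
            counts["ok"] += 1
    return counts


# Toy input standing in for flattened predictions.
report = diagnose_predictions([0.2, 0.8, float("nan"), 1.5, -0.1, float("inf")])
print(report)
```

If the report shows only NaN entries, the network output has probably diverged before reaching the loss. Finite values outside [0, 1] would instead suggest that raw logits reached the loss without a softmax or sigmoid.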

DanielMing123 commented 1 week ago

Please change "load_from = 'ckpt/r101_dcn_fcos3d_pretrain.pth'" to "load_from = 'ckpt/resnet50-0676ba61.pth'" in the config file. You need to download resnet50-0676ba61.pth manually and put it in the ckpt folder.
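The manual download step could be scripted roughly as follows. This is a sketch under an assumption: the file name matches the ResNet-50 ImageNet checkpoint published by torchvision, so the URL below points there, but the thread does not confirm which source the maintainer intends. The helper name `ensure_checkpoint` is hypothetical.

```python
import os
import urllib.request

# Assumption: torchvision's ResNet-50 checkpoint, whose file name matches
# the one mentioned above. The thread itself does not give a download URL.
RESNET50_URL = "https://download.pytorch.org/models/resnet50-0676ba61.pth"


def ensure_checkpoint(url: str = RESNET50_URL, ckpt_dir: str = "ckpt") -> str:
    """Return the local checkpoint path, downloading the file only if missing."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, os.path.basename(url))
    if not os.path.isfile(path):
        # Only reached when the file is not already in ckpt_dir.
        urllib.request.urlretrieve(url, path)
    return path
```

Calling `ensure_checkpoint()` once from the repository root should leave the file at `ckpt/resnet50-0676ba61.pth`, which is the path that `load_from` refers to.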

M0L4N commented 1 week ago

Thank you for your reply, but the error still occurred after I changed it. Is there any other information I can provide?

DanielMing123 commented 1 week ago

Did you encounter this error at the very beginning of the training?

M0L4N commented 1 week ago

It always occurs during training.

DanielMing123 commented 1 week ago

This looks like a numerical stability problem. You can try using mmdet3d's focal loss or turning off the AMP setting in train.py.
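One way reduced precision can lead to this kind of failure is illustrated below. This is only a sketch of a general fp16 effect, not a confirmed diagnosis of this issue. It uses the standard-library `struct` module to round values to IEEE 754 half precision, the format AMP commonly uses on GPUs. The helper name `to_fp16` is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# A very small probability underflows: the smallest positive fp16 value
# is about 6e-8, so 1e-8 rounds to exactly zero.
tiny = to_fp16(1e-8)

# A probability very close to 1 saturates: the largest fp16 value below 1
# is about 0.99951, and 0.9999 is nearer to 1.0 than to that value.
near_one = to_fp16(0.9999)

print(tiny)            # probability after rounding to fp16
print(near_one)        # probability after rounding to fp16
print(1.0 - near_one)  # the (1 - p) term a focal loss would use
```

Once a probability is exactly 0 or 1, a term such as log(p) or log(1 - p) becomes infinite, and a product like 0 * inf then yields NaN, which fails the range assertion in the traceback. Computing the loss in fp32, or disabling AMP as suggested, avoids this particular rounding.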

M0L4N commented 1 week ago

Thanks for your advice. I'll try it.