NVlabs / mask-auto-labeler


Training issues in docker #7

Closed tianyufang1958 closed 1 year ago

tianyufang1958 commented 1 year ago

Hello, I would really appreciate it if you could guide me through running the training on my own dataset. I have changed all of coco.py to a single class and ran the following steps. Could you please advise whether I ran them in the correct order, and also help with the error I got?

Firstly: `python main.py --gpus 2 --max_epochs 300` ran successfully and produced model files.

Secondly: `python main.py --resume work_dirs/coco/last.ckpt --label_dump_path pseudo_labels/labels --not_eval_mask --val_only`. I am not sure if I need `--box_inputs` in this step. I got two files, `labels` and `labels.result`. I found that `labels` contains all the JSON, so I renamed it to `labels.json` and used its path in the config file for the next step.

```python
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file='../pseudo_labels/labels.json',
        img_prefix=data_root + 'train/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val.json',
        img_prefix=data_root + 'val/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file='data/coco_plot_data/annotations/instances_val.json',
        img_prefix='data/coco_plot_data/val/',
        pipeline=test_pipeline))
```
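For reference, this is a quick sanity check I can run on the dumped file before pointing the config at it (a minimal sketch only, assuming the dump is a standard COCO-style annotation dict with `images`/`annotations`/`categories` keys):

```python
# Sketch: verify the MAL pseudo-label dump looks like a COCO annotation file,
# then copy it to labels.json so mmdet's CocoDataset can read it.
# Assumption: the dump at pseudo_labels/labels is a COCO-style JSON dict.
import json
import shutil

with open("pseudo_labels/labels") as f:
    anns = json.load(f)

print("images:", len(anns["images"]))
print("annotations:", len(anns["annotations"]))
print("categories:", [c["name"] for c in anns["categories"]])

# Copy under the name that ann_file in the mmdet config above expects.
shutil.copy("pseudo_labels/labels", "pseudo_labels/labels.json")
```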

Thirdly: `bash tools/dist_train.sh configs/MALMask/solov2_r50_fpn_3x_coco_mal.py`, but I got the following error. I also tried a single GPU with `python train.py`, but got the same error. I am a bit concerned about this error and I didn't find solutions online.

```
2023-04-04 13:35:42,492 - mmdet - INFO - workflow: [('train', 1)], max: 36 epochs
2023-04-04 13:35:42,492 - mmdet - INFO - Checkpoints will be saved to /workspace/mal_vol/mask-auto-labeler/mmdet/work_dirs/solov2_r50_fpn_3x_coco_mal by HardDiskBackend.
/opt/conda/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "tools/train.py", line 244, in <module>
    main()
  File "tools/train.py", line 233, in main
    train_detector(
  File "/opt/conda/lib/python3.8/site-packages/mmdet/apis/train.py", line 244, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/opt/conda/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/detectors/single_stage_instance_seg.py", line 127, in forward_train
    mask_loss = self.mask_head.forward_train(
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/dense_heads/base_mask_head.py", line 62, in forward_train
    loss = self.loss(
  File "/opt/conda/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 208, in new_func
    return old_func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/dense_heads/solov2_head.py", line 578, in loss
    loss_cls = self.loss_cls(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 233, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/opt/conda/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 139, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target.contiguous(), gamma,
  File "/opt/conda/lib/python3.8/site-packages/mmcv/ops/focal_loss.py", line 59, in forward
    ext_module.sigmoid_focal_loss_forward(
RuntimeError: sigmoid_focal_loss_forward_impl: implementation for device cuda:0 not found.
```

voidrank commented 1 year ago

Oh... I think you have to rebuild the docker image. MMCV has its own compiled CUDA operators, so if your driver version doesn't match my original driver version, it will fail.
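One quick way to check inside the container whether MMCV's compiled CUDA ops actually load and run is a minimal sketch like this (assuming mmcv-full 1.x and a visible GPU):

```python
# Minimal sketch: confirm MMCV's compiled CUDA extension is usable.
# Assumes mmcv-full 1.x is installed and a CUDA GPU is visible.
import torch
from mmcv.utils import collect_env
from mmcv.ops import sigmoid_focal_loss

# Print the CUDA/PyTorch/MMCV build info to compare against the host driver.
for name, value in collect_env().items():
    print(f"{name}: {value}")

# Run the same op that fails in the traceback on a tiny dummy batch.
pred = torch.randn(4, 3, device="cuda", requires_grad=True)
target = torch.tensor([0, 1, 2, 1], device="cuda")
loss = sigmoid_focal_loss(pred, target, gamma=2.0, alpha=0.25)
print("sigmoid_focal_loss OK:", loss.item())
```

If this raises the same "implementation for device cuda:0 not found" error, the installed mmcv-full build does not match the CUDA/PyTorch in the container, and rebuilding the image (or reinstalling mmcv-full against the right CUDA version) is the fix.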

tianyufang1958 commented 1 year ago

Thanks for the reply. It is a pain to rebuild the docker image as there might be many version issues (g++, gcc, CUDA toolkit...), but I will have a go. Do you know which NVIDIA driver you used? I also have other questions and would appreciate it if you could reply to those as well. Thanks!

tianyufang1958 commented 1 year ago

> Oh... I think you have to rebuild the docker image. MMCV has its own compiled CUDA operators, so if your driver version doesn't match my original driver version, it will fail.

In your docker image, `nvcc --version` reports 11.5, which is different from my Ubuntu system. But shouldn't that not matter if I run the code inside the docker image?

voidrank commented 1 year ago

You can rebuild the docker on your side. That solves everything.

voidrank commented 1 year ago

@tianyufang1958 is the issue resolved?

tianyufang1958 commented 1 year ago

@voidrank Thanks for asking. I modified the Dockerfile as below and it built successfully without compatibility issues. I have one more question.

Do I just need to generate the pseudo labels for the training dataset, or for both the training and val datasets? Based on your config file, pseudo labels were only applied to the training data, but please confirm.

```dockerfile
FROM nvcr.io/nvidia/pytorch:21.12-py3
RUN apt-get update
RUN apt-get install -y htop vim tmux gcc g++ psmisc iputils-ping
RUN apt-get install -y libgl1 zip
RUN pip install opencv-python==4.5.5.64
RUN pip install scipy==1.6.3 pytorch_lightning==1.8.1 torchmetrics==0.10.2 mmcv-full==1.7.0 mmdet==2.25.3 mmcls==0.24.1
RUN pip install einops==0.6.0 shapely
RUN pip install git+https://github.com/lvis-dataset/lvis-api.git
RUN python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
```
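To double-check that the pinned versions resolved as expected inside the rebuilt image, I run something like this sketch:

```python
# Sketch: print the versions of the packages pinned in the Dockerfile above,
# plus the CUDA version PyTorch was built with, inside the rebuilt container.
import torch
import mmcv
import mmdet
import mmcls
import pytorch_lightning as pl

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("mmcv-full:", mmcv.__version__)
print("mmdet:", mmdet.__version__)
print("mmcls:", mmcls.__version__)
print("pytorch_lightning:", pl.__version__)
```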

tianyufang1958 commented 1 year ago

Also, I don't quite understand this error during the training process: 'ERROR - The testing results of the whole dataset is empty.' Could you please explain?

```
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 58/57, 3.9 task/s, elapsed: 15s, ETA: 0s
2023-04-07 12:15:13,802 - mmdet - INFO - Evaluating bbox...
Loading and preparing results...
2023-04-07 12:15:13,802 - mmdet - ERROR - The testing results of the whole dataset is empty.
2023-04-07 12:15:13,804 - mmdet - INFO - Epoch(val) [1][29]
2023-04-07 12:15:16,035 - mmdet - INFO - Saving checkpoint at 2 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 58/57, 17.3 task/s, elapsed: 3s, ETA: 0s
2023-04-07 12:15:19,753 - mmdet - INFO - Evaluating bbox...
Loading and preparing results...
2023-04-07 12:15:19,753 - mmdet - ERROR - The testing results of the whole dataset is empty.
2023-04-07 12:15:19,755 - mmdet - INFO - Epoch(val) [2][29]
2023-04-07 12:15:21,960 - mmdet - INFO - Saving checkpoint at 3 epochs
```

voidrank commented 1 year ago

Because your test dataset is empty

tianyufang1958 commented 1 year ago

> Because your test dataset is empty

@voidrank Does it affect the training? I think I set the test dataset to the same path as the validation dataset.

tianyufang1958 commented 1 year ago

> Because your test dataset is empty

@voidrank Another question: do I need to generate pseudo labels for both the training and validation datasets, or just the training dataset? They are in COCO format in separate folders and each has a JSON file.

voidrank commented 1 year ago
  1. It doesn't matter if you train it correctly. Validation during training is only used as an early check that the training recipe is correct.
  2. That depends on whether you have validation masks for the second-stage validation. Usually auto-labeling the training labels is enough.

tianyufang1958 commented 1 year ago

> 1. It doesn't matter if you train it correctly. Validation during training is only used as an early check that the training recipe is correct.
> 2. That depends on whether you have validation masks for the second-stage validation. Usually auto-labeling the training labels is enough.

@voidrank For 2, no, I don't have GT masks for the validation dataset. All I have is bounding boxes for both the training and validation datasets. So I guess I will need to generate pseudo masks for both train and val?

voidrank commented 1 year ago

If you don't have any GT masks for the validation set, validating masks against pseudo labels makes less sense to me. However, you can still generate pseudo labels for val to debug.

tianyufang1958 commented 1 year ago

> If you don't have any GT masks for the validation set, validating masks against pseudo labels makes less sense to me. However, you can still generate pseudo labels for val to debug.

@voidrank My understanding is: first use the whole image dataset with bounding boxes to train MAL, then run inference with the model on the whole dataset to generate the pseudo masks. After that, the dataset can be split into training and validation sets, e.g. 80% and 20%, for the phase-2 training. Could you please confirm this is correct? Concretely, I imagine something like the sketch below.
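This is only a sketch of what I mean; the file names and the 80/20 ratio are placeholders:

```python
# Sketch: split the MAL pseudo-label file (COCO format) into train/val subsets
# for the phase-2 instance-segmentation training.
# Assumptions: pseudo_labels/labels.json exists; split is image-level, 80/20.
import json
import random

with open("pseudo_labels/labels.json") as f:
    coco = json.load(f)

images = coco["images"]
random.seed(0)
random.shuffle(images)
cut = int(0.8 * len(images))
splits = {"labels_train.json": images[:cut], "labels_val.json": images[cut:]}

for name, imgs in splits.items():
    keep = {im["id"] for im in imgs}
    subset = {
        "images": imgs,
        "annotations": [a for a in coco["annotations"] if a["image_id"] in keep],
        "categories": coco["categories"],
    }
    with open(f"pseudo_labels/{name}", "w") as f:
        json.dump(subset, f)
```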

voidrank commented 1 year ago

That sounds like a reasonable plan if you believe the MAL labels are 100% accurate, which is almost true in most cases.

tianyufang1958 commented 1 year ago

> Because your test dataset is empty

I can confirm this was due to a large learning rate. I reduced it to 0.001 and it runs smoothly.
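For anyone hitting the same thing, the only change I made was the optimizer learning rate in the phase-2 config, roughly like this (a sketch; the other SGD values shown are the usual mmdet defaults, not something I changed):

```python
# In configs/MALMask/solov2_r50_fpn_3x_coco_mal.py (sketch):
# lowering lr to 0.001 stopped the empty-prediction evaluations for me.
optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001)
```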