Detection tutorial failed to converge on another dataset

ephemeral1m commented 1 year ago

Hi, I used another dataset with spacing [0.18, 0.18, 0.18]. I changed the intensity_transform parameters and the config parameters (including anchors), and made my dataset_fold[i].json accordingly. I followed all the instructions for training on LUNA16, including the same resampling step. In addition, I added some code to visualize the patches right before they enter the network.

I ran the training script provided in the tutorial on both LUNA16 and the other dataset. LUNA16 converged well, but the other dataset failed to converge. Checking the patches against the GT boxes with the visualization code mentioned above, both datasets look fine right before they enter the network.

The config_train.json file I used for the other dataset is provided below. Most of the GT boxes are around (3mm, 3mm, 3mm). Since I didn't find any notes about the unit of "base_anchor_shapes", I also tried [[2, 2, 2], [4, 4, 4], [8, 8, 8]], but neither one worked, unfortunately:

{
    "gt_box_mode": "cccwhd",
    "lr": 1e-2,
    "spacing": [0.18, 0.18, 0.18],
    "batch_size": 4,
    "patch_size": [128,128,128],
    "val_patch_size": [256,256,256],
    "fg_labels": [0],
    "n_input_channels": 1,
    "spatial_dims": 3,
    "score_thresh": 0.02,
    "nms_thresh": 0.22,
    "returned_layers": [1,2],
    "conv1_t_stride": [2,2,2],
    "base_anchor_shapes": [[16,16,16],[45,45,45],[100,100,100]],
    "balanced_sampler_pos_fraction": 0.3
}

Excerpt of the results on the other dataset:

144/150, train_loss: 0.6934
145/150, train_loss: 0.4031
146/150, train_loss: 0.4272
147/150, train_loss: 0.5918
148/150, train_loss: 0.5977
149/150, train_loss: 0.3677
150/150, train_loss: 0.3833
Training time: 405.1036174297333s
epoch 20 average loss: 0.5507
saved last model
Validation time: 531.1751585006714s
2022-11-10 17:02:25,108 - Start COCO metric computation...
2022-11-10 17:02:25,122 - Statistics for COCO metrics finished (t=0.01s).
2022-11-10 17:02:25,122 - COCO metrics computed in t=0.01s.
{'mAP_IoU_0.10_0.50_0.05_MaxDet_100': 0.0,
 'nodule_mAP_IoU_0.10_0.50_0.05_MaxDet_100': 0.0,
 'AP_IoU_0.10_MaxDet_100': 0.0,
 'nodule_AP_IoU_0.10_MaxDet_100': 0.0,
 'mAR_IoU_0.10_0.50_0.05_MaxDet_100': 0.0,
 'nodule_mAR_IoU_0.10_0.50_0.05_MaxDet_100': 0.0,
 'AR_IoU_0.10_MaxDet_100': 0.0,
 'nodule_AR_IoU_0.10_MaxDet_100': 0.0}
current epoch: 20 current metric: 0.0000 best metric: 0.0108 at epoch 5

Can-Zhao commented 1 year ago

Thank you for reaching out. The unit of base_anchor_shapes is pixel/voxel. So if your spacing is 0.18mm and the nodule size is 3mm, the nodule size would be 16.67 voxels, and base_anchor_shapes should be around [16, 16, 16]. So maybe set it to "base_anchor_shapes": [[16,16,16],[14,14,14],[18,18,18]] (just an example). Please also include your smallest nodule size in "base_anchor_shapes".
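The mm-to-voxel conversion described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name anchor_shapes_in_voxels is made up for this example, and the 2mm and 5mm sizes are hypothetical stand-ins for a dataset's smallest and largest nodules (only the 3mm size and the 0.18mm spacing come from the thread).

```python
def anchor_shapes_in_voxels(sizes_mm, spacing_mm):
    """Convert physical box sizes (mm) into anchor shapes in voxels.

    Hypothetical helper for illustration; it is not part of the tutorial.
    Each size is divided by the per-axis spacing and rounded to an integer.
    """
    shapes = []
    for size in sizes_mm:
        shapes.append([round(size / s) for s in spacing_mm])
    return shapes


spacing = [0.18, 0.18, 0.18]  # from the config in this thread
# 3.0 mm is the typical GT size reported above; 2.0 and 5.0 are assumed
# placeholders for the smallest and largest nodules in a dataset.
sizes_mm = [2.0, 3.0, 5.0]

base_anchor_shapes = anchor_shapes_in_voxels(sizes_mm, spacing)
print(base_anchor_shapes)
```

The resulting list could then be pasted into "base_anchor_shapes" in config_train.json, replacing the assumed sizes with the real size range of the dataset.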

Also, you could set debug=True in ./luna16_training.py as:

detector = RetinaNetDetector(
    network=net, anchor_generator=anchor_generator, debug=True
).to(device)

This way you can see how many positive/negative samples are used for training in each batch. The log message will also give some guidance on tuning hyperparameters.
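For intuition on why anchor size affects the number of positive samples, here is a rough sketch that computes the best-case IoU between a cubic GT box and a cubic anchor sharing the same center. It is an assumption-laden simplification: the function concentric_cube_iou is hypothetical, real anchors are rarely perfectly centered on a nodule, and the matcher used by the tutorial may not rely on a plain IoU threshold. It only shows how quickly the overlap drops when the anchor side is far from the GT side.

```python
def concentric_cube_iou(gt_side, anchor_side):
    """IoU of two axis-aligned cubes that share the same center.

    Hypothetical helper for illustration. The smaller cube lies entirely
    inside the larger one, so the intersection is the smaller volume.
    """
    inter = min(gt_side, anchor_side) ** 3
    union = gt_side ** 3 + anchor_side ** 3 - inter
    return inter / union


# 3 mm nodule at 0.18 mm spacing, as discussed in this thread (about 16.67 voxels)
gt_side = 3.0 / 0.18

# Anchor sides taken from the two configurations tried in the original post
for anchor_side in (16, 45, 100, 2, 4, 8):
    iou = concentric_cube_iou(gt_side, anchor_side)
    print(f"anchor side {anchor_side:>3} voxels: best-case IoU {iou:.3f}")
```

Anchors whose side is several times larger or smaller than the GT box overlap it very little even in this best case, which is consistent with the advice to keep base_anchor_shapes close to the actual nodule size in voxels.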

Sere1nz commented 10 months ago

@Can-Zhao Hi, the default resampled spacing for detection is Spacingd(keys=["image", "label"], pixdim=[0.703125, 0.703125, 1.25]). I wonder how [0.703125, 0.703125, 1.25] was chosen?

Also, I'm doing a nodule segmentation task. Should I also resample to [0.703125, 0.703125, 1.25], or to another pixdim, like [1.5, 1.5, 2] in the spleen task? In either case, how should I choose the patch size (e.g. (64,64,32) or (96,96,96)) based on the spacing? I am so confused. Thank you very much.

Can-Zhao commented 10 months ago

The default spacing [0.703125, 0.703125, 1.25] was taken from nnDetection.
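On the patch-size part of the question, one rough sanity check (a sketch, not guidance from the maintainers) is to convert a candidate patch size into its physical extent by multiplying the number of voxels by the spacing along each axis. The helper patch_extent_mm is hypothetical, and pairing each patch size with a particular spacing below is only for illustration.

```python
def patch_extent_mm(patch_size, spacing_mm):
    """Physical size (mm) covered by a patch: voxels times spacing per axis.

    Hypothetical helper for illustration only.
    """
    return [n * s for n, s in zip(patch_size, spacing_mm)]


# Candidate patch sizes and spacings mentioned in the question above;
# which patch goes with which spacing is an assumption made for this example.
print(patch_extent_mm((64, 64, 32), (0.703125, 0.703125, 1.25)))
print(patch_extent_mm((96, 96, 96), (1.5, 1.5, 2.0)))
```

Comparing these extents with the physical size of the structure of interest gives a feel for how much surrounding context each patch provides at a given spacing.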

Project-MONAI / tutorials

Detection tutorial failed to converge on another dataset #1033