AmirMansurian / AttnFD

Attention-guided Feature Distillation for Semantic Segmentation

Cityscapes Training #4

Open JanBlume opened 1 month ago

JanBlume commented 1 month ago

Hello Amir,

thanks for your great work.

I am currently trying to reproduce the Cityscapes experiments in my own framework. I used your code and your paper as a template for the transformation steps because they were well described.

Unfortunately, my results still differ significantly from yours, so I'm still looking for a bug in my code.

Maybe you have some time and can answer a few questions so that I can find my bug faster. That would be wonderful!

  1. Did you also use Nesterov momentum for the Cityscapes experiments?
  2. What crop size was used? The paper says 512 × 1024, but the RandomScaleCrop function can only crop squares, can't it? Was it trained with different code, or was a crop size like 1024 × 1024 used?
  3. What value was used for "base_size": 1024 (the short side of the image) or 513 (the default value)?
  4. In general, it would be helpful to know what your training command for Cityscapes looked like: “python train.py --backbone resnet18 --dataset cityscapes ???”
  5. Was RandomGaussianBlur used during training for Cityscapes?

Best regards, Jan

AmirMansurian commented 1 month ago

Hi

Thank you for reaching out. Unfortunately, the codebase currently works only on Pascal. I am planning to update it very soon. Anyway, regarding your questions, please find the answers below:

1- Yes, I believe so.

2- About the crop size: I had hardware limitations, so I tried 1024×512 instead of 1024×1024. I hard-coded this in dataloaders/__init__.py and set the crop size to [512, 1024] (a rough sketch of what that change could look like follows after these answers). This is actually why I haven't uploaded the code yet; it is not clean, and I want to tidy it up before committing the new changes.

3- I believe this parameter is not needed for Cityscapes. Please have a look at dataloaders/datasets/cityscapes.py.

4- It should be something like this: python train_kd.py --backbone resnet18 --dataset cityscapes --nesterov --epochs 50 --batch-size 6 --attn_lambda 15

5- I don't think so. Again, please have a look at dataloaders/datasets/cityscapes.py.
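
As a reference for the hard-coded change mentioned in answer 2, here is a minimal sketch of a rectangular-crop variant of RandomScaleCrop. The class name, the (height, width) convention for crop_size, and the structure are my assumptions based on the square version in the repo, not the exact code I used:

import random

from PIL import Image, ImageOps

class RandomScaleCropRect(object):
    # Like the repo's RandomScaleCrop, but crops a (height, width)
    # rectangle, e.g. crop_size=[512, 1024] for Cityscapes.
    def __init__(self, base_size, crop_size, fill=255):
        self.base_size = base_size
        self.crop_h, self.crop_w = crop_size
        self.fill = fill

    def __call__(self, sample):
        img, mask = sample['image'], sample['label']
        # Random scale: pick a new short side, as in the original transform.
        short_size = random.randint(int(self.base_size * 0.5), int(self.base_size * 2.0))
        w, h = img.size
        if h > w:
            ow, oh = short_size, int(1.0 * h * short_size / w)
        else:
            oh, ow = short_size, int(1.0 * w * short_size / h)
        img = img.resize((ow, oh), Image.BILINEAR)
        mask = mask.resize((ow, oh), Image.NEAREST)
        # Pad (image with 0, mask with the ignore index) if the scaled
        # image is smaller than the crop in either dimension.
        padh = max(self.crop_h - oh, 0)
        padw = max(self.crop_w - ow, 0)
        if padh > 0 or padw > 0:
            img = ImageOps.expand(img, border=(0, 0, padw, padh), fill=0)
            mask = ImageOps.expand(mask, border=(0, 0, padw, padh), fill=self.fill)
        # Random rectangular crop.
        w, h = img.size
        x1 = random.randint(0, w - self.crop_w)
        y1 = random.randint(0, h - self.crop_h)
        img = img.crop((x1, y1, x1 + self.crop_w, y1 + self.crop_h))
        mask = mask.crop((x1, y1, x1 + self.crop_w, y1 + self.crop_h))
        return {'image': img, 'label': mask}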

I will update the repo soon; it will be ready to test on Cityscapes, and the pretrained teacher for Cityscapes will be available. Your final result depends heavily on your teacher weights, and minor differences may sometimes be observed depending on the GPU you are using. I myself used a single RTX 3090 for the runs.

Best, Amir

JanBlume commented 1 month ago

Hello Amir, thank you for your fast response. In the meantime, I was able to significantly improve my accuracy and mIoU, and I noticed a few things that I'd like to share with you.

base_size

I found that the base_size is indeed used in dataloaders/datasets/cityscapes.py:

def transform_tr(self, sample):
    composed_transforms = transforms.Compose([
        tr.RandomHorizontalFlip(),
        tr.RandomScaleCrop(base_size=self.args.base_size, crop_size=self.args.crop_size, fill=255),
        tr.RandomGaussianBlur(),
        tr.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        tr.ToTensor()])

If no base_size is specified during training, it defaults to 513. The problem is that Cityscapes images have a resolution of 1024×2048 pixels and would require a base_size of 1024, not 513. The consequence can be seen in the code of RandomScaleCrop:

short_size = random.randint(int(self.base_size * 0.5), int(self.base_size * 2.0))

If the base_size is only about half the actual short side, you effectively scale between 0.25 and 1 rather than between 0.5 and 2 as stated in the paper. Changing the scaling in my framework to (0.25, 1) improved my accuracy by more than 3%.
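
To make the numbers concrete, here is a quick standalone check of the effective scale range relative to the 1024 px short side of a Cityscapes image (the function name is just for illustration):

def effective_scale_range(base_size, image_short_side=1024):
    # RandomScaleCrop samples the new short side from
    # [0.5 * base_size, 2.0 * base_size]; dividing by the real short
    # side gives the effective scale factors.
    lo = int(base_size * 0.5) / image_short_side
    hi = int(base_size * 2.0) / image_short_side
    return lo, hi

print(effective_scale_range(513))   # (0.25, ~1.0) -> what the default actually does
print(effective_scale_range(1024))  # (0.5, 2.0)   -> what the paper describes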

RandomGaussianBlur

It was indeed used (see the transform_tr snippet above).

Unwanted constant in CrossEntropyLoss

Your calculation of the cross-entropy loss looks like this:

def CrossEntropyLoss(self, logit, target):
    n, c, h, w = logit.size()
    criterion = nn.CrossEntropyLoss(weight=self.weight, ignore_index=self.ignore_index,
                                    size_average=self.size_average)
    if self.cuda:
        criterion = criterion.cuda()
    loss = criterion(logit, target.long())

    if self.batch_average:
        loss /= n

    return loss

If I'm not mistaken, nn.CrossEntropyLoss already averages over the batch, so the "loss /= n" averages a second time. In the gradient calculation this introduces an extra factor of 1/n (1/4 for batch_size = 4), which has the same effect as reducing the learning rate to 1/4 of its original value. The second problem is that the loss then depends on the batch size.
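
A minimal sketch of a fix, assuming the intended behavior is the standard mean reduction (I also replaced the deprecated size_average argument with reduction='mean'; this mirrors your method and is not a tested patch):

def CrossEntropyLoss(self, logit, target):
    # With reduction='mean' (the default), nn.CrossEntropyLoss already
    # averages over all non-ignored pixels in the batch, so the extra
    # division by the batch size is dropped.
    criterion = nn.CrossEntropyLoss(weight=self.weight,
                                    ignore_index=self.ignore_index,
                                    reduction='mean')
    if self.cuda:
        criterion = criterion.cuda()
    return criterion(logit, target.long())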

Thanks again for your fast response and your work. It was very helpful to me! If my comments are incorrect, please feel free to correct me.

Best, Jan