Alibaba-MIIL / ASL

Official PyTorch implementation of the paper "Asymmetric Loss For Multi-Label Classification" (ICCV 2021)
MIT License

Questions on reproducing the reported results on MS COCO #30

Closed: shuijundadoudou closed this issue 3 years ago

shuijundadoudou commented 3 years ago

Hi,

First, thank you for sharing the exciting work.

I was trying to reproduce the results on the MS COCO dataset with my own training framework. I first used cross entropy, i.e. loss_function=AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0), to establish the baseline. This gave an mAP of ~82.5% (with a ResNet101 backbone), which is quite similar to the result reported in Fig. 8 of the paper.

Then I replaced the loss function with loss_function=AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05), keeping all other hyperparameters the same. However, I only got an mAP of ~82.1%.

Also, the traditional focal loss, loss_function=AsymmetricLoss(gamma_neg=2, gamma_pos=2, clip=0), could not outperform the baseline (~82.5%) under the same configuration. I wonder what is wrong with my training process.
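
For clarity, these are the three loss configurations I compared side by side (the import path is my assumption of this repo's layout):

from src.loss_functions.losses import AsymmetricLoss  # import path assumed from this repo's layout

ce_loss    = AsymmetricLoss(gamma_neg=0, gamma_pos=0, clip=0)     # cross entropy baseline, ~82.5% mAP
focal_loss = AsymmetricLoss(gamma_neg=2, gamma_pos=2, clip=0)     # symmetric focal loss
asl_loss   = AsymmetricLoss(gamma_neg=4, gamma_pos=1, clip=0.05)  # asymmetric loss, ~82.1% mAP in my runs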

Could you please share some training tricks? For example, a snippet of code for adjusting the learning rate, training transforms similar to those used for validation here, etc. Or do you have any other suggestions?

Thank you.

mrT23 commented 3 years ago

We honestly haven't encountered any case where ASL did not easily outperform cross entropy.

Here are some training tricks we used (they are quite standard and can also be found in public repositories like this one); see if anything differs from your framework (a rough code sketch of these pieces follows below):

  • for the learning rate, we use a one-cycle policy (warmup + cosine decay) with the Adam optimizer and a max learning rate of ~2e-4 to 4e-4
  • it is very important to also use EMA
  • true weight decay of 1e-4 ("true" == no weight decay for batch norm and bias)
  • we have our own augmentation package, but it is important to use at least standard AutoAugment
  • Cutout of 0.5 (very important)
  • squish resizing, not crop (important)
  • try replacing ResNet with TResNet; it will give you the same GPU speed with higher accuracy

That's what I can think of off the top of my head.
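
To make the list concrete, here is a rough, minimal sketch of how these pieces could be wired up with standard PyTorch/torchvision components. This is not our training code: the backbone is a toy stand-in, Cutout is approximated with RandomErasing, and all numbers are illustrative.

import copy
import torch
import torch.nn as nn
from torchvision import transforms

# augmentations: squish resize + AutoAugment + a Cutout-style erasing with p=0.5
train_transform = transforms.Compose([
    transforms.Resize((448, 448)),                                   # squish, not crop
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),                                 # stand-in for Cutout(0.5)
])

# "true" weight decay: no decay on biases or 1-D params (e.g. batch norm)
def add_weight_decay(model, weight_decay=1e-4):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim <= 1 or name.endswith(".bias") else decay).append(p)
    return [{"params": no_decay, "weight_decay": 0.0},
            {"params": decay, "weight_decay": weight_decay}]

# toy backbone so the sketch runs end to end; use ResNet/TResNet in practice
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 80))

epochs, steps_per_epoch = 40, 1000
optimizer = torch.optim.Adam(add_weight_decay(model, 1e-4), lr=2e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(                     # one-cycle: warmup + cosine decay
    optimizer, max_lr=2e-4, epochs=epochs,
    steps_per_epoch=steps_per_epoch, pct_start=0.2, anneal_strategy="cos")

# simple EMA of the weights; evaluate ema_model instead of model
ema_model = copy.deepcopy(model)
ema_decay = 0.9997

@torch.no_grad()
def update_ema():
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)

# per training step: loss.backward(); optimizer.step(); scheduler.step(); update_ema()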

shuijundadoudou commented 3 years ago

We honestly haven't encountered any case where ASL did not easily outperform cross entropy.

Some training tricks (most of them are quite standard); see if anything differs from your framework:

  • for the learning rate, we use a one-cycle policy (warmup + cosine decay) with the Adam optimizer and a max learning rate of ~2e-4 to 4e-4
  • it is very important to also use EMA
  • true weight decay of 1e-4 ("true" == no weight decay for batch norm and bias)
  • we have our own augmentation package, but it is important to use at least standard AutoAugment
  • Cutout of 0.5 (very important)
  • squish resizing, not crop (important)
  • try replacing ResNet with TResNet; it will give you the same GPU speed with higher accuracy

That's what I can think of off the top of my head.

Thank you for the information.

Since I tried cross entropy and the proposed ASL with the same training configuration -- using only standard data augmentations -- I would expect ASL to produce at least slightly better results. As for these training tricks, I do believe they have the potential to improve performance (for both cross entropy and ASL).

If I apply these tricks to ASL training, I also need to apply them to traditional cross entropy training to verify that the performance improvement comes from ASL (instead of from these tricks). So I am wondering whether the experiments (with the same backbone) that produced the results in Fig. 8 used all of these tricks (e.g. EMA, AutoAugment)? Also, from my understanding, both cross entropy and ASL in Fig. 8 are initialized from the corresponding models (ResNet/TResNet) pretrained on ImageNet, right?

Besides, could you please also specify some more details for the following hyperparameters, so that I do not have to try them all:

Thank you!

mrT23 commented 3 years ago

I think that our approaches ("philosophies" :) ) to deep learning are a bit different.

"training tricks" is a bit underwhelming name for the most important thing in deep learning. they are not extra methods your should choose whether to use or not. they are the essence, the bread-and-butter. i would be very proud if someone categorizes ASL as a good training trick.

Training without proper augmentations, for example, is unacceptable in my opinion; the training would quickly overfit, no matter what your loss function is. EMA is also essential in any modern deep learning scheme: it usually outscores the regular model by ~1% and generalizes better. When your baseline training tricks are not well calibrated, any additional trick, even an effective one, will not give its full impact. Without augmentations to prevent overfitting, no loss function will save you.

My answers to your questions:

All of the training tricks I mentioned are used in this repository: https://github.com/rwightman/pytorch-image-models and I recommend reviewing it.

All the best, Tal

SonDaoDuy commented 3 years ago

Hello @shuijundadoudou, can you share the training setup you used to get ~82.5 mAP with the cross-entropy loss? For example, the optimizer, learning rate, augmentation, and any training tricks?

wuhy08 commented 3 years ago

We honestly haven't encountered any case where ASL did not easily outperform cross entropy.

Some training tricks (most of them are quite standard); see if anything differs from your framework:

  • for the learning rate, we use a one-cycle policy (warmup + cosine decay) with the Adam optimizer and a max learning rate of ~2e-4 to 4e-4
  • it is very important to also use EMA
  • true weight decay of 1e-4 ("true" == no weight decay for batch norm and bias)
  • we have our own augmentation package, but it is important to use at least standard AutoAugment
  • Cutout of 0.5 (very important)
  • squish resizing, not crop (important)
  • try replacing ResNet with TResNet; it will give you the same GPU speed with higher accuracy

That's what I can think of off the top of my head.

Hi @mrT23, thank you for the details.

When you say "squish resizing", do you mean unlocking the aspect ratio during resizing so that objects are stretched?

What is the reason behind that? Is that because we could potentially crop some objects out of the image?

Since the COCO dataset has positional info (bounding boxes, masks), does it make sense to use that info to mark certain objects as negative if they are cropped out?

mrT23 commented 3 years ago

Read more about squish vs. crop resizing here: https://forums.fast.ai/t/resize-instead-of-crop/28680/10 and https://docs.fast.ai/vision.augment.html#Resize-with-crop,-pad-or-squish

Crop-resizing basically works only on ImageNet, because the vast majority of objects there are zoomed-in and centered. It's a common mistake to automatically apply ImageNet-style resizing to other datasets. Squish-resizing usually gives better results and is less prone to catastrophic mistakes.
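
In torchvision terms, the difference is roughly the following (the size 448 here is only an example):

from torchvision import transforms

# squish: resize both sides to the target, changing the aspect ratio; nothing is cut off
squish_resize = transforms.Resize((448, 448))

# ImageNet-style crop: keep the aspect ratio, then cut out a center square,
# which can remove objects near the image borders
crop_resize = transforms.Compose([
    transforms.Resize(448),       # shorter side -> 448, aspect ratio preserved
    transforms.CenterCrop(448),
])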

wuhy08 commented 3 years ago

Thank you @mrT23 for the elaboration. Really helpful.

adam-dziedzic commented 3 years ago

Hello @shuijundadoudou, can you share the training setup you used to get ~82.5 mAP with the cross-entropy loss? For example, the optimizer, learning rate, augmentation, and any training tricks?

This would be helpful. By the way, I think that @shuijundadoudou used the asymmetric loss (ASL), though.

mrT23 commented 3 years ago

I agree.

We cannot share our training code as-is due to commercial limitations, but once public code is shared, we can try to help improve it and get results similar to the ones in the article.

GhostWnd commented 3 years ago

I also want to reproduce the results on MS COCO, but due to my limited GPU resources and time, I resize the images to 224*224. I use no training tricks and keep the learning rate constant at 1e-4. At first I thought that even if I couldn't reach 86.6 mAP, I could get at least 70 mAP, which would be enough for me. But after about 100 iterations in epoch 1, the loss decreases from 230 to 90 and then stops decreasing, and the mAP on the validation set is 6, which is very low. I wonder whether this is because I didn't implement the tricks, because I shouldn't resize the images to 224, or simply because my code is wrong.

Here is my training code (I don't change other files like tresnet.py). I am a deep learning beginner, so I would be very grateful if you could point out my problems; this has puzzled me for quite a long time.

# -*- coding: utf-8 -*-
# Note: the imports below are my guess at this repo's layout; they were not shown in my original paste.
import argparse
import os

import torch
import torchvision.transforms as transforms

from src.helper_functions.helper_functions import CocoDetection
from src.loss_functions.losses import AsymmetricLoss
from src.models import create_model
from validate import validate_multi  # assumption: the same validation routine as in validate.py

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('data', metavar='DIR', help='path to dataset')
parser.add_argument('--model-name', default='tresnet_l')
parser.add_argument('--model-path', default='./TRresNet_L_448_86.6.pth', type=str)
parser.add_argument('--num-classes', default=80)
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 16)')
parser.add_argument('--image-size', default=224, type=int, metavar='N',
                    help='input image size (default: 448)')
parser.add_argument('--thre', default=0.8, type=float, metavar='N',
                    help='threshold value')
parser.add_argument('-b', '--batch-size', default=32, type=int, metavar='N',
                    help='mini-batch size (default: 16)')
parser.add_argument('--print-freq', '-p', default=64, type=int, metavar='N',
                    help='print frequency (default: 64)')


def main():
    args = parser.parse_args()
    args.batch_size = args.batch_size

    # setup model
    print('creating model...')
    args.do_bottleneck_head = False
    model = create_model(args).cuda()
    model.train()
    print('done\n')

    # Data loading code
    normalize = transforms.Normalize(mean=[0, 0, 0],
                                     std=[1, 1, 1])

    instances_path_val = os.path.join(args.data, 'annotations/instances_val2017.json')
    instances_path_train = os.path.join(args.data, 'annotations/instances_train2017.json')

    data_path_val = os.path.join(args.data, 'val2017')
    data_path_train = os.path.join(args.data, 'train2017')
    val_dataset = CocoDetection(data_path_val,
                                instances_path_val,
                                transforms.Compose([
                                    transforms.Resize((args.image_size, args.image_size)),
                                    transforms.ToTensor(),
                                    normalize,
                                ]))
    train_dataset = CocoDetection(data_path_train,
                                  instances_path_train,
                                  transforms.Compose([
                                      transforms.Resize((args.image_size, args.image_size)),
                                      transforms.ToTensor(),
                                      normalize,
                                  ]))

    print("len(val_dataset)): ", len(val_dataset))
    print("len(train_dataset)): ", len(train_dataset))
    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=False)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=False)

    criterion = AsymmetricLoss()
    params = model.parameters()
    optimizer = torch.optim.Adam(params, lr=0.0001)
    total_step = len(train_loader)

    highest_mAP = 0
    trainInfoList = []
    Sig = torch.nn.Sigmoid()
    for epoch in range(5):
        for i, (inputData, target) in enumerate(train_loader):
            model.train()
            inputData = inputData.cuda()
            target = target.cuda()
            target = target.max(dim=1)[0]  # reduce annotations to one multi-label target vector per image
            # output = torch.nn.Sigmoid(model(inputData))
            output = Sig(model(inputData))
            loss = criterion(output, target)

            model.zero_grad()
            loss.backward()
            optimizer.step()

            # store information
            if i % 10 == 0:
                trainInfoList.append([epoch, i, loss.item()])
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                      .format(epoch, 5, i, total_step, loss.item()))

            if (i + 1) % 100 == 0:
                # store model
                torch.save(model.state_dict(), os.path.join(
                    'models/', 'model-{}-{}.ckpt'.format(epoch + 1, i + 1)))
                mAP_score = validate_multi(val_loader, model, args)  # use the same validation routine as validate.py
                if mAP_score > highest_mAP:
                    highest_mAP = mAP_score
                    print('current highest_mAP = ', highest_mAP)
                    torch.save(model.state_dict(), os.path.join(
                        'models/', 'model-highest.ckpt'))


if __name__ == '__main__':
    main()

1006927966 commented 3 years ago

@GhostWnd Hello, did you ever find out what was causing your problem? I ran into the same issue. Thanks!

ChenAnno commented 2 years ago

We honestly haven't encountered any case where ASL did not easily outperform cross entropy.

Here are some training tricks we used (they are quite standard and can also be found in public repositories like this one); see if anything differs from your framework:

  • for the learning rate, we use a one-cycle policy (warmup + cosine decay) with the Adam optimizer and a max learning rate of ~2e-4 to 4e-4
  • it is very important to also use EMA
  • true weight decay of 1e-4 ("true" == no weight decay for batch norm and bias)
  • we have our own augmentation package, but it is important to use at least standard AutoAugment
  • Cutout of 0.5 (very important)
  • squish resizing, not crop (important)
  • try replacing ResNet with TResNet; it will give you the same GPU speed with higher accuracy

That's what I can think of off the top of my head.

Many thanks for the details, they really helped me a lot!