cvcode18 / imbalanced_learning


Peta-training #7

Closed sbharadwajj closed 5 years ago

sbharadwajj commented 5 years ago

Hi,

I have a few questions regarding training on PETA.

  1. Are all the optimizers used Adam? Including the attention modules?
  2. After the predictions are made from the model, is a threshold used to get either class?
  3. What are the learning rates and weight decay of the optimizers of attention modules?
  4. What are the class weights used for the 35 classes using the weighted focal loss?
nsarafianos commented 5 years ago
  1. Yes and yes
  2. For WIDER where mAP is reported only the probabilities are used. For PETA dataset the prob. threshold is 0.5
  3. The weight decay is 0.0005 and the initial learning rate is 1e-4, divided by 10 (twice in total) when there is no improvement on the validation set for more than 8 epochs. Some information can also be found here
  4. If the number of positive training samples for an attribute is >=50% of the total training samples, then the weight is equal to 1. If it is less, you need to compute how much extra to penalize misclassifications. For example, if you have 100 positive and 900 negative samples, the corresponding class weight will be 9, since you need to penalize this loss x9. When training on PETA, I would first suggest you train with binary cross-entropy loss using these weights to observe the impact of this cost-sensitive learning. Then play with focal loss w/o and w/ these weights.
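The class-weight rule in point 4 can be expressed in a few lines of plain Python (the helper name is mine, not from the repo):

```python
# Hypothetical helper implementing the rule above: weight 1 if an attribute's
# positives are >= 50% of the training samples, otherwise the negative/positive
# ratio, so misclassified positives of rare attributes are penalized more.
def attribute_weights(pos_counts, total_samples):
    weights = []
    for pos in pos_counts:
        if pos >= total_samples / 2:
            weights.append(1.0)
        else:
            weights.append((total_samples - pos) / pos)
    return weights

# The example from the comment: 100 positive vs. 900 negative samples -> 9.
print(attribute_weights([100, 600], 1000))  # -> [9.0, 1.0]
```

These per-attribute weights could then be supplied, for instance, as the `pos_weight` tensor of `torch.nn.BCEWithLogitsLoss` for the 35 attributes.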
sbharadwajj commented 5 years ago

Hi Nikolaos, Thank you for the reply.

As you suggested, I trained it with binary cross-entropy loss, first freezing the weights of the primary network and training only the attention module.

  1. I noticed that after about 40 epochs, the training loss plateaued and the validation accuracy stopped improving. Is this the same trend you observed?
  2. I am using a ResNet-50 backbone instead of ResNet-101; any thoughts on this?
nsarafianos commented 5 years ago

Hi Shrisha,

I do not remember how long it took, to be honest. Forty epochs for just the attention mechanism does not sound like too many, especially if you're using a learning-rate scheduler, but again I'm not sure. Let me run some experiments over the weekend (I'll use the ResNet-50 backbone as you do) and I'll get back to you early next week.

sbharadwajj commented 5 years ago

For the ResNet-50, I have plugged the attention modules in at the 3rd (1024 features) and 4th (2048 features) layers. Oh, and I have used PyTorch.

Sure. That would be great!

nsarafianos commented 5 years ago

@chichilicious Apologies for not getting back to you earlier; I have not forgotten you. I had to work on some things that came up this past week. I will hopefully get back to you with numbers on Tuesday.

sbharadwajj commented 5 years ago

@nsarafianos Thank you for remembering and yes, I understand. I shall wait until next week.

nsarafianos commented 5 years ago

Hi Shrisha,

I found some time and ran experiments with the PETA dataset.

  1. Resized all images to 128x256 (width x height); since we have pedestrians, it makes sense to make them rectangular
  2. Trained with Nesterov accelerated gradient descent with initial learning rate equal to 2e-3 and a learning rate scheduler (wd=0.0004, momentum=0.9)
  3. Data augmentation includes resizing to 1.25x the target size and then taking random 128x256 crops, as well as random (w/ 50% prob) horizontal flips
  4. I used an ImageNet-pretrained ResNet-50 and trained the backbone both w/o and w/ the class weights. I then froze the backbone and trained the attention mechanism, which was placed where you suggested. You are correct that it takes fewer than 50 epochs to converge, since there aren't that many parameters. Finally, I unfroze the backbone and finetuned everything together. The results I obtained seem in line with those in Table 4 in the paper.

That's all. I hope it's helpful.

[attached image: PETA results]

sbharadwajj commented 5 years ago

Hi Nikolaos,

Thank you so much for taking the time to run the experiments.

  1. Isn't the Adam optimizer used for the main network and attention modules for PETA? Nesterov accelerated gradient is a form of SGD, right?

Thank you again for the tabulated results.

nsarafianos commented 5 years ago

Hi Shrisha,

In the paper we used Adam everywhere for the PETA dataset, whereas in yesterday's experiments I switched to Nesterov SGD because it was the first optimizer I found. For the results above I did not try to find the best optimizer/hyperparameters, so your results might differ a little.

As for Nesterov Accelerated Gradient, it's slightly different from the original SGD in terms of how the updates are performed. There's an abundance of literature that explains it better than I will (for example here and here)
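To make the difference concrete, here is a toy single-step comparison (my own illustration, not from the thread): classical momentum evaluates the gradient at the current point, while Nesterov evaluates it at the lookahead point.

```python
# f(theta) = theta^2, so grad(theta) = 2 * theta
def grad(theta):
    return 2.0 * theta

mu, lr = 0.9, 0.1        # momentum coefficient and learning rate
theta, v = 1.0, -0.5     # current parameter and velocity

# Classical momentum: gradient at the current point theta.
v_cm = mu * v - lr * grad(theta)                # approx -0.65

# Nesterov: gradient at the lookahead point theta + mu * v.
v_nag = mu * v - lr * grad(theta + mu * v)      # approx -0.56

print(theta + v_cm, theta + v_nag)
```

The lookahead gradient is smaller here because the velocity is already carrying the parameter toward the minimum, which is what gives Nesterov its "corrective" behavior.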

sbharadwajj commented 5 years ago

Ah, alright. Then I will experiment with both optimizers and observe the results. Thank you, I will go through the slides as well.

Thanks for all the guidance Nikolaos, it helped me a lot! :100:

valencebond commented 5 years ago

@chichilicious @nsarafianos Sorry to bother you, but I find that the network in this repo is not the same as the one described in the paper. May I ask whether your experimental results correspond to the network in the paper or the network in this repo?

nsarafianos commented 5 years ago

Hi @valencebond,

This repo, as well as the paper, uses a ResNet-101 backbone with attention modules plugged in at different levels of the backbone. In this thread we were talking about a ResNet-50, since that's what the initial question was about :)

Let me know if you have any questions and I would be more than happy to help.

valencebond commented 5 years ago

Thanks for your reply @nsarafianos. In this repo, the attention mask is C512-C512-Cn with kernel size 3, but in the paper the attention mask is C256-C256-C with kernel size 1. The subnetwork also differs: C256-C512-C1024 here versus C256-C512-C512 in the paper. So I am confused about which settings produced the experimental results you discussed.

nsarafianos commented 5 years ago

Oh, my bad then. I'm out of the office at the moment, but I will check tomorrow and get back to you with an updated response.

valencebond commented 5 years ago

@nsarafianos thank you so much, looking forward to your reply

nsarafianos commented 5 years ago

Hi @valencebond,

I just checked and yes, you're right. The results in the paper were obtained with what is reported in the supplementary material (not with what's in this repo). The differences (number of nodes in the layer, and the kernel size) should have a minimal impact on final performance, so I expect small differences. If your results are more than 1-1.5% off, please let me know and I can run the experiments again to double-check.

In any case, please keep me posted :)

valencebond commented 5 years ago

@nsarafianos Thanks again for your clear explanation. But the critical point is: in my experiments following the paper settings, when I use an ImageNet-pretrained ResNet-50 and train the backbone (50 epochs with SGD, lr 0.001), I get the results below without data augmentation.

[attached image: results]

So I am confused about my results. Without the same baseline, I can't verify the effect of the class weights or the attention module. And since there are no reimplementation results in this repo, I also can't find the error in my reimplementation.

valencebond commented 5 years ago

@chichilicious can you share your code or experiment result in PETA? thanks a lot

nsarafianos commented 5 years ago

Hi @valencebond ,

Follow-up Q since I might have missed it: Are these results you posted above on WIDER or PETA?

sbharadwajj commented 5 years ago

@valencebond Can you email me at shrishabharadwaj5@gmail.com? I can help you out with the code. I achieved an F1 of 86, and I think the mAP was around 83, though I am not very sure about the mAP score.