GAIA-vision / GAIA-ssl

Code for fine-tuning subnets for a downstream task #3

Open · tommiekerssies opened this issue 2 years ago

tommiekerssies commented 2 years ago

Hi, I can't find the code used to fine-tune a subnet on a downstream task. Specifically, I want to reproduce the results for CIFAR-100. Is the code for that available somewhere?

NickChang97 commented 2 years ago

You can refer to the classification test in https://github.com/open-mmlab/mmselfsup/tree/openselfsup, but you need to make some changes to run with an iteration-based runner instead of an epoch-based runner.
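For reference, a minimal sketch of that runner change in an mmcv-style config (key names follow common mmcv conventions and may differ slightly across mmselfsup versions):

```python
# Sketch: switching an mmcv-style config from epoch-based to
# iteration-based training. Values here are placeholders.

# Epoch-based (a typical default):
# runner = dict(type='EpochBasedRunner', max_epochs=100)

# Iteration-based:
runner = dict(type='IterBasedRunner', max_iters=20000)

# Intervals are then counted in iterations, not epochs:
checkpoint_config = dict(interval=2000, by_epoch=False)
lr_config = dict(policy='step', step=[12000, 16000], by_epoch=False)
```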

tommiekerssies commented 2 years ago

After many days of trial and error, I cannot reproduce your results for CIFAR-100. Would you mind sharing the code? The link you shared does not contain the exact code you used, I suppose.

NickChang97 commented 2 years ago

> After many days of trial and error, I cannot reproduce your results for CIFAR-100. Would you mind sharing the code? The link you shared does not contain the exact code you used, I suppose.

No problem, I can provide you with the training config.

NickChang97 commented 2 years ago

> After many days of trial and error, I cannot reproduce your results for CIFAR-100. Would you mind sharing the code? The link you shared does not contain the exact code you used, I suppose.

You can refer to this config: GAIA-ssl/configs/classification/cifar100/downstream_task_finetune_cifar100.py. If you encounter something unexpected, please let me know.

tommiekerssies commented 2 years ago

Thank you very much, that config clears up a lot of things for me! :)

tommiekerssies commented 2 years ago

I successfully reproduced the results for CIFAR-100 with the provided config. However, it is not clear to me how you found the subnet provided in that config. I find that the computed distance from the provided subnet to the supernet is quite average; from a random sample of only a few (2-3) subnets, I can already find a better one. Would you be so kind as to also share the config for finding the provided subnet? Maybe I'm doing something wrong.

NickChang97 commented 2 years ago

> I successfully reproduced the results for CIFAR-100 with the provided config. However, it is not clear to me how you found the subnet provided in that config. I find that the computed distance from the provided subnet to the supernet is quite average; from a random sample of only a few (2-3) subnets, I can already find a better one. Would you be so kind as to also share the config for finding the provided subnet? Maybe I'm doing something wrong.

Yes, you are right; we also found this when we wanted to extend the paper to a journal version. The domain really matters a lot. For a small-scale downstream task, the better way to use this supernet is to train subnets on the downstream task and search on their validation performance directly.
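To make that concrete, a minimal sketch of such a direct search; sample_subnet, finetune, and evaluate are hypothetical stand-ins for the actual utilities:

```python
# Sketch: pick a subnet by downstream validation accuracy instead of
# a distance metric. All helper names here are hypothetical.

def direct_search(supernet, train_set, val_set, budget=20):
    best_acc, best_arch = 0.0, None
    for _ in range(budget):
        arch = sample_subnet(supernet)     # draw a candidate architecture
        model = finetune(arch, train_set)  # short fine-tune on the downstream task
        acc = evaluate(model, val_set)     # validation accuracy as the search signal
        if acc > best_acc:
            best_acc, best_arch = acc, arch
    return best_arch, best_acc
```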

tommiekerssies commented 1 year ago

Actually, I find that the smallest subnet usually has the lowest distance, independent of the domain. Could you explain why that is?

NickChang97 commented 1 year ago

Indeed, I didn't use the smallest subnet in my pilot experiments. However, I checked the architectures searched on COCO and Cityscapes, and the smallest subnet does not have the lowest distance there. Do you find this phenomenon on some classification datasets? I am sorry, I am not sure about this. Maybe the smallest subnet and the largest subnet are always sampled during training because of the sandwich rule?
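For context, the sandwich rule (as in slimmable networks / BigNAS) samples the smallest and largest subnets at every training step, plus a few random ones; a minimal sketch with hypothetical helper names:

```python
# Sketch: one training step under the sandwich rule. The two extremes
# are trained at every iteration, which could make the supernet fit
# them especially well. Helper names are hypothetical.

def sandwich_step(supernet, optimizer, batch, n_random=2):
    archs = [supernet.smallest_arch(), supernet.largest_arch()]
    archs += [supernet.random_arch() for _ in range(n_random)]
    optimizer.zero_grad()
    for arch in archs:
        supernet.set_active_subnet(arch)   # activate this subnet's slice of weights
        loss = supernet.forward_train(batch)
        loss.backward()                    # accumulate gradients across subnets
    optimizer.step()
```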

tommiekerssies commented 1 year ago

I find this specifically for the dense distance, actually.

tommiekerssies commented 1 year ago

On CIFAR-10 and a custom dataset I have.

NickChang97 commented 1 year ago

I am sorry, I really have no idea about this. What's your opinion? We can discuss it here.

tommiekerssies commented 1 year ago

I'm not sure yet. I think I did find a problem in your code, though. For the dense distance, you compute the distance between q and k. You use the forward method of dynamic_moco with mode='extract' and specify extract_from as encoder_q or encoder_k. However, in extract mode, it always uses self.backbone and ignores the extract_from parameter. So you're actually not computing the similarity of relative relations between student and teacher as described in the paper, or am I missing something here?

NickChang97 commented 1 year ago

> I'm not sure yet. I think I did find a problem in your code, though. For the dense distance, you compute the distance between q and k. You use the forward method of dynamic_moco with mode='extract' and specify extract_from as encoder_q or encoder_k. However, in extract mode, it always uses self.backbone and ignores the extract_from parameter. So you're actually not computing the similarity of relative relations between student and teacher as described in the paper, or am I missing something here?

Yes, you are right. It's my fault for releasing a wrong version; the released part about dense prediction is indeed wrong. I have fixed it. Sorry for that.
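For anyone following along, a minimal sketch of what the corrected dispatch looks like (names approximate the dynamic_moco interface; this is not the repo's exact code):

```python
# Sketch: in extract mode, the encoder choice should follow the
# extract_from argument rather than always using self.backbone.
# In OpenSelfSup-style MoCo, encoder_q/encoder_k are (backbone, neck)
# pairs, so index 0 is the backbone.

def forward_extract(self, img, extract_from='encoder_q'):
    if extract_from == 'encoder_q':
        return self.encoder_q[0](img)  # student/query backbone
    elif extract_from == 'encoder_k':
        return self.encoder_k[0](img)  # teacher/key backbone
    raise ValueError(f'unknown extract_from: {extract_from}')
```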

NickChang97 commented 1 year ago

You can use the Cityscapes dataset to validate the results in the paper. It only has about 3k images, so you can verify the searched architectures in the 1G~2G FLOPs group.

tommiekerssies commented 1 year ago

Thanks again. I was wondering: if the dense distance is also a cosine distance, how come the values are much larger than 1? Doesn't cosine distance have to be in the range [-1, 1]?

NickChang97 commented 1 year ago

> Thanks again. I was wondering: if the dense distance is also a cosine distance, how come the values are much larger than 1? Doesn't cosine distance have to be in the range [-1, 1]?

Only distance_metric == 'kl' is valid, because the feature dimension may differ between a subnet and the largest network. The code for distance_metric == 'cosine' is wrong, because it computes a dot product of two unnormalized vectors.
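To illustrate, a small self-contained example: cosine similarity requires L2-normalizing both vectors first, while a raw dot product is unbounded:

```python
import torch
import torch.nn.functional as F

q = torch.randn(8, 128)  # e.g. subnet features
k = torch.randn(8, 128)  # e.g. largest-network features (same dim here)

raw = (q * k).sum(dim=1)                 # unnormalized dot product: can exceed 1
cos = F.cosine_similarity(q, k, dim=1)   # always in [-1, 1]

assert cos.abs().max() <= 1.0 + 1e-6
print(raw.abs().max(), cos.abs().max())
```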