ChenDelong1999 / RemoteCLIP

🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
https://arxiv.org/abs/2306.11029
Apache License 2.0

Mean/Standard deviation over training dataset to remove rogue embedding dimensions outside normal magnitudes? #22

Open BradNeuberg opened 4 months ago

BradNeuberg commented 4 months ago

I've previously been experimenting with the RSICD CLIP model [1], which was trained only on the RSICD dataset. I'm very impressed by how many data sources you used to train RemoteCLIP, which you document in your paper with this table:

[table from the paper: RemoteCLIP training data sources]

In my experiments with RSICD I found that normalizing inputs with the mean and standard deviation of the training data was an important part of getting quality results: the input imagery then follows the remote sensing distribution set by the training data rather than the mean/std of the parent OpenCLIP model, since remote sensing imagery obviously has a very different distribution than standard consumer photography.

The large number of datasets behind your model is a strength, but it unfortunately makes it quite hard for us to compute our own mean/std. Would it be possible for you to compute a per-band mean and std over the datasets you trained on, which you presumably have locally? Computing this with a sampling strategy would probably be fine as long as the sample size is large enough.
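
To be concrete, here is roughly what I have in mind, a sketch assuming a list of training image file paths is available (the `image_paths` argument and the sample size are placeholders, not anything from your codebase):

```python
# Rough sketch: estimate per-band mean/std from a random sample of training images.
import random
import numpy as np
from PIL import Image

def estimate_band_stats(image_paths, sample_size=10000, seed=0):
    """Estimate per-band (RGB) mean and std over a random sample of images."""
    rng = random.Random(seed)
    sample = rng.sample(image_paths, min(sample_size, len(image_paths)))
    sums = np.zeros(3)
    sq_sums = np.zeros(3)
    n_pixels = 0
    for path in sample:
        # Scale pixels to [0, 1] so the statistics match torchvision-style Normalize.
        img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
        pixels = img.reshape(-1, 3)
        sums += pixels.sum(axis=0)
        sq_sums += (pixels ** 2).sum(axis=0)
        n_pixels += pixels.shape[0]
    mean = sums / n_pixels
    std = np.sqrt(sq_sums / n_pixels - mean ** 2)
    return mean, std  # 3-element per-band vectors
```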

Thanks again for such a great paper and for open-sourcing your project :)

[1] https://github.com/arampacha/CLIP-rsicd

BradNeuberg commented 4 months ago

If the training data has been aggregated on your side and is easily accessible, I can do this myself, but I'm not sure I have access to all of these datasets.

ChenDelong1999 commented 4 months ago

Hi, RemoteCLIP actually uses the default normalization mean/std for training, which can be found in: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/constants.py#L1-L2

We haven't done any ablation on these mean/std values. Since the final performance is not bad, we don't see this as a major bottleneck, and using the default parameters keeps the input images in the domain the CLIP vision encoder is already familiar with.
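
For reference, these defaults are the original OpenAI CLIP statistics. A preprocessing pipeline using them looks roughly like the sketch below (an approximation of the standard CLIP image transform, not the exact RemoteCLIP preprocessing code):

```python
# Sketch of CLIP-style preprocessing with the open_clip default statistics.
from open_clip.constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD
from torchvision import transforms

# OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
# OPENAI_DATASET_STD  = (0.26862954, 0.26130258, 0.27577711)
preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=OPENAI_DATASET_MEAN, std=OPENAI_DATASET_STD),
])
```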

But it would be interesting to dive deeper into this! Thank you for the suggestion; we will investigate it further in future work.

BradNeuberg commented 4 months ago

Thanks! When working with RSICD CLIP [1] we actually found some interesting "rogue dimensions", where even after normalization a few vector components dominated across all generated embeddings. We found that if we normalized these rogue dimensions we got much better class separability and performance on downstream tasks:

[figure: per-dimension embedding magnitudes before (left) and after (right) rogue-dimension normalization]

The left panel shows what rogue dimensions look like: a handful of the 512 dimensions have values that are routinely 2-10x larger than the others. We treat these by calculating the mean and standard deviation of each dimension over a large enough batch of vectors and then rescaling those dimensions, giving the right panel, where each dimension has zero mean and unit standard deviation.

This has the property that when taking differences between vectors, the vector magnitudes are not dominated by the rogue dimensions. This can lead to cleaner cluster separation and better downstream performance on zero-shot tasks.

We believe this problem was first identified in "All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality" [2], which found that normalizing and removing these rogue dimensions is important in applied settings.

We do this normalization with scikit-learn's StandardScaler [3]: we take the RSICD training dataset, compute image and text embeddings for it, and then compute batch statistics to get a mean and standard deviation per dimension of the 512-length embedding. We end up with 512-length mean and std vectors for images and the same for text, which we can then use to normalize embeddings and suppress the rogue dimensions.
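
A rough sketch of that procedure (illustration only; the random arrays stand in for the real (N, 512) image/text embeddings computed over the RSICD training split):

```python
# Fit per-dimension statistics on training embeddings, then rescale so every
# dimension has zero mean and unit variance, taming the rogue dimensions.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_image_embeddings = rng.normal(size=(1000, 512))  # placeholder for image encoder outputs
train_text_embeddings = rng.normal(size=(1000, 512))   # placeholder for text encoder outputs

image_scaler = StandardScaler().fit(train_image_embeddings)
text_scaler = StandardScaler().fit(train_text_embeddings)

# image_scaler.mean_ / image_scaler.scale_ are the 512-length mean/std vectors we
# would share; transform() rescales each dimension to zero mean and unit variance,
# so no single rogue dimension dominates vector differences.
normalized_image_embeddings = image_scaler.transform(train_image_embeddings)
normalized_text_embeddings = text_scaler.transform(train_text_embeddings)
```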

We suspect that RemoteCLIP suffers from these same rogue dimensions and would be greatly helped by normalizing and removing them. Unfortunately, RemoteCLIP was trained on a very diverse set of data that we don't have access to.

If you are able to, it would greatly help if you could randomly sample some subset of your training data, compute image and text embeddings for it, and then provide the resulting 512-length mean and standard deviation vectors for images and for texts. Others have found this kind of normalization important. We are happy to do it ourselves, but we would then need access to your training corpus. You could also randomly sample a representative subset of your training images and texts, put it in a cloud bucket of some kind, and we could compute those statistics and provide them here.

[1] https://huggingface.co/flax-community/clip-rsicd-v2
[2] https://arxiv.org/abs/2109.04404
[3] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

ChenDelong1999 commented 4 months ago

Hmm, that's very interesting, and thanks for your comments! 😀

I'm not sure whether you normalize each individual embedding before evaluation? (e.g., https://github.com/ChenDelong1999/RemoteCLIP/blob/main/retrieval.py#L138-L139)
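
By normalization I mean the standard CLIP-style L2 normalization of each feature vector before computing similarities, roughly like this sketch (the tensor here is a random placeholder; see retrieval.py in this repo for the actual evaluation code):

```python
# L2-normalize each embedding so cosine similarity reduces to a dot product.
import torch
import torch.nn.functional as F

features = torch.randn(8, 512)            # placeholder batch of image or text embeddings
features = F.normalize(features, dim=-1)  # each row now has unit L2 norm
```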

For data samples, the publicly available RSICD, RSITMD, and UCM datasets have both images and captions, and they were used for training RemoteCLIP.

I guess they can give a good enough estimate of the full RemoteCLIP training data, which I won't be able to release before paper acceptance :(

Btw, I am very curious how much this rogue dimension normalization can improve retrieval/zero-shot performance?

BradNeuberg commented 4 months ago

We've noticed that normalization seems to increase class separability when clustering, but we haven't yet gauged its effect on actual downstream performance. Since you already have a test harness, it might be interesting for you to see whether it gives you an accuracy bump. We are still building our own test harness for our particular problem, but increased class separability tends to be a proxy for improved downstream performance.

I agree we can probably just use the same values as for RSICD. Would you like me to provide the values we calculated ourselves so you can test them in your own harness and see if they increase your accuracy numbers? It could be a nice bump for your paper!


ChenDelong1999 commented 4 months ago

I think calculating the values is not very complicated; we could first try it ourselves and investigate further.

I am now super curious about which types of inputs make these rogue dimensions activate 🤔

Thank you very much for the discussion. Let us try it, and we'll get back to you when we have some results!

BradNeuberg commented 4 months ago

Do you have an email address? I could send you some sample Planet images that we’ve been testing with.
