Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Increase tile size to 512x512 pixels #78

Closed: yellowcap closed this issue 9 months ago

yellowcap commented 9 months ago

For easier use in benchmark applications, we should increase the tile size to 512 pixels squared. This does not have large implications for the model training and can help with benchmarking. It will also reduce the number of files to be stored.

mattpaul commented 9 months ago

Hi @yellowcap, just looking for some clarity. Can you confirm that:

if so, @weiji14 is it safe to assume that:

and therefore we now expect:

weiji14 commented 9 months ago

The patch size might change from 32x32 to 64x64 (to be decided). We are still serving a single embedding of size 1x768 for each 512x512 chip, which is the mean/average taken along the spatial dimension, see discussion at https://github.com/Clay-foundation/model/pull/56#discussion_r1411249922.

The (1, 256+1, 768) embedding would be the unaveraged embedding. I shared that initially to see the in-chip distribution of the embeddings, sorry if that was confusing.
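For readers following along, here is a minimal sketch of the pooling described above, assuming a ViT-style layout of one class token plus 256 patch tokens (e.g. a 512x512 chip with 32x32 patches, giving a 16x16 patch grid) and assuming the class token is dropped before pooling; the shapes and PyTorch usage are illustrative, not the actual Clay code:

```python
import torch

# Illustrative shapes only, not the actual Clay model code.
# One class token plus 256 patch tokens, each 768-dimensional.
tokens = torch.randn(1, 256 + 1, 768)  # (batch, cls + patches, dim)

patch_tokens = tokens[:, 1:, :]        # drop the class token -> (1, 256, 768)
embedding = patch_tokens.mean(dim=1)   # mean over the spatial dim -> (1, 768)
```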

yellowcap commented 9 months ago

@mattpaul the tiles here are indeed the "chips". The 512 size is just for data collection, to match benchmark datasets better. For the model we will most likely still feed it 256x256 chips. In any case, the output embedding size is a hyperparameter of the model, which is relatively independent of the input size.
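To make that independence concrete, here is a hypothetical ViT-style configuration sketch; the names and values are illustrative, not the Clay model's actual config:

```python
# Hypothetical configs: the embedding width stays 768 regardless of input size.
encoder_256 = dict(img_size=256, patch_size=32, embed_dim=768)  # (256/32)^2 = 64 patch tokens
encoder_512 = dict(img_size=512, patch_size=32, embed_dim=768)  # (512/32)^2 = 256 patch tokens
# Changing img_size changes the token count, not the 768-dim embedding space.
```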

mattpaul commented 9 months ago

@yellowcap thanks for the clarification. I guess I was assuming that whenever we use the term "chip" it would mean an image of the same size regardless of context, but that's not necessarily the case. Good to know!

I gather there are benchmark datasets out there that we would like to compare against or otherwise work with, and they typically have chips of size 512x512; therefore we're modifying our chip factory to process MGRS tiles into chips of similar size for our training dataset of images.
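For concreteness, a minimal sketch of that kind of chipping with rasterio; the filename, band handling, and the non-overlapping 512-pixel grid are assumptions for illustration, not the actual chip factory code:

```python
import rasterio
from rasterio.windows import Window

CHIP = 512  # target chip size discussed in this issue

# Sketch: cut one source tile into non-overlapping 512x512 chips.
# "tile.tif" and the plain grid walk are placeholders, not the real pipeline.
with rasterio.open("tile.tif") as src:
    for row in range(0, src.height - CHIP + 1, CHIP):
        for col in range(0, src.width - CHIP + 1, CHIP):
            chip = src.read(window=Window(col, row, CHIP, CHIP))
            # chip is a (bands, 512, 512) array, ready to write out
```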

When you say "for the model we will most likely still feed it 256x256 chips," this suggests that there is an additional model input transform that occurs between the chip factory output of 512x512 and the model input layer, which will likely continue to expect input of 256x256 = 65,536 pixels, correct? Perhaps something as simple as consuming one 512x512 chip and producing four 256x256 chips with which to feed the model.
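As a toy sketch of that 512-to-four-256 split, assuming channels-first arrays; this is illustrative only, not actual pipeline code:

```python
import numpy as np

def split_chip(chip: np.ndarray) -> list[np.ndarray]:
    """Split one (bands, 512, 512) chip into four (bands, 256, 256) chips.

    Illustrative only; not part of the actual Clay pipeline.
    """
    assert chip.shape[-2:] == (512, 512)
    return [
        chip[:, r:r + 256, c:c + 256]
        for r in (0, 256)
        for c in (0, 256)
    ]
```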

To your point, output embedding size is a hyperparameter of the model and is ultimately determined by the dimensionality of the latent space / embedding space, in our case = 768, correct? Or is there any dimensionality reduction happening between embedding space and exported embeddings?

yellowcap commented 8 months ago

> When you say "for the model we will most likely still feed it 256x256 chips," this suggests that there is an additional model input transform that occurs between the chip factory output of 512x512 and the model input layer, which will likely continue to expect input of 256x256 = 65,536 pixels, correct? Perhaps something as simple as consuming one 512x512 chip and producing four 256x256 chips with which to feed the model.

Yes, that is correct, although the v0 model now actually uses the 512x512 size directly after all. This is something we will iterate on for v1. It's an optimisation question from a model point of view.

> To your point, output embedding size is a hyperparameter of the model and is ultimately determined by the dimensionality of the latent space / embedding space, in our case = 768, correct? Or is there any dimensionality reduction happening between embedding space and exported embeddings?

Not at the moment; we are keeping the "full resolution" of the embeddings as they come out of the model.