Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

Find good initial representation of spatial embeddings #12

Closed by yellowcap 5 months ago

yellowcap commented 10 months ago

We had fruitful discussions about how to structure embeddings, and we generally agree that we should experiment with hierarchical embeddings, testing novel model architectures with potentially big improvements in certain use cases.

Insights from the discussion so far are below.

Using XYZ tiles as input to the model

Pro

Con

Hierarchical encoding

Pro

Con

brunosan commented 10 months ago

Thanks for this. It's a good summary.

The only bit I'll add is that imposing an explicit spatial scheme enables the other "trick" that I think can help a lot (the first one being absolute anchors): shared semantics.

I really like the idea that few semantics are truly local; most things are similar at regional, continental, or even global levels. E.g. "forests" are everywhere, even when locally forests look somewhat different. If we impose these shared levels of semantics, I believe we can reap outsized benefits when we scale this up to global models.

The challenge is that when you share part of an embedding with other locations, you might induce a lot of noise, especially in those parts that are shared across all embeddings (global semantics). I believe we can overcome that problem by scaling the updates down in proportion to how widely each part is shared. This way only genuinely global semantics are learned at the global level, where many gradients pointing in the same direction add up.

Another way to explain this. I'll use a random location as an example (with z-x-y tile identifiers for clarity):

z15-x23-y562 = [ "20 float numbers" ], but that tile's grand-grand-parent is z5-x23-y67 = [ "20 float numbers" ], which is shared with 4^10 = 1,048,576 locations at z15. And that tile's ancestor in turn is z0-x0-y0 = [ "20 float numbers" ], which is shared with all 4^15 tiles at z15.
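The ancestor relationship and the sharing counts follow from standard XYZ tile arithmetic. A minimal sketch (helper names are mine, and the concrete tile indices in the comment above are illustrative, so the computed ancestor coordinates need not match them):

```python
def ancestor(z, x, y, levels_up):
    """Return the (z, x, y) of the tile `levels_up` levels above in the XYZ pyramid.

    Going up one level halves the x and y indices (integer division).
    """
    shift = 2 ** levels_up
    return z - levels_up, x // shift, y // shift

def tiles_sharing_ancestor(levels_up):
    """Each level up, one tile covers 4x more descendant tiles."""
    return 4 ** levels_up

print(ancestor(15, 23, 562, 10))   # the z5 ancestor of the z15 tile
print(tiles_sharing_ancestor(10))  # 1,048,576 z15 tiles share that z5 embedding
print(tiles_sharing_ancestor(15))  # all z15 tiles share the z0 root embedding
```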

So the Full z15-x23-y562 embedding is the concatenation of the chain:

z15-x23-y562 = [ *[ "20 float numbers"], *[z5-x23-y67], *[z0-x0-y0]]

During backpropagation we update the weights, dividing the learning rate by e.g. how many tiles share each segment:

z15-x23-y562 += gradient * learning_rate * [ *[ "20 float numbers"], (*[z5-x23-y67])/4^10, (*[z0-x0-y0])/4^15]
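The chain concatenation and the damped update can be sketched as follows (a toy NumPy version under my own assumptions; names, shapes, and the update sign follow the pseudocode above, not the Clay codebase):

```python
import numpy as np

DIM = 20  # per-segment embedding size, as in the example above

# One learnable vector per tile in the chain, all zero-initialized here.
embeddings = {
    "z15-x23-y562": np.zeros(DIM),  # unique to this tile
    "z5-x23-y67":   np.zeros(DIM),  # shared by 4**10 z15 tiles
    "z0-x0-y0":     np.zeros(DIM),  # shared by all 4**15 z15 tiles
}
# How many z15 tiles share each segment; used to damp its updates.
share_count = {"z15-x23-y562": 1, "z5-x23-y67": 4**10, "z0-x0-y0": 4**15}

def full_embedding(chain):
    """The full embedding is the concatenation of the chain's segments."""
    return np.concatenate([embeddings[key] for key in chain])

def apply_update(chain, gradient, learning_rate=0.1):
    """Update each segment, dividing the step by how widely it is shared."""
    for key, grad in zip(chain, np.split(gradient, len(chain))):
        embeddings[key] += learning_rate * grad / share_count[key]

chain = ["z15-x23-y562", "z5-x23-y67", "z0-x0-y0"]
apply_update(chain, np.ones(3 * DIM))  # toy gradient of all ones
```

With this damping, only gradients that consistently point the same way across many tiles accumulate in the shared segments, which is exactly the intended effect.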

yellowcap commented 5 months ago

We have settled on a simple sine/cosine transformation of plain lat/lon for v0.1 and v0.2.
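For reference, a minimal sketch of such an encoding (the exact normalization used in Clay v0.1/v0.2 may differ):

```python
import math

def latlon_encoding(lat, lon):
    """Map degrees to a smooth, bounded 4-vector.

    Using sin/cos of each coordinate keeps the encoding continuous where
    raw longitude jumps from +180 to -180.
    """
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    return (math.sin(lat_r), math.cos(lat_r), math.sin(lon_r), math.cos(lon_r))

# Tiles on either side of the antimeridian get nearly identical encodings.
east = latlon_encoding(0.0, 179.999)
west = latlon_encoding(0.0, -179.999)
```

Unlike the hierarchical scheme discussed above, this gives every sample a globally consistent, fixed-size location feature with no shared learnable state.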