Do a global run of embeddings

brunosan commented 3 months ago

We've been using Clay v1 embeddings directly, and via the Build/Explore apps. We've also done several types of partial benchmarking, so we are starting to feel comfortable with the quality of the model. We therefore should think about making large runs of existing open data and create embeddings, for our benefit to continue learning about Clay, but also to enable the community to leverage these open embeddings.

We still need to make decisions once we decide to make large runs:

Instrument? Sentinel?, NAIP? One Instrument, couple?
Unit of schema? Should do them at the file-level? Or spatial reference?
Spatial resolution? We've seen that many applications need the highest possible spatial resolution. Hence if Sentinel, a small tile size (but not too small to make the embeddings of lower quality). 128x128?
locations, time? Large coverage seems most important, but many users also request temporal changes. So I suggest either only wide spatial coverage, or 80% a large coverage run, and the 20% remaining many snapshots over time.
What format? I propose we wait and follow guidance from @cholmes on https://github.com/cloudnativegeo/geo-embeddings-survey
Hosted? source.coop
License? Open. Is CC-by best? OpenRail-M?
What is the cost of creation? It would be great to come up with a number.

Ideally, we can wrap this code to execute easily down the line, e.g. taking a STAC list and a spec file for chip_size, ... Note: Do not over-scope here, since we have the build app.

Probably out of scope, but the end-state at some point this year could be:

Sentinel-2 annual composites for EU
Sentinel-2 Level-2 files for a deforestation basin in Amazon with as many dates as possibe.
Same as above but Sentinel-1 files, or Landsat composites.
NAIP for whole states once.
NAIP for one state as many years as available.

Filing this also early to allow community requests, but we should aim to set a date for such run, e.g. end of June.

brunosan commented 3 months ago

I was talking with @konstantinklemmer and asking his help to help us make decisions here. Also pinging @cholmes and @bengmstrong @BradNeuberg @Clay-foundation/all and please ping others.

We will need to make decision within a month for the "Big Embeddings run". This will imply lots of decisions that are free to do now and VERY EXPENSIVE to correct later.

My questions, and my suggestions, but none of my suggestions are strongly held.

What instruments? I suggest Sentinel-2 annual composite.
What locations/time? I suggest to start somewhere in South America latest composite, and Amazon basin all available years. Then increase as budget allows.
What chip size? 128x128, with small sections at 64x64, and 256x256.
What output? Start with average of patch embedding, and maybe a separate file with all patch embeddings, and feature maps #291
What format? Geoparquet. I'd follow Earth Index Columns and Metadata here https://github.com/cloudnativegeo/geo-embeddings-survey/blob/main/data/earth_index/readme.md I'll also add Lossess. Either training loss, or a simple loss.
Host/License Hosted on source, with CC-By license.
How much budget to put for this? Let's start churning and see costs. If it deviates a lot from estimates, we rethink. Assuming a $1/hour g5.xlarge instance with an NVIDIA A10, processing batches of 10 Sentinel-2 inputs takes 10 seconds. Each 128x128 chip covers 2.5 square kilometers. This means we can process roughly 360 inputs per dollar. With a $10,000 budget to start with, that translates to a coverage of 9,000,000 square kilometers. Let's put a 50% penalty just bcuase, and it should give us enough for South America??

What are your thought @yellowcap @srmsoumya ? How much effort to pull this on your side? Should we continue trainning v1 first (#283 ) ?

Let's aim to kick this compute off July 15th?

yellowcap commented 3 months ago

If we use worldcover I would suggest a chip size of 100x100 or 200x200, then the chips fit nicely into their 10k x 10k source files. Maybe for Sentinel-2 we would use 100x100 to have a more fine grained resolution. Not sure what kind of features we hope to find based on the embeddings.

Regarding feature maps output, there are 4 feature maps of 32x32 pixels for 768 embeddings, stored as float32. If we assume the input is 4 bands of Sentinel-2 imagery at uint16, then the feature maps are much heavier than the original data. So I would not advise to store the feature maps and rely on running the model at inference time when doing segmentation tasks (did I get this correctly @srmsoumya ?)

Regarding cost we would have to do more test runs to understand it better. We were able to do US level runs already with a reasonable budget, so I think doing some continental scale processing or even global processing should be doable.

Note that the Sentinel-2 composites have limited quality in tropical areas, they are mostly cloud free, but not without haze, and there are small nodata gaps here and there. At least for the Worldcover composites. Happy to look at other sources for composite imagery if people have suggestions.

Finally, I would add at least one NAIP run for all of the US to the wish-list as well.

konstantinklemmer commented 3 months ago

After discussing with @brunosan and thinking a bit more about it, here is my rough "wishlist":

Global coverage embedding map. Chip size is less relevant as long as the whole, continuous planet is covered.
Sentinel-2 would be the preferred sensor; should of course be cloud free.
Ideally two time steps for each location; e.g. January and July (to roughly cover seasonality), but that's secondary.
Major TOM Sentinal 2A might be an option: https://github.com/ESA-PhiLab/Major-TOM

For each observation, ideally we'd have the following data (roughly sketched out): [chip_centroid_lon, chip_centroid_lat, timestamp, chip_thumbnail, clay_embedding, clay_loss]

This "wishlist" is motivated mostly by me wanting to dissect Clay embeddings and see what it learns. Guiding questions are e.g. How does the complexity of embeddings change over space? How representative are embeddings of environmental and human-activity measures? Can Clay embeddings be used as geographic priors?

This would also create a dense embedding database to be used in arbitrary downstream tasks. This allows direct comparison to competitors like MOSAIKS or SatCLIP. The approach would be as follows: Download Clay Embedding with lon/lat closest to downstream location -> Train model y_lonlat = f(ClayEmbedding_lonlat) -> Evaluate.

bengmstrong commented 3 months ago

Very cool that you're gearing up for a global run! Would love to pull/play with your embeddings. I agree that Sentinel-2 annual composites are the right starting point for global embeddings. To enable comparisons with other models it would be nice to use the same public free imagery. We've created/shared global sentinel-2 L2A composites for 2023 here which you are welcome to use (https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/) but they're a work in progress and do have some quality issues.

One other note @brunosan I think you dropped a factor of 10 in your back of the envelop math. Looks like you should be able to get through 3600 inputs per dollar right? (batch of 10 inputs / 10 seconds 3600 sec/hour 1hr/$) So it might be more affordable than you think!!

brunosan commented 2 months ago

Thanks everyone. I love that we are getting momentum here.

TLDR; So far I'm leaning for a

global run of the Sentinel-2 year composite, at 100px most recent year available with EG Sentinel-2 all-bands composite.
NAIP for CONUS. Latest, with 100px chip size too.
maybe? Selected locations (the training set?) to enable temporal and cross instrument studies.
maybe? Satellogic set

Released as CC-By (inheriting EG CC-By)

Still TBD format and adding what losses.

Source imagery

Thanks @bengmstrong and EG team for the data release. It seems to fit perfectly. Besides the files and the STAC endpoint, this blog post explains the method.

It meets the criteria of:

Fully open license. (CC-By)
Global. (there are mentions of "errors", should be get a black-list of these and run them when fixed? I've spotted checked and I only see the usual hard places like permanent clouds locations)
Recent (2023) (only global open composite this recent).

Notes:

This seems to be a median reduction of the "best" 16 scenes per location. What "best" mean? (least cloudy of the ~35/year?)

Chip Size

Boils down to 50px of 100px in my opinion. Costs grows quadratic since its an area. Also very small areas approach the patch size of 8px which means less chances for the self attention to learn about the surroundings.

Since we use the average, we can recreate embeddings at bigger chip sizes just averaging the smaller ones. It won't be the same sine the patches done with smaller chip sizes will not have paid "self-attention" to the patches outsize of that small chip.

It worries me that an area of 1km^2 is substantially big for many potential uses, limiting the usefulness of this large and expensive run, but doing smaller sizes is too expensive. we can do smaller chip sizes for selected places.

Cost estimates

From our "build" workers (the ones on the co-code app, which we might or might not use for this run), we see that in reality we are getting there ~10k chips/h/worker (we use H100 GPUs, so this the approach of a big GPU with a large batch than a cheaper GPU or CPUs). A worker costs $1800/month ($2.5$/h). Most of the time is spent on downloading, so chip size doesn't seem to be a strong factor.

This would mean 4k chips/$/h/worker.

Uncanny we get pretty much the same result that the napkin exercise (we should become consultants).

chip size (x10m/px)	cost unit (km^2/$/h/worker)	cost to run the world
`50x50` px	1000	$510K
`100x100` px	4000	$127K

50px is too expensive, 100px is doable. I'm hopefully @yellowcap optimism is true for this run. Let's just start and assess what coverage/$ we get.

BradNeuberg commented 2 months ago

Some small suggestions:

Not sure if the scale is possible, but is there a monthly Sentinel-2 global composite available? If a Clay v2 model were trained on such a monthly basis, over several years for example, the model might learn strong seasonal and time based correlations, which would be especially helpful for change detection problems.

In terms of chip size, can that be specified as a kind of metadata fed in as is already done for sensor details? I see varying the chip size even for the same sensor as providing several advantages:

At Planet we’ve noticed that small object detection for embeddings is aided by having smaller chip sizes, such as small buildings or small forest degradation areas. Being able to have a smaller chip size in urban areas, for example, would aid dealing with smaller objects over time
Varying the chip size acts as a regularizer that would force the model to not overfit to a particular chip size
Being able to use varying chip sizes in practice could be a powerful technique - use smaller chip sizes in known urban areas while larger chip sizes in relatively sparse area would trade off accuracy vs compute and storage costs.

In terms of storage, I agree geoparquet is a good format, as well as storing the centroid of the latitude and longitude. At Planet we’ve also stored a geometry column that corresponds to the exact chip bounding box behind an embedding, which can be very helpful for knowing exactly where an embedding is generated from.

Another useful thing to optionally store is a visual product image chip for that embedding, as a preview URL stored along with the embedding. This is a chipped visual product for the underlying analytic imagery and is stored as a PNG file in a Google bucket. This is very useful when presenting results to the user or showing things like clustering results. Not having a preview chip can make it much harder to deal with embeddings at scale.

At Planet we’ve been using 224x224 chips for our embeddings, with a 3m GSD pixel size for PlanetScope. As you’ve found yourself going to smaller chip sizes can significantly increase compute and storage costs. Ultimately we’ve wanted to figure out a way to store something like a pyramid of different representations, something like Matroyshka embeddings but that remains an R&D edge we haven’t figured out yet.

Something else we store in our embedding geoparquet files is quality information per chip, using a cloud and quality mask. This is very helpful to filter embeddings down based on quality, which is especially important for change detection problems over time. You might want to compute cloud and quality info using s2cloudless or fmask and store these in a consistent way. We store percentages for percent of haze, clouds, snow, null pixels, etc and use geopandas to quickly filter embeddings.

On Fri, Jul 5, 2024 at 3:13 AM Bruno Sánchez-Andrade Nuño < @.***> wrote:

Thanks everyone. I love that we are getting momentum here.

TLDR; So far I'm leaning for a

global run of the Sentinel-2 year composite, at 100px most recent year available with EG Sentinel-2 all-bands composite.

NAIP for CONUS. Latest, with 100px chip size too.

maybe? Selected locations (the training set?) to enable temporal and cross instrument studies.

maybe? Satellogic set

Released as CC-By (inheriting EG CC-By)

Still TBD format and adding what losses. Source imagery

Thanks @bengmstrong https://github.com/bengmstrong and EG team for the data release. It seems to fit perfectly. Besides the files https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/ and the STAC endpoint https://stac.earthgenome.org/, this blog post explains the method https://medium.com/radiant-earth-insights/announcing-public-access-to-our-global-cloud-free-imagery-archive-25b33dc675ec .

It meets the criteria of:

Fully open license. (CC-By)

Global. (there are mentions of "errors", should be get a black-list of these and run them when fixed? I've spotted checked and I only see the usual hard places like permanent clouds locations)

Recent (2023) (only global open composite this recent).

Notes:

This seems to be a median reduction of the "best" 16 scenes per location. What "best" mean? (least cloudy of the ~35/year?)

Chip Size

Boils down to 50px of 100px in my opinion. Costs grows quadratic since its an area. Also very small areas approach the patch size of 8px which means less chances for the self attention to learn about the surroundings.

Since we use the average, we can recreate embeddings at bigger chip sizes just averaging the smaller ones. It won't be the same sine the patches done with smaller chip sizes will not have paid "self-attention" to the patches outsize of that small chip.

It worries me that an area of 1km^2 is substantially big for many potential uses, limiting the usefulness of this large and expensive run, but doing smaller sizes is too expensive. we can do smaller chip sizes for selected places. Cost estimates

From our "build" workers (the ones on the co-code app, which we might or might not use for this run), we see that in reality we are getting there ~10k chips/h/worker (we use H100 GPUs, so this the approach of a big GPU with a large batch than a cheaper GPU or CPUs). A worker costs $1800/month ( $2.5$/h). Most of the time is spent on downloading, so chip size doesn't seem to be a strong factor.

This would mean 4k chips/$/h/worker.

Uncanny we get pretty much the same result that the napkin exercise (we should become consultants).

chip size (x10m/px) cost unit (km^2/$/h/worker) cost to run the world 50x50 px 1000 $510K 100x100 px 4000 $127K

50px is too expensive, 100px is doable. I'm hopefully @yellowcap https://github.com/yellowcap optimism is true for this run. Let's just start and assess what coverage/$ we get.

— Reply to this email directly, view it on GitHub https://github.com/Clay-foundation/model/issues/277#issuecomment-2210604119, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABEHUMVDHPFGTLYJBF7BKLZKZWUNAVCNFSM6AAAAABJFDYINOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJQGYYDIMJRHE . You are receiving this because you were mentioned.Message ID: @.***>

brunosan commented 2 months ago

Update here. We are going to do another v1 training run before the global embeddings run. Follow #283 for details.

brunosan commented 1 month ago

Some update with @yellowcap. We are getting ready building the pipelines and testing the Earth Genome Sentinel-2 composites data:

Data comes in Web mercator projection, which is certainly great for map visualization, but it comes at a cost in terms of data to process. We've trained Clay in a projection that keep GSD across the tile. In Web mercator e.g. Norway is about 77% more pixels than the same near the equator. More pixel for the same feature. Not sure how Embeddings will suffer classifying same object in high latitudes (something to check).
The projection change also implies that there is a nodata boundary around each scene, and that the scene edge are not exacltly horizontal / vertical.

The nominal resolution of the pixels is 9.8 and 19.1 for the 10m and 20m bands (i.e. the resolution of web mercator zoom levels 14 and 13). But this is not the real resolution if you go away from the equator, hence the changes in nr of pixels. So when using a 256x256 pixel image for ML application, one is looking at different sized areas in reality.
In the STAC items some property of the proj extension are missing, for example the proj:shape property, which is required for stacchip. We can work around this. (CC @bengmstrong)

noahgolmant commented 1 week ago

Hi all, has there been validation of any version or checkpoint of this model on an existing benchmark suite such as GEO-Bench? If not what are the major blockers? It seems valuable to do this prior to any global embeddings run because the embeddings cannot be used to run those benchmarks post hoc. And if the benchmark metrics are poor then the embeddings would likely not be very good.

brunosan commented 1 day ago

Clay-foundation / model