Run data collection for Clay v0.2

Clay-foundation / model

The Clay Foundation Model (in development)

https://clay-foundation.github.io/model/

Apache License 2.0


Run data collection for Clay v0.2 #142

Closed yellowcap closed 4 months ago

yellowcap commented 5 months ago

We can use the current pipeline, but probably with the following changes:

- Reduce chip size to 256x256 pixels.
- Add more time steps (can do all available years).
- Increase the number of MGRS tiles by 3-5 times.
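
As a rough illustration of what the smaller chip size means per tile, here is a minimal sketch that cuts a raster into 256x256 windows. It assumes chips are taken on a regular, non-overlapping grid and that partial chips at the edges are dropped; the actual pipeline may do this differently. The function name `chip_windows` is hypothetical, and the 10980-pixel tile width (a Sentinel-2 MGRS tile at 10 m) is used only as an example input.

```python
def chip_windows(width, height, chip_size=256):
    """Return (col_off, row_off, chip_size, chip_size) windows on a regular grid.

    Assumption: partial chips at the right and bottom edges are dropped.
    """
    windows = []
    for row_off in range(0, height - chip_size + 1, chip_size):
        for col_off in range(0, width - chip_size + 1, chip_size):
            windows.append((col_off, row_off, chip_size, chip_size))
    return windows


# Example input only: a 10980 x 10980 pixel tile gives 42 x 42 full chips.
windows = chip_windows(10980, 10980)
print(len(windows))  # 1764
```

Each window tuple could then be passed to whatever raster reader the pipeline uses to extract the chip.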

Regarding the MGRS tile increase, the question is whether we want to change the ratio of landcover classes in the input. I discussed with @srmsoumya yesterday that we should maybe increase the fraction of the landcover classes with a human footprint, i.e. Urban and Agriculture. Presumably that is what users will be most interested in for search, so increasing their fraction would give them more weight.
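
One way the reweighting could look, as a sketch only: allocate a per-class quota proportional to a weight and draw tiles at random within each class. The function `sample_tiles`, the class names, the weights, and the tile IDs are all hypothetical and are not taken from the existing pipeline.

```python
import random


def sample_tiles(tiles_by_class, n_total, weights, seed=0):
    """Draw tiles with per-class quotas proportional to the given weights.

    A class never contributes more tiles than it has available.
    """
    rng = random.Random(seed)
    total_weight = sum(weights[cls] for cls in tiles_by_class)
    sample = []
    for cls, tiles in tiles_by_class.items():
        quota = round(n_total * weights[cls] / total_weight)
        quota = min(quota, len(tiles))
        sample.extend(rng.sample(tiles, quota))
    return sample


# Hypothetical candidate tiles, 10 per landcover class.
classes = ["urban", "agriculture", "forest", "water"]
tiles_by_class = {cls: [f"{cls}-{i}" for i in range(10)] for cls in classes}

# Hypothetical weights: human-footprint classes count three times as much.
weights = {"urban": 3, "agriculture": 3, "forest": 1, "water": 1}

sample = sample_tiles(tiles_by_class, n_total=8, weights=weights)
counts = {cls: sum(t.startswith(cls) for t in sample) for cls in classes}
print(counts)  # {'urban': 3, 'agriculture': 3, 'forest': 1, 'water': 1}
```

With these weights, Urban and Agriculture together make up 6 of the 8 sampled tiles instead of the 4 an even split would give.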

brunosan commented 5 months ago

I suspect we are also dropping a very substantial share of inputs because a single no-data pixel invalidates the whole set. https://github.com/Clay-foundation/model/blob/ae70345395bb541403fda295dadb04fd2b3e191d/scripts/tile.py#L39-L51

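import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Area of interest: a single bounding box in WGS84 (EPSG:4326).
aoi = gpd.GeoDataFrame(
    pd.DataFrame(["CDL Test Region"], columns=["Region"]),
    crs="EPSG:4326",
    geometry=[box(-92.30926, 32.17581, -90.01114, 38.63658)],  # box(minx, miny, maxx, maxy): lower-left and upper-right corners
)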

See: https://github.com/Clay-foundation/office/issues/170#issuecomment-1914173261
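
To make the no-data concern concrete, here is a minimal sketch comparing a strict filter (any no-data pixel rejects the chip) with a tolerant one (a small no-data fraction is allowed). It is not the code in `tile.py`; the helper names, the no-data value of 0, and the 25% threshold are assumptions chosen for illustration.

```python
def nodata_fraction(chip, nodata=0):
    """Fraction of pixels in a 2D chip (list of rows) equal to the no-data value."""
    pixels = [value for row in chip for value in row]
    return sum(1 for value in pixels if value == nodata) / len(pixels)


def keep_chip(chip, nodata=0, max_nodata_fraction=0.0):
    """Keep a chip if its no-data fraction does not exceed the threshold.

    The default threshold of 0.0 is the strict case: one no-data pixel rejects the chip.
    """
    return nodata_fraction(chip, nodata) <= max_nodata_fraction


clean = [[5, 7], [9, 3]]
one_bad = [[5, 0], [9, 3]]      # a single no-data pixel
mostly_bad = [[0, 0], [0, 3]]   # mostly no-data

chips = (clean, one_bad, mostly_bad)
strict = [keep_chip(c) for c in chips]
tolerant = [keep_chip(c, max_nodata_fraction=0.25) for c in chips]
print(strict)    # [True, False, False]
print(tolerant)  # [True, True, False]
```

Counting how many chips flip from rejected to kept under a small threshold would give a quick estimate of how much input the strict rule is discarding.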

brunosan commented 5 months ago

For the lat/lon coordinate embeddings to capture the intended global structure, I believe the training set must have full global coverage. In my opinion that means adding full coverage from MODIS, either as composites or as raw images from several dates.

Perhaps even train first on MODIS only, to warm up general lat/lon embeddings?

yellowcap commented 5 months ago

For Clay v0.2 we are not planning to change the input platforms. Adding MODIS would require changes in architecture. The idea for v0.2 was to use the same data sources but with a much larger sample.

yellowcap commented 4 months ago

Ran data collection with code from https://github.com/Clay-foundation/model/pull/173

2535 MGRS tiles were processed successfully; the data sits in s3://clay-tiles-04-sample-v02