Closed by yellowcap 4 months ago
I suspect we are also dropping a very substantial share of inputs due to a single no-data
pixel invalidating the whole set.
https://github.com/Clay-foundation/model/blob/ae70345395bb541403fda295dadb04fd2b3e191d/scripts/tile.py#L39-L51
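To illustrate the concern, here is a minimal sketch comparing an all-or-nothing no-data rule with a tolerant one. It is not the code in `tile.py`: the `NODATA` value, the function names, the 1% threshold, and the toy 512 x 512 tiles are all assumptions made for the example.

```python
import numpy as np

NODATA = 0  # assumed no-data marker; the real pipeline may use a different value


def nodata_fraction(tile: np.ndarray, nodata: float = NODATA) -> float:
    """Return the share of pixels in `tile` equal to the no-data value."""
    return float(np.mean(tile == nodata))


def keep_strict(tile: np.ndarray) -> bool:
    """Strict rule: reject the tile if any pixel is no-data."""
    return nodata_fraction(tile) == 0.0


def keep_tolerant(tile: np.ndarray, max_fraction: float = 0.01) -> bool:
    """Tolerant rule (hypothetical): accept up to `max_fraction` no-data pixels."""
    return nodata_fraction(tile) <= max_fraction


# Toy tiles: fully valid, one bad pixel, and half no-data
clean = np.ones((512, 512))
one_bad = clean.copy()
one_bad[0, 0] = NODATA
half_bad = clean.copy()
half_bad[:256, :] = NODATA

tiles = [clean, one_bad, half_bad]
strict_kept = sum(keep_strict(t) for t in tiles)
tolerant_kept = sum(keep_tolerant(t) for t in tiles)
print(f"strict keeps {strict_kept}, tolerant keeps {tolerant_kept} of {len(tiles)}")
```

Running both rules over the real tile candidates would show how many inputs the strict rule actually costs before deciding whether a threshold is worth it.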
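import geopandas as gpd
import pandas as pd
from shapely.geometry import box
# Bounding box for the CDL test region, in lon/lat (EPSG:4326)
aoi = gpd.GeoDataFrame(
    pd.DataFrame(["CDL Test Region"], columns=["Region"]),
    crs="EPSG:4326",
    geometry=[box(-92.30926, 32.17581, -90.01114, 38.63658)],  # box(minx, miny, maxx, maxy): lower-left and upper-right corners
)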
See: https://github.com/Clay-foundation/office/issues/170#issuecomment-1914173261
For the lat/lon coordinate embeddings to capture the intended global structure, I believe we must include full global coverage in the training set, which in my opinion means adding full coverage from MODIS, either as a composite or as raw images from several dates.
Perhaps even train first on MODIS only, to warm up general lat/lon embeddings?
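One rough way to check how far a training sample is from global coverage is to bin the tile centroids into a coarse lat/lon grid and count the occupied cells. The sketch below is only an illustration: the function name, the 30-degree cell size, and the example points are assumptions, and equal-angle cells ignore both area distortion and the fact that most cells are ocean.

```python
def coverage_fraction(points, cell_deg=30.0):
    """Share of lat/lon grid cells that contain at least one sample point.

    `points` is an iterable of (lat, lon) pairs in degrees.
    """
    n_lat = int(180 / cell_deg)
    n_lon = int(360 / cell_deg)
    occupied = set()
    for lat, lon in points:
        # Clamp so that lat = 90 and lon = 180 fall into the last cell
        row = min(int((lat + 90) // cell_deg), n_lat - 1)
        col = min(int((lon + 180) // cell_deg), n_lon - 1)
        occupied.add((row, col))
    return len(occupied) / (n_lat * n_lon)


# Two hypothetical centroids inside a single region share one cell
regional = [(35.0, -91.0), (36.0, -90.5)]
# Adding a hypothetical point on another continent occupies a second cell
wider = regional + [(48.8, 2.3)]

print(coverage_fraction(regional))
print(coverage_fraction(wider))
```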
For Clay v0.2 we are not planning to change the input platforms. Adding MODIS would require changes to the architecture. The idea for v0.2 was to use the same data sources but with a much larger sample.
Ran data collection with code from https://github.com/Clay-foundation/model/pull/173
We have 2535 MGRS tiles successfully processed; the data sits in s3://clay-tiles-04-sample-v02
We can use the current pipeline, but probably with the following changes:
Regarding the MGRS tile increase, the question is whether we want to change the ratio of the inputs. I discussed with @srmsoumya yesterday that we should maybe increase the fraction of the landcover classes with a human footprint, i.e. Urban and Agriculture. Presumably that is what users will be most interested in for search, so we could increase the fraction of those classes to give them more weight.
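As a sketch of what reweighting could look like, the snippet below turns per-class weights into tile quotas and then draws tiles per class without replacement. Everything except the Urban and Agriculture class names is hypothetical: the weights, the other classes, the tile IDs, and the function names are placeholders, not the sampling code used for the existing tiles.

```python
import random


def class_quotas(weights, total):
    """Split `total` tiles across classes in proportion to `weights`.

    Rounding means the quotas can be slightly off `total` for some inputs.
    """
    weight_sum = sum(weights.values())
    return {name: round(total * w / weight_sum) for name, w in weights.items()}


def sample_by_class(tiles_by_class, quotas, seed=0):
    """Draw up to `quota` tiles per class, without replacement."""
    rng = random.Random(seed)
    picked = {}
    for name, quota in quotas.items():
        candidates = tiles_by_class.get(name, [])
        picked[name] = rng.sample(candidates, min(quota, len(candidates)))
    return picked


# Hypothetical weights: human-footprint classes count three times as much
weights = {"Urban": 3.0, "Agriculture": 3.0, "Forest": 1.0, "Water": 1.0}

# Hypothetical candidate tiles grouped by dominant landcover class
tiles_by_class = {
    "Urban": ["U1", "U2", "U3", "U4", "U5"],
    "Agriculture": ["A1", "A2", "A3", "A4"],
    "Forest": ["F1", "F2", "F3"],
    "Water": ["W1"],
}

quotas = class_quotas(weights, total=8)
picked = sample_by_class(tiles_by_class, quotas)
print(quotas)
```

With these weights, 6 of the 8 sampled tiles come from Urban and Agriculture; changing the ratio is then a matter of editing the `weights` dictionary.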