Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

List and prioritize additional data sources for v1 #128

Closed: yellowcap closed this issue 2 months ago

yellowcap commented 6 months ago

To run the model on a more diverse universe of data, we need to list and prioritize new sources. An initial list is the following:

SAR

For creating embeddings

brunosan commented 6 months ago

🎯 There are MANY layers and I'm truly excited to add them. Let's always ensure we do not use any data that is not fully open, not even from automated benchmarking. This way we do not carry over any restrictions into our models.

cc @lauracchen

yellowcap commented 6 months ago

Moving data into spreadsheet.

Link for editing (only Clay team)

Link for public download

fmacchiavello commented 5 months ago

I think the Joint Research Centre Data Catalogue could be helpful: https://data.jrc.ec.europa.eu/dataset

I know that the European Commission's Joint Research Centre global surface water dataset was useful for SkyTruth's mask creation for Amazon Mining Watch, to differentiate between mines and water, which often look very similar.

yellowcap commented 4 months ago

We can also potentially leverage imagery datasets that were designed for machine learning, but not necessarily for foundation models. The following reference contains an extensive list:

https://captain-whu.github.io/DiRS/

lauracchen commented 4 months ago

Where did we land on which data sources would be incorporated?

@brunosan @danhammer @averycohn Avery was bringing up the fact that there are potentially many oceans-related applications (and funders) that we might want to pursue. Should we consider incorporating an initial attempt for covering oceans for model v1? Or would we just want to create a more oceans-focused use case that falls close to coastlines so that Clay's current geographic coverage with Sentinel would cover it? Or is this just too far out of the scope for now?

yellowcap commented 4 months ago

I would probably separate the two at first. For oceans, the data sources are quite different and the type of thing one would be looking for as well. We are planning to do the v1 pipeline in a way that should be adaptable to other contexts. So we could plan for an ocean model if we list ocean data sources. But for the v1 release I would not mix the two.

ritwikvashistha commented 4 months ago

Hi, I was experimenting with the model to get embeddings and was wondering whether Landsat 5 or 7 images can be used with v1 to get embeddings. They might not be part of the data module, but what if I have a GeoTIFF of Landsat images? Would that work?

yellowcap commented 4 months ago

Hi @ritwikvashistha, yes, that should be possible for v1. But v1 is still in development; we are targeting a release in May.

brunosan commented 4 months ago

@lauracchen ... there are potentially many oceans-related applications (and funders) that we might want to pursue. Should we consider incorporating an initial attempt for covering oceans for model v1? Or would we just want to create a more oceans-focused use case that falls close to coastlines so that Clay's current geographic coverage with Sentinel would cover it? Or is this just too far out of the scope for now?

Most of the data sources we use, all open data, either don't map the open ocean or do so only at very low resolution. On the other hand, I would guess that the semantics of the ocean surface add only marginally to those of land.

My take is that, with minimal extra effort, we can add all the coastal data available on Sentinel/Landsat, plus MODIS at low res. I do not know to what degree this will be enough for many applications. I suspect it can cover most coastal applications.

Adding the open ocean to the scope would be, unless @yellowcap corrects me, a level of effort beyond v1.

brunosan commented 4 months ago

For an overview, this is the Sentinel coverage, which includes substantially all or part of the continental shelves.

[image: Sentinel coverage map]

lauracchen commented 4 months ago

@brunosan thanks! To be clear, does this mean that @yellowcap should update the v1 sampling to include more coastline / continental shelves?

brunosan commented 3 months ago

We have Sentinel-2, Landsat, and NAIP as of today. We will add NZ high res and Sentinel-1 next. After that, maybe MODIS, Sentinel-3, and others.

yellowcap commented 2 months ago

The list of data we'll use for v1 is as follows, with chip counts from the stacchip index as of today:

| Platform | Chip count |
| --- | --- |
| naip | 20984171 |
| linz | 3299006 |
| sentinel-2-l2a | 18683945 |
| landsat-c2l1 | 5827333 |
| landsat-c2l2-sr | 5790651 |
| sentinel-1-rtc | 16133394 |
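
To get a feel for how the v1 sample is balanced across platforms, the chip counts above can be totalled and turned into shares. This is only a rough sketch: the numbers are copied from the list above, while the names `CHIP_COUNTS` and `chip_shares` are made up for illustration and are not part of stacchip or the Clay codebase.

```python
# Chip counts copied from the list above (stacchip index at the time of the comment).
CHIP_COUNTS = {
    "naip": 20984171,
    "linz": 3299006,
    "sentinel-2-l2a": 18683945,
    "landsat-c2l1": 5827333,
    "landsat-c2l2-sr": 5790651,
    "sentinel-1-rtc": 16133394,
}


def chip_shares(counts):
    """Return (total, {platform: fraction of total}) for a dict of chip counts.

    Hypothetical helper for illustration only.
    """
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = chip_shares(CHIP_COUNTS)
print(f"total chips: {total:,}")

# Print platforms from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {CHIP_COUNTS[name]:>12,} {share:6.1%}")
```

Run as-is, this reports a total of 70,718,500 chips, with naip as the largest single source.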