yellowcap closed this 2 months ago
🎯 There are MANY layers and I'm truly excited to add them. Let's always ensure we do not use any data that is not fully open. Not even from automated benchmarking. This way we do not carry over any restriction for our models.
cc @lauracchen
I think the Joint Research Centre Data Catalogue could be helpful! https://data.jrc.ec.europa.eu/dataset
I know that the European Commission's Joint Research Centre global surface water dataset was useful for SkyTruth's mask creation for Amazon Mining Watch, to differentiate between mines and water, which often look very similar.
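To make the masking idea concrete, here is a minimal sketch, not SkyTruth's actual method. It assumes a water layer expressed as an occurrence percentage (0-100) per pixel, already aligned to the same grid as the detector output. The function name `mask_water`, the toy grids, and the threshold of 50 are all hypothetical and only for illustration.

```python
# Hypothetical illustration: drop mine candidates that fall on known surface water.
# Assumes both grids are aligned (same shape, same pixel footprint).

def mask_water(candidates, water_occurrence, threshold=50):
    """Return a copy of `candidates` with water pixels set to 0.

    candidates: 2D list of 0/1 flags from a (hypothetical) mine detector.
    water_occurrence: 2D list of percentages (0-100) describing how often
        each pixel was observed as water.
    threshold: occurrence at or above which a pixel is treated as water.
        The default of 50 is an arbitrary assumption.
    """
    if len(candidates) != len(water_occurrence):
        raise ValueError("grids must have the same number of rows")
    masked = []
    for cand_row, water_row in zip(candidates, water_occurrence):
        if len(cand_row) != len(water_row):
            raise ValueError("grids must have the same number of columns")
        masked.append([
            0 if water >= threshold else flag
            for flag, water in zip(cand_row, water_row)
        ])
    return masked


# Toy 2x3 grids: two of the four candidate pixels sit on frequent water.
candidates = [
    [1, 1, 0],
    [0, 1, 1],
]
water_occurrence = [
    [0, 90, 10],
    [0, 20, 75],
]
result = mask_water(candidates, water_occurrence)
```

In practice the two rasters would first need to be reprojected and resampled to a common grid, and the threshold would be tuned per region.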
We can also potentially leverage imagery datasets that were designed for machine learning, but not necessarily for foundation models. The following reference contains an extensive list
Where did we land on which data sources would be incorporated?
@brunosan @danhammer @averycohn Avery was bringing up the fact that there are potentially many oceans-related applications (and funders) that we might want to pursue. Should we consider incorporating an initial attempt for covering oceans for model v1? Or would we just want to create a more oceans-focused use case that falls close to coastlines so that Clay's current geographic coverage with Sentinel would cover it? Or is this just too far out of the scope for now?
I would probably separate the two at first. For oceans, the data sources are quite different and the type of thing one would be looking for as well. We are planning to do the v1 pipeline in a way that should be adaptable to other contexts. So we could plan for an ocean model if we list ocean data sources. But for the v1 release I would not mix the two.
Hi, I was experimenting with the model to get embeddings and was wondering whether Landsat 5 or 7 images can be used with v1 to get the embeddings. They might not be part of the data module, but what if I have a GeoTIFF for Landsat images? Would that work?
Hi @ritwikvashistha, yes, that should be possible for v1. But v1 is still in development; we are targeting a release in May.
@lauracchen ... there are potentially many oceans-related applications (and funders) that we might want to pursue. Should we consider incorporating an initial attempt for covering oceans for model v1? Or would we just want to create a more oceans-focused use case that falls close to coastlines so that Clay's current geographic coverage with Sentinel would cover it? Or is this just too far out of the scope for now?
Most of the open data sources we use either don't map the open ocean or do so at very low resolution. On the other hand, I would guess that the semantics of the ocean surface are marginal additions to those of land.
My take is that, with minimal extra effort, we can add all the coastal data available on Sentinel/Landsat, and MODIS at low res. I do not know to what degree this will be enough for many applications, but I suspect it can cover most coastal applications.
Adding open ocean to the scope would be, unless @yellowcap corrects me, a level of effort beyond v1.
For an overview, this is the Sentinel coverage, which includes substantially all or part of the continental shelves.
@brunosan thanks! To be clear, does this mean that @yellowcap should update the v1 sampling to include more coastline / continental shelves?
We have Sentinel-2, Landsat, and NAIP as of today. Will add NZ high res and Sentinel-1 next. After that maybe MODIS and Sentinel-3 and others.
To run the model on a more diverse universe of data, we need to list and prioritize new sources. An initial list is the following:
SAR
For creating embeddings