davidfrantz / force

Framework for Operational Radiometric Correction for Environmental monitoring
GNU General Public License v3.0

Sentinel-2 downloader does not download images newer than August 28, 2024 #334

Open davidfrantz opened 1 week ago

davidfrantz commented 1 week ago

The problem

The Sentinel-2 download tool force-level1-csd downloads images from Google Cloud Storage (GCS).

To do so, it first downloads a big CSV table that holds all the metadata. The data is then filtered locally, which has the big advantage of allowing for very complex AOI vectors, and circumvents the paging restrictions of the usual APIs (e.g., OData).

Unfortunately, this CSV table has not been updated since August 28, 2024. Data is still being ingested into GCS, but the way to retrieve the metadata has changed, which renders force-level1-csd partially broken (at least for newer data).

I opened an issue on Google's tracker: https://issuetracker.google.com/issues/369223578

Potential solutions

Apparently, the solution is to change the query to use a BigQuery table.
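For illustration, Google hosts a public BigQuery index of the Sentinel-2 bucket; below is a minimal sketch of what such a query could look like. The table reference and column names reflect my understanding of the public cloud_storage_geo_index dataset and should be treated as assumptions:

# minimal sketch, assuming the public index table and its column names;
# requires `pip install google-cloud-bigquery` and an authenticated
# Google Cloud project with the BigQuery API enabled
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT granule_id, sensing_time, mgrs_tile, cloud_cover, base_url
FROM `bigquery-public-data.cloud_storage_geo_index.sentinel_2_index`
WHERE mgrs_tile = '32UMV'  -- assumed tile-id format
  AND sensing_time > TIMESTAMP '2024-08-28'
"""

for row in client.query(query).result():
    print(row.granule_id, row.base_url)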

That said, there is some urgent need for

a) an alternative, e.g., developing a new downloader for CDSE, or
b) potentially switching to BigQuery

Option a would be quite some effort, and I believe very complex AOI vectors would be difficult to handle. On the other hand, it would be the "official" way of obtaining the data.

Unfortunately, I am not familiar with BigQuery, nor with how much effort switching to it would be. I also don't know if there are other downsides to it...

This issue serves as a discussion on how to proceed and on which option is the most feasible - I am also open to other solutions.

I am mentioning some people I have been in touch with on this topic to include you here: @vudongpham @ernstste @geo-masc

Cheers, David

A note on the CODE-DE Data Cube

PS: for the German Data Cube on CODE-DE, we switched to an ad-hoc solution that scans the file system for newly available L1 data. That said, the CODE-DE datacube is still up-to-date!

ernstste commented 1 week ago

I looked into BigQuery after hearing that the CSV files were not going to be updated anymore. Queries are billed by the amount of data they process. There is a free tier that provides 1 TB/month. While this should be enough for many applications, it will not meet the requirements of all users. An example query for 4 tiles over Germany (full archive, no cloud cover limit) processed about 6 GB of data. I'm assuming that implementing BigQuery in L1CSD wouldn't be too much of a hassle, but users would have to fiddle with their accounts to activate BigQuery, and the limits might be restrictive for some applications. Not the worst solution, but certainly not the best either.
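As a side note, the billed volume can be checked up front without using any quota via BigQuery's dry-run mode; a minimal sketch, reusing the assumed query from above:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# a dry-run job is only planned, never executed, so nothing is billed
job = client.query(query, job_config=job_config)
print(f"query would process {job.total_bytes_processed / 1e9:.2f} GB")

Since BigQuery storage is columnar, selecting only the columns L1CSD actually needs should also reduce the processed volume.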

davidfrantz commented 6 days ago

Yeah, I was afraid this was tied to some billing plan. Billing for metadata, though, is strange.

What would happen when the quota is exceeded? Would it just stop working or do the users receive a bill?

And do you know how much data a query for the whole of Germany would process?

geo-masc commented 6 days ago

Hi all,

I must admit that I did not dive too deep into the whole billing issue. But I must say that it was not too complicated to set up a BigQuery project on the Google Cloud page. This enabled me to test some initial SQL queries and to create an updated S-2 metadata CSV, which has the same structure and information as our "metadata_sentinel2.csv" file.

Together with @felixlobert, we then set up a Google Cloud Docker image following this documentation, which can be used to call BigQuery on GCS from your machine.

Before you can run it, you need to authenticate with your Google account, which is also explained in the documentation. Now we can run the Docker container (including the SQL query) using cron and update the metadata file needed by force-level1-csd.

This is probably not the handiest solution (the initial setup takes some time), the query can surely be optimized / generalized (it currently contains, e.g., a hard-coded AOI), and it might be charged for at some point (so far I did not need to share my credit card details). But for now it seems to be a stable solution that works for us and keeps the datacube for Germany up to date.

Looking forward to your feedback!

udpate-metadata-csd-bq.txt
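The attached query aside, the core of such a cron-driven update could boil down to a few lines of Python; a rough sketch, where the output file name and column selection are assumptions modeled on the existing metadata_sentinel2.csv:

from google.cloud import bigquery

client = bigquery.Client()

# assumed column set mirroring the old CSV index; adjust to whatever
# force-level1-csd actually expects (see the attached query)
query = """
SELECT granule_id, mgrs_tile, sensing_time, total_size, cloud_cover, base_url
FROM `bigquery-public-data.cloud_storage_geo_index.sentinel_2_index`
"""

# to_dataframe() additionally requires pandas (and db-dtypes)
df = client.query(query).result().to_dataframe()
df.to_csv("metadata_sentinel2.csv", index=False)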

vudongpham commented 6 days ago

Hey @geo-masc, you can convert the AOI file into a WKT string and use that in the query, using geopandas and shapely. Here is the script:

import geopandas as gpd
from shapely.ops import transform

def convert_polygon_to_WKT(aoi_path):
    def drop_z(geometry):
        # BigQuery geographies are 2D, so strip the Z coordinate if present
        if geometry.has_z:
            return transform(lambda x, y, z=None: (x, y), geometry)
        return geometry

    aoi = gpd.read_file(aoi_path)

    # make sure the geometries end up in WGS84 (EPSG:4326); note that
    # set_crs()/to_crs() return a copy and do not modify in place
    if aoi.crs is None:
        aoi = aoi.set_crs(epsg=4326)
    else:
        aoi = aoi.to_crs(epsg=4326)

    # merge all features into a single (multi)polygon
    aoi = aoi.dissolve()

    aoi["geometry"] = aoi["geometry"].apply(drop_z)

    # WKT representation of the dissolved geometry
    return aoi.geometry.iloc[0].wkt

aoi_path = '/path/to/your/vector_file'
wkt_string = convert_polygon_to_WKT(aoi_path)

In the query:

WHERE
...
-- ST_INTERSECTS takes two geographies: the AOI and the scene footprint
-- (the GCS index table has no geography column, so the footprint would
-- have to be constructed, e.g., from its bounding-box columns)
ST_INTERSECTS(ST_GEOGFROMTEXT(wkt_string), scene_footprint)
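Getting the Python wkt_string into the query is probably best done with BigQuery query parameters instead of string pasting; a minimal sketch (the parameter name @wkt is arbitrary, and scene_footprint is the same placeholder as above):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("wkt", "STRING", wkt_string)
    ]
)

# the query then references the parameter as @wkt, e.g.
# ST_INTERSECTS(ST_GEOGFROMTEXT(@wkt), scene_footprint)
rows = client.query(query, job_config=job_config).result()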

I am working on a script to download from CDSE directly. They just announced that Sentinel-2 data will only come with the newest baseline from now on and that the old ones will be deleted (Info here). This could save us from filtering the new baseline ourselves, and I'm not sure how the data situation will be on the GC.
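For reference, CDSE exposes an OData catalogue that can be queried over plain HTTP; a minimal search sketch follows (the endpoint and filter syntax follow the CDSE documentation as I understand it, so treat the exact parameters as assumptions):

import requests

# hypothetical search: Sentinel-2 L1C products intersecting the AOI,
# sensed after the date the GCS CSV index stopped updating
url = "https://catalogue.dataspace.copernicus.eu/odata/v1/Products"
params = {
    "$filter": (
        "Collection/Name eq 'SENTINEL-2' "
        "and contains(Name,'MSIL1C') "
        f"and OData.CSC.Intersects(area=geography'SRID=4326;{wkt_string}') "
        "and ContentDate/Start gt 2024-08-28T00:00:00.000Z"
    ),
    "$top": "100",
}

resp = requests.get(url, params=params)
resp.raise_for_status()
for product in resp.json().get("value", []):
    print(product["Name"])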

ernstste commented 5 days ago

One possible workaround would of course be rewriting the force-level1-csd --update functionality to just pull the BigQuery table. That way, everything downstream stays the same. This would need a bit of trying/testing, since we made sure our Docker images don't have the whole Google SDK installed but rather just the Python gsutil (i.e., authentication and interfacing with BQ might be a bit more tedious). Otherwise, this might be the most straightforward workaround for now.
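On the authentication point: the standalone google-cloud-bigquery Python package works without the full SDK, and a service-account key file would be one way to authenticate non-interactively; a minimal sketch (the key-file path is hypothetical):

from google.cloud import bigquery

# non-interactive authentication via a service-account key file,
# avoiding the full Google Cloud SDK and the interactive gcloud auth flow
client = bigquery.Client.from_service_account_json("/path/to/key.json")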

geo-masc commented 5 days ago

Thanks @vudongpham for this suggestion. In our case this is not necessary, as the AOI (a list of MGRS tiles, respectively) is defined in L1CSD. We just restricted the query spatially to Germany, so that the CSV does not contain metadata for the whole globe.

Sounds great that you are already working on a CDSE extension. Looking forward!

@ernstste this is what we are currently doing but of course with an additional docker. Would be great to include this in L1CSD -u directly.

davidfrantz commented 5 days ago

Hi all,

thanks for chiming in!

@ernstste, it would be really great if you could integrate an approach like the one @geo-masc and @felixlobert developed. Would you need additional dependencies in the base image? I guess it would also be a good idea to write some sort of warning to stdout regarding the possibility of being billed - or even to add a mandatory "I know what I am doing" getopt option?

@vudongpham, this sounds awesome. Do you have plans for releasing this once it is finished? Would you be okay with us integrating the tool into FORCE when the time comes?

Cheers, David

geo-masc commented 5 days ago

Quick update. Apparently, BigQuery also returns some additional granules with the defined query (e.g., "S2A_OPER_MSI_L1C_TL_EPA__20170507T091532_A002107_T32UMV_N02.04"). Thus, we adjusted the query so that we now only get the relevant data in the exact format needed by L1CSD. Maybe something to build on, @ernstste. udpate-metadata-csd-bq.txt
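Such legacy OPER-style granule IDs could, e.g., be excluded with a regular-expression filter in the WHERE clause; a sketch (the exact pattern is an assumption, the attached query is authoritative):

# assumed additional WHERE clauses to keep only current-format L1C
# products and drop legacy granules like S2A_OPER_MSI_L1C_TL_EPA__...
extra_filter = """
AND REGEXP_CONTAINS(product_id, r'^S2[AB]_MSIL1C_')
AND NOT REGEXP_CONTAINS(granule_id, r'^S2[AB]_OPER_')
"""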

vudongpham commented 2 days ago

Hi all,

I created a repo with scripts to search and download from CDSE; have a look and give it a try: https://github.com/vudongpham/CDSE_Sentinel2_downloader

A Docker image is available:

docker pull vudongpham/cdse-s2

I tried to mimic the landsatlinks commands from @ernstste, though it might not be as detailed. @davidfrantz, please feel free to test and modify it. And I would be very happy if you consider integrating it into FORCE ;)