SimonFisher92 / Scottish_Snow


Caching logic for Sentinel image downloads #19

Closed by ipoole 7 months ago

ipoole commented 9 months ago

As downloading the image data can take a long time, with frequent pauses while glacial storage is accessed, it would be helpful if the code could skip downloads of data which already exists locally. This would make re-running the downloader more efficient.

A couple of questions come to mind ahead of looking at the download code in detail: first, is it possible to determine the local filename of a product before downloading it, so existing files can be skipped? And second, which branch should I be working from - is the downloader branch up to date with main?

murraycutforth commented 9 months ago

Hi Ian, sorry for the delay in replying. The second point is straightforward - the downloader branch was merged into main a while ago. I've just merged the latest changes from main back into it, so you're safe to check out that branch and start working from there.

On your first point, it is possible to find out the filename before downloading the full file.

Given an id for a particular product (a single tile, at a single time, with various resolutions and bands), where the ids are obtained like:

from datetime import date
from sentinelsat import SentinelAPI

# api is an authenticated SentinelAPI instance; footprint is a WKT geometry string
products = api.query(footprint,
                     date=('20151219', date(2015, 12, 29)),
                     platformname='Sentinel-2',
                     cloudcoverpercentage=(0, 30))

You can then get the metadata for each product as a Python dict, using:

for prod_id in products:
    # the returned dict includes a "title" key, which names the .SAFE directory
    metadata_dict = api.get_product_odata(prod_id)

Within this dict there is a "title" key, and the product is then downloaded into a directory called <title>.SAFE. But it gets complicated, because within this directory the actual imaging data is stored in paths like the one below, and depending on the filters passed to the download function, not all of the .jp2 files are necessarily downloaded:

S2B_MSIL2A_20231016T112119_N0509_R037_T30VVJ_20231016T124037.SAFE/GRANULE/L2A_T30VVJ_A034526_20231016T112115/IMG_DATA/R10m/T30VVJ_20231016T112119_B02_10m.jp2

You can see how the timestamps in this path aren't all exactly the same. I'm wary of us trying to re-implement the filename logic, since the sentinelsat library already works out this filename (see https://github.com/sentinelsat/sentinelsat/blob/main/sentinelsat/sentinel.py#L598) and, when the download function is called (see https://github.com/sentinelsat/sentinelsat/blob/main/sentinelsat/download.py#L42), the library already checks whether this path exists and skips the download if so.
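
As a rough illustration, this means a cache-aware re-run can just call the library's own download function again (a minimal sketch; the directory path is an assumed local data folder):

# sentinelsat's download() checks for a complete file at the target path
# and skips the transfer if one is already present
product_info = api.download(prod_id, directory_path='data/raw')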

However, in my recent attempts to download the full dataset I've found the code gets stuck, so some work is definitely needed here. Will report in a separate issue.

ipoole commented 9 months ago

Hey, thanks Murray for this helpful overview. I've made a branch, '19-caching-sentinel-downloads', to work on this. Cheers.

murraycutforth commented 9 months ago

No worries @ipoole. I made a few commits related to this on the downloader branch while investigating why my data download would fail, and I've added you as a reviewer on the PR since it's related to this.

What I'm seeing is that when I run the download script now without any "product_filter", each tile is downloaded as a single zip file, and the sentinelsat library which we're using seems to handle caching correctly - I've tested that it will skip over existing files, and even resume and finish downloading an incomplete zip file.
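
For reference, a minimal sketch of that call (assuming the authenticated SentinelAPI instance and products dict from earlier; the directory path is illustrative):

# download_all() skips products already present in directory_path, resumes
# partial .zip files, and verifies completed downloads when checksum=True
downloaded, triggered, failed = api.download_all(
    products,
    directory_path='data/raw',
    checksum=True,
)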

ipoole commented 8 months ago

Hey @murraycutforth, thanks for this. I've switched to your latest 'downloader' branch and am now running it for the 10m RGB data (filter=*B0[234]_10m.jp). It seems to be running nicely, so I'll leave it going while I go watch some crap TV (I have rather slow internet out here in the sticks!). I'll then check the caching behaviour. It's looking like you might have nailed this issue!

ipoole commented 8 months ago

Correction - running with no product-filter, to get the .zip behaviour you outlined...

murraycutforth commented 8 months ago

Great @ipoole - well, all the caching is from the sentinelsat package, nothing to do with me! Glad it's also working for you. Something I'm starting to notice, now that I actually have a reasonable chunk of data, is that the file sizes are wildly different, and it seems like some acquisitions are mostly black, with only a corner of the tile having any data. The square tile which we download isn't always covered by the strip acquired by the satellite on a particular pass. I wonder if there's anything in the metadata which tells us this?
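
One way to flag these acquisitions without relying on the metadata would be to compute the nodata fraction directly from a downloaded band (a sketch, assuming rasterio with JP2 support is installed, the filename is illustrative, and 0 is the Sentinel-2 nodata value):

import rasterio
import numpy as np

# fraction of pixels with no data in one 10m band; values near 1.0
# indicate the tile was barely covered by the acquisition strip
with rasterio.open('T30VVJ_20231016T112119_B02_10m.jp2') as src:
    band = src.read(1)

nodata_fraction = float(np.mean(band == 0))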

Any ideas for useful ways to store this data? @SimonFisher92 and I had discussed applying for a small amount of funding to host it somewhere in the cloud. All of the data for this tile will probably come to a few hundred GB, although maybe we can compress that a lot by masking out everything outside the hills - see the sketch below. There will be another two tiles (maybe - I haven't checked) to download to cover the West Highlands as well.
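
A rough sketch of the masking idea (assuming rasterio is installed, aoi_geometry is a GeoJSON-style polygon covering the hills, and the filenames are illustrative):

import rasterio
from rasterio.mask import mask

# crop a band to the area of interest; everything outside the polygon
# becomes nodata, which compresses to almost nothing
with rasterio.open('T30VVJ_20231016T112119_B02_10m.jp2') as src:
    out_image, out_transform = mask(src, [aoi_geometry], crop=True)
    profile = src.profile

profile.update(driver='GTiff', height=out_image.shape[1], width=out_image.shape[2],
               transform=out_transform, compress='deflate')

with rasterio.open('T30VVJ_B02_10m_masked.tif', 'w', **profile) as dst:
    dst.write(out_image)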

ipoole commented 8 months ago

I've been running the download (no filter) for over 24 hours and it now seems stuck at 46 downloads (.zip files) plus one incomplete. @murraycutforth, how many .zip files constitute the full dataset for our area of interest? Btw, I've been running with num-threads=8, which seems to work OK and I believe is more efficient given the "Triggered retrieval from the Long Term Archive..." behaviour.

I see what you mean about the differing .zip files - in my set of 46, sizes range from 18 MB to 1.2 GB! I've had a look at a few images and indeed some of them seem to be black. It would be good to get a handle on this.

Once we have the downloaded .zip files, should we do a one-time expansion, or have code work directly with the .zip files?
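
For what it's worth, working directly with the .zip files looks possible, since GDAL can read a .jp2 straight out of an archive (a sketch, assuming rasterio's zip:// path support; the archive location is illustrative):

import rasterio

# open a band inside the zip without expanding the archive first
path = ('zip://data/raw/S2B_MSIL2A_20231016T112119_N0509_R037_T30VVJ_20231016T124037.zip!/'
        'S2B_MSIL2A_20231016T112119_N0509_R037_T30VVJ_20231016T124037.SAFE/GRANULE/'
        'L2A_T30VVJ_A034526_20231016T112115/IMG_DATA/R10m/T30VVJ_20231016T112119_B02_10m.jp2')

with rasterio.open(path) as src:
    band = src.read(1)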

ipoole commented 8 months ago

I just tried to run a download again - no filter. After some tracebacks, the following message is shown:

The Sentinels Scientific Data Hub

Copernicus Sentinel Data is now available on the Copernicus Data Space Ecosystem

https://dataspace.copernicus.eu

Is this a recent policy change? I've created an account on the linked site; it does seem useful.

murraycutforth commented 8 months ago

I've just noticed this as well - I've been attempting to download data one year at a time. This succeeded for 2023, but since then I've been getting LTA errors, and now as of today I also see this error!

It seems like the endpoint we've been using up to now (through the sentinelsat package) has been permanently shut down! It would have been nice if this had been signposted better (maybe I just missed it), because it seems we need to start from scratch using one of the APIs available through the Copernicus Data Space.

There's some chat about this on the sentinelsat repo: https://github.com/sentinelsat/sentinelsat/issues/583, https://github.com/sentinelsat/sentinelsat/issues/607

So I think we may need a total rethink of our data downloading approach!

murraycutforth commented 8 months ago

@SimonFisher92 just tagging so you see this as well!

murraycutforth commented 8 months ago

I need to read up on the Copernicus Data Space a bit more; maybe we can work with it directly. I noticed this package linked in one of the sentinelsat discussions as something which offers a Python API for downloading Sentinel data: https://github.com/CS-SI/eodag
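
For example, a rough sketch of an eodag query, based on its quickstart docs (the product type, bounding box, and dates are illustrative, and exact signatures should be checked against the eodag documentation):

from eodag import EODataAccessGateway

dag = EODataAccessGateway()

# search for Sentinel-2 L2A products over an illustrative bounding box
search_results = dag.search(
    productType='S2_MSI_L2A',
    geom={'lonmin': -5.5, 'latmin': 56.5, 'lonmax': -4.5, 'latmax': 57.5},
    start='2023-10-01',
    end='2023-10-31',
)

# eodag also skips products which already exist locally
paths = dag.download_all(search_results)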

murraycutforth commented 8 months ago

Some examples of using Python to download S2 L2A data via the "processing" API of Sentinel Hub (part of the new "Copernicus Data Space" system) are given here: https://documentation.dataspace.copernicus.eu/APIs/SentinelHub/Process/Examples/S2L2A.html

Doesn't look too bad to use.
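
To give a flavour, a heavily simplified sketch of a Process API request, adapted from the pattern in those docs (the token acquisition is omitted, and the bbox and evalscript are illustrative):

import requests

# `token` is assumed to be an OAuth2 access token from the Copernicus Data Space
url = 'https://sh.dataspace.copernicus.eu/api/v1/process'

request = {
    'input': {
        'bounds': {'bbox': [-5.5, 56.5, -4.5, 57.5]},
        'data': [{'type': 'sentinel-2-l2a'}],
    },
    'output': {'width': 512, 'height': 512},
    'evalscript': '''
        //VERSION=3
        function setup() {
            return {input: ['B02', 'B03', 'B04'], output: {bands: 3}};
        }
        function evaluatePixel(sample) {
            return [sample.B04, sample.B03, sample.B02];
        }
    ''',
}

response = requests.post(url, headers={'Authorization': 'Bearer ' + token}, json=request)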

SimonFisher92 commented 8 months ago

Hey guys, this is unfortunate indeed. Is this a complete API change then? If so, pretty shoddy of them - I'd like to just run this once a year to get new data, and I worry about this kind of thing.

ipoole commented 7 months ago

I propose to close this issue as no longer relevant, shout if you disagree.