c-scale-community / use-case-waterwatch


Data Staging #2

Open sustr4 opened 2 years ago

sustr4 commented 2 years ago

In the kick-off meeting we were discussing the possibility of having data unpacked, cropped and uploaded to a pre-agreed staging location. That's easy -- just give us a specification: what data do you want and where? In that context, though, I understand that proper metadata also needs to be provided, and I need more information on that:

  1. How should the metadata be made available? Local XML (or other format) files on disk? Or does there have to be a network endpoint with an API? Since we are putting the data on a FS, I'd prefer to do the same with metadata. Are there other options?
  2. It will be derived data (cropped if nothing else) so we will have to generate the metadata ourselves. Can you provide templates?
jdries commented 2 years ago

I can answer the metadata part: STAC item metadata would work. Storing it on disk next to the files is fine. As a template, you could have a look at what my colleague Dirk provided to WP2. It may not be perfect yet, but it would be a start. It's important that assets in the STAC metadata point to the actual data files (jpeg2000 or geotiff).
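For illustration, here's a minimal sketch of such a STAC item saved on disk next to the data, using pystac. The item id, footprint and file names below are invented placeholders; Dirk's template remains the reference where they differ:

```python
from datetime import datetime, timezone

import pystac

# Hypothetical item describing one cropped Sentinel-2 band
# (all IDs and paths below are made up for illustration).
item = pystac.Item(
    id="S2A_T33UVR_20220101_crop",
    geometry={
        "type": "Polygon",
        "coordinates": [[[14.0, 50.0], [15.0, 50.0], [15.0, 51.0],
                         [14.0, 51.0], [14.0, 50.0]]],
    },
    bbox=[14.0, 50.0, 15.0, 51.0],
    datetime=datetime(2022, 1, 1, tzinfo=timezone.utc),
    properties={},
)

# The asset href must resolve to the actual data file (jpeg2000 or geotiff).
item.add_asset(
    "B04",
    pystac.Asset(
        href="./T33UVR_20220101_B04_crop.tif",
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)

# Store the item JSON on disk next to the data files.
item.save_object(include_self_link=False,
                 dest_href="./S2A_T33UVR_20220101_crop.json")
```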

If Deltares can provide a file with reservoir polygons, you could try a first extraction.

sustr4 commented 2 years ago

It's important that assets in the STAC metadata point to the actual data files (jpeg2000 or geotiff).

Yes yes. I'm trying to explain elsewhere that although this (provisioning of stand-alone, unpacked image files) is not in scope for WP2 services, that does not mean that WP2 does not want to have anything to do with it. We can set up the workflow for you as part of the use case, that's fine. We just don't want to own (or advertise) it as a WP2 service, but rather as a solution anybody (outside or after C-SCALE) can replicate for their use case.

backeb commented 2 years ago

@Jaapel @jdries we need your help regarding specs for the workflow.

Generally, what we need for this use case is:

  1. Access to Sentinel-2 L1C near real time data for Czech Rep (i.e. the latest images)
  2. Sentinel data needs to be available in OpenEO backend
  3. Access to JRC water occurrence data

Questions from @sustr4

1. Access to Sentinel-2 L1C near real time data for Czech Rep (i.e. the latest images). What images? Surely not globally. You say CZ, so is that "all overlapping tiles"? Or do you have a list of UTM tiles you need? Or geographical coordinates? What bands? Infrared? TCI? Do you need RGB channels separately? Also, it was my understanding that you do not want to get the full 100x100 km images but rather want the water bodies of interest cropped out. Is that still true?

2. Sentinel data needs to be available in OpenEO backend. I can make sure Sentinel data are accessible somewhere in the cloud. I'm not sure about the OpenEO deployment, but I guess that should be routine for WP3, shouldn't it? What should it even look like, since we were talking about using notebooks? Do we have a notebook instance, an OpenEO machine and a data machine? Or is OpenEO included in the notebook?

3. Access to JRC water occurrence data: that's beyond my expertise, but point me to it and tell me what you need done with it. I don't think it will be excessively bulky, so if you need it staged somewhere, I wouldn't oppose copying it 1:1.

sustr4 commented 2 years ago

Thanks, @backeb, for conveying my questions. I have a request in addition to that. I welcome any kind of help registering the Sentinel (sub-)data in STAC (Resto). It is still a new protocol to me. But if there's no one comfortable with it among the use case representatives, I'll enlist the help of @sreimond-eodc .

sustr4 commented 2 years ago

Finally, I have one more comment to share. Perhaps it should have gone to the top. Please understand that making "just a local copy" of the full image base is contrary to the C-SCALE philosophy as well as best practice as I understand it. So if I have to do this local caching, I can justify it to myself only by downsizing the data and keeping only the subset that's actually going to be used. It seems justifiable to me to extract just what the use case needs and store that locally. That:

  1. Limits storage demands of the use case
  2. Maintains the C-SCALE story about not creating redundancy
  3. Actually speeds up subsequent processing

That's why I keep pressing all of you on this so much.

Jaapel commented 2 years ago

One addition to Björn's post: while developing the algorithm, I use a few months (let's say 3 months) of data (Sentinel-2 L1C) to verify the outcome of the algorithm.

For your last comment: as I am using openeo, I also have to use openeo for querying / loading data. So as an OpenEO user, I query the backend for collections (i.e. connection.list_collections()) and check the collection metadata (i.e. connection.describe_collection("S2_L1C")). So per use case, we could load in just the extent that I need.
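For concreteness, a small sketch of those client calls; the backend URL and the collection id "S2_L1C" are placeholders, and the bounding box roughly covers Czechia:

```python
import openeo

# Connect to a (hypothetical) C-SCALE openEO backend.
connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

# Discover available collections and inspect one of them.
print(connection.list_collections())
print(connection.describe_collection("S2_L1C"))

# Load only the extent the use case needs: a bbox over Czechia,
# a short time window, and a handful of bands.
cube = connection.load_collection(
    "S2_L1C",
    spatial_extent={"west": 12.0, "south": 48.5, "east": 18.9, "north": 51.1},
    temporal_extent=["2022-01-01", "2022-04-01"],
    bands=["B02", "B03", "B04", "B08"],  # blue, green, red, nir
)
```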

sustr4 commented 2 years ago

So per use case, we could load in just the extent that I need.

Yes! What is that extent? :-)

Jaapel commented 2 years ago
  1. Global waterwatch is global ;). But for getting this project to work, let's take Czechia as the extent. If we have more time, we can scale up to worldwide and see if the system is still performant. In terms of tiles, let's take all tiles that overlap with Czechia's borders. In terms of bands, we use:

    band_names = ["blue", "green", "red", "nir", "swir", "cloudmask", "cloudp"]
    band_codes = ["B02", "B03", "B04", "B08", "B11", "CLM", "CLP"]

    These should be enough for the analysis and visualization, so I do not expect later additions.

  2. The OpenEO backend should be available from my local desktop, where I develop the algorithm (notebook). For the production code, we can see how we want to schedule it. The easiest for me would be to submit a Dockerized workflow to the cluster. I am thinking of new_s2_data --- triggers ---> myscripts --- talks to ---> OpenEO backend --- exports to ---> DataStorage (see the sketch after this list), but this is something we can discuss together. Feel free to correct me if I am wrong here @jdries , but as far as I know, the openeo backend uses geotrellis in the background, which is a layer on top of Apache Spark. I believe that at VITO (and also at INCD) this is currently deployed on kubernetes. I am sure the team at VITO has some Infrastructure as Code (helm chart?) to help with the kubernetes configuration.

  3. JRC Global Water Occurrence dataset: https://global-surface-water.appspot.com/download We need overlap with the S2 dataset, of course. It is one layer, so it should not be too large; there is an example python script which shows how to download it.
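A minimal sketch of that trigger-style workflow, assuming a hypothetical backend URL, collection id and staging path (all placeholders to be discussed):

```python
import openeo

BANDS = ["B02", "B03", "B04", "B08", "B11", "CLM", "CLP"]
CZECHIA = {"west": 12.0, "south": 48.5, "east": 18.9, "north": 51.1}


def process_new_scenes(start: str, end: str) -> None:
    """Would be triggered when new Sentinel-2 L1C data arrives for [start, end)."""
    connection = openeo.connect("https://openeo.example.org").authenticate_oidc()
    cube = connection.load_collection(
        "S2_L1C",
        spatial_extent=CZECHIA,
        temporal_extent=[start, end],
        bands=BANDS,
    )
    # Run as a batch job on the backend, then export the result to storage.
    job = cube.save_result(format="GTiff").create_job(title="waterwatch-extract")
    job.start_and_wait()
    job.get_results().download_files("/data/staging/waterwatch/")
```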

jdries commented 2 years ago

@sustr4

  1. So normally, you connect openEO directly to the full archive. But it was decided to use this other approach, where another process preprocesses the L1C data. This is a lot less user-friendly and robust, so not something we would actually want to propose to other users. What you can also try is to register GDAL /vsizip/-style asset URLs in the STAC metadata (see the sketch after this list). Maybe that works as well...

  2. Documentation for the openEO deployment is here: https://github.com/Open-EO/openeo-geotrellis-kubernetes I recommend working on data access first before diving into this. We're also considering somehow further automating this deploy.

  3. Access to JRC Water layer is already there: https://openeo.vito.be/openeo/1.0/collections/GLOBAL_SURFACE_WATER Here again, I would propose to focus on the more critical issues first.
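Purely as an illustration of the /vsizip/ idea in point 1 (the archive layout below is invented), such an asset href lets GDAL-based readers open a band straight out of the zipped product:

```python
import rasterio

# Hypothetical /vsizip/ href pointing inside a zipped Sentinel-2 product;
# this exact path is made up, but the pattern is what a STAC asset could carry.
href = (
    "/vsizip//eodata/Sentinel-2/S2A_MSIL1C_20220101T100421.zip"
    "/S2A_MSIL1C_20220101T100421.SAFE/GRANULE/L1C_T33UVR"
    "/IMG_DATA/T33UVR_20220101T100421_B04.jp2"
)

# Any GDAL-based reader can open the file without unpacking the archive.
with rasterio.open(href) as src:
    print(src.profile)
```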

For registering data in STAC, there are some general guidelines about what STAC should look like: https://confluence.egi.eu/pages/viewpage.action?pageId=127729728 But I've also anticipated the need to write catalog ingestion scripts and proper documentation. There's an ongoing effort to plan a meeting for that, but I'm also fine with getting started already through github tickets.

sustr4 commented 2 years ago

In terms of bands, we use:

Do you use all resolutions or just the greatest available? Or is everything 20m to go with IR?

Jaapel commented 2 years ago

Do you use all resolutions or just the greatest available? Or is everything 20m to go with IR?

I use the best resolution available (10m for RGB). Cloud Masking bands (CLM, CLP) are much coarser.

valtri commented 2 years ago

I've checked the openeo k8s deployment. It looks like there is some problem with the image "vito-docker.artifactory.vgt.vito.be/openeo-geotrellis:0.1.8": Attempting next endpoint for pull after error: manifest unknown: The named manifest is not known to the registry.

(Just experimenting around with kubernetes. I don't have any experience with STAC or OpenEO, or with how to use them with notebooks...)

jdries commented 2 years ago

Indeed, that's an old image. As I said: solve the data access issues first, then do the openEO deploy...

jdries commented 2 years ago

I've created a first task for data staging: https://github.com/c-scale-community/stac-ingestion/issues/1 And an example notebook that already shows a lot of what needs to happen: https://github.com/c-scale-community/stac-ingestion/blob/main/CScale-PySTAC.ipynb

sustr4 commented 2 years ago

I have one question: it would make sense from the EGI/EOSC point of view to store the data in EGI DataHub (OneData). I have had only passing experience with it so far, so it's extra work for me (compared to using yet another network share), but it would look good in demos. Does anybody have anything to say about it? Encouragement? Warnings? Offers of assistance?

enolfc commented 2 years ago

Does anybody have anything to say about it? Encouragement? Warnings? Offers of assistance?

DataHub is a beast of its own and does not fit every use case. What's the use case for adding DataHub here? Do you want to use it as some sort of cache for the data, i.e. only moving the data as you access it locally? The main issue with that is that DataHub needs to index the data too, which may not be what we need here (and depending on the dataset, that can be a bottleneck).

jdries commented 2 years ago

Object storage would be best, then you don't depend on being able to mount the share, and you can easily have stac asset links that point to a working location from anywhere.

enolfc commented 2 years ago

Object storage would be best, then you don't depend on being able to mount the share, and you can easily have stac asset links that point to a working location from anywhere.

By object storage, do you mean S3? Does this support other protocols?

jdries commented 2 years ago

Actually, we can generalize to any HTTPS URL, either public or with a well-defined auth mechanism. Most of what object storage offers is not really needed; only support for HTTP range requests is somewhat of a necessity for bigger files: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
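As a quick sanity check (the URL below is a placeholder), one can verify that a server honours range requests before registering assets there:

```python
import requests

# Placeholder URL for a staged asset.
url = "https://data.example.org/waterwatch/T33UVR_20220101_B04.tif"

# Ask for only the first kilobyte of the file.
resp = requests.get(url, headers={"Range": "bytes=0-1023"}, timeout=30)

# 206 Partial Content means range requests are honoured; a plain 200 means
# the server ignored the Range header and returned the whole file.
print(resp.status_code, resp.headers.get("Content-Range"), len(resp.content))
```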

sustr4 commented 2 years ago

I'm afraid I understand less and less. I was told that having the data locally is a must for OpenEO. Now you tell me you are fine with HTTP? I'm truly confused, although there's a silver lining because I would like it if it were true.

backeb commented 2 years ago

For Aquamonitor we are testing the following:

  1. running the Aquamonitor workflow on INCD compute, accessing data at INCD, but registered in the central STAC catalogue WP2 has set up.
  2. running the Aquamonitor workflow on INCD compute, accessing data at CREODIAS

Then comparing the differences in performance.

jdries commented 2 years ago

With 'locally', we generally mean close enough to the compute to achieve high bandwidth / low access latency. For instance, with AWS, you can access object storage from within the same datacenter (that's what I then call local), or you can access it from your own laptop (remote). There are large differences in performance and cost between the two cases.

As @backeb points out, we're also going to test both cases, but here again, we already know that there's a cost associated with the remote access.

sustr4 commented 2 years ago

OK, so provided the data are already in the same datacenter and accessible over HTTPS (currently authenticated with username/password, later by OIDC token), that's fine with you?

jdries commented 2 years ago

Yes, it should work; of course, the proof is in actually doing it. As a test, you can always ingest some sample data into the central catalog; that would make things more concrete.

sebastian-luna-valero commented 1 year ago

Is there any progress to report on the data staging front?

sebastian-luna-valero commented 1 year ago

As discussed in today's monthly meeting, progress is blocked until https://github.com/c-scale-community/use-case-aquamonitor/issues/26 is solved, so that we can start using the openEO backend at INCD for Aquamonitor and transfer the lessons learned to CESNET for WaterWatch.