C-SCALE computing resources: take them or lose them!

sebastian-luna-valero commented 1 year ago

Hi!

This use case currently has the following resources allocated:

Compute: 180 vCPUs at INCD, 128 vCPUs at INFN Cloud Bari
Storage: 18 TB at INCD, 50 TB at INFN Cloud Bari

Do you think that's enough for the next 9 months? C-SCALE currently has spare capacity that can be allocated now, so if you need to scale up, please let us know asap!

If we don't hear from you by Friday 30th Sept we will assume that you don't need more capacity and we will then reuse the spare capacity in C-SCALE for other use cases.

On the other hand, if you no longer need C-SCALE computing resources, please let us know as well, so that they can be reallocated to other use cases.

Best regards, C-SCALE

sebastian-luna-valero commented 1 year ago

cc: @backeb @Jaapel

backeb commented 1 year ago

Hi @sebastian-luna-valero

The original plan was to run the openEO Aquamonitor notebook on INCD compute, accessing data locally on INCD, and then run the notebook again on INCD compute but accessing remote data from somewhere else.

At the moment, though, because of the delays in getting the openEO back-end to work on INCD, we have been using the VITO openEO back-end. But I believe we are now ready to do the tests using the INCD back-end.

I expect we could use the resources on INCD, although again it isn't clear how many resources we could consume: it depends on how much data is available to process. We would want to expand to larger areas and longer time series.

For the INFN resources, it really depends on whether they can deploy the openEO back-end for us to use.

enolfc commented 1 year ago

@backeb: What extra requirements are there once INFN deploys openEO? I guess the situation will be similar to the one at INCD. What data should be available at that site?

mariojmdavid commented 1 year ago

Hi @enolfc, at INCD we are now using object storage (Ceph S3/Swift) to host the satellite images. Note that we are reporting accounting for this usage.

maricaantonacci commented 1 year ago

Hi all, at INFN we have just installed the openEO platform on a k8s cluster running in the acquamonitor OpenStack tenant... Can you please provide instructions on how to pull and store the required datasets?

sebastian-luna-valero commented 1 year ago

Thanks @maricaantonacci !

Looking at previous comments from @jdries via email:

The main and first thing to configure are the collections that the backend wants to expose. This is one example config file, that exposes the public collection at INCD: https://github.com/Open-EO/openeo-geotrellis-kubernetes/blob/master/docker/cscale_layercatalog.json

Its location is configured in this yaml file: https://gitlab.com/lip-computing/openeo/-/blob/main/values.yaml#L34

Maybe @tcassaert and @sustr4 can also help.

Best regards, Sebastian

jdries commented 1 year ago

Hi @maricaantonacci , @sebastian-luna-valero is entirely right, so basically, we'll need some K8S config to get this custom cscale_layercatalog.json inside the spark driver pod, and can then set the location in values.yaml accordingly.

When done correctly, the collections in that json should show up in: https://your.openeo.endpoint/openeo/1.1.0/collections
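To check this without clicking through the listing by hand, something like the following sketch could be used. It relies only on the openEO API convention that `GET /collections` returns a JSON object with a `collections` array whose entries carry an `id`; the helper names and the placeholder collection id are hypothetical and not part of any openEO tooling.

```python
import json
from urllib.request import urlopen


def collection_ids(payload: dict) -> list[str]:
    """Extract the collection ids from a GET /collections response body."""
    return [c["id"] for c in payload.get("collections", [])]


def missing_collections(payload: dict, expected: list[str]) -> list[str]:
    """Return the expected ids that the back-end does not expose."""
    exposed = set(collection_ids(payload))
    return [cid for cid in expected if cid not in exposed]


def fetch_collections(endpoint: str) -> dict:
    """Download the collection listing.

    `endpoint` is the versioned API root, e.g.
    https://your.openeo.endpoint/openeo/1.1.0
    """
    with urlopen(f"{endpoint.rstrip('/')}/collections") as response:
        return json.load(response)
```

For example, `missing_collections(fetch_collections("https://your.openeo.endpoint/openeo/1.1.0"), ["SOME_COLLECTION_ID"])` returns an empty list once every id from the custom layer catalog is visible, where `SOME_COLLECTION_ID` stands in for whatever ids the json file defines.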

On the larger issue of whether or not to pull in data, there has already been a lot of discussion. There's this open issue that could be helpful for data providers, but I don't think there are plans to move forward with it: https://jira.egi.eu/browse/CSWP3-29

With the config in that custom file, you basically get what is described in https://jira.egi.eu/browse/CSWP3-28, where we hope that the network connection between your datacenter and the data provider is good enough to provide a satisfying user experience. We'll discover that when we run the aquamonitor case on your endpoint while fetching data from INCD or CreoDIAS.

jdries commented 1 year ago

@sebastian-luna-valero for this use case, there's also the possibility to fetch the data on demand from CreoDIAS. Can we also allocate the necessary VA to them, so we can get a keypair that allows us to configure this layer?

enolfc commented 1 year ago

@jdries, would that be using S3? We did discuss the possibility of including that one as part of VA, but never really finalised this. If we go this way, would you be able to estimate how much data should be moved? Unfortunately, for VA we need to be precise about how many units (TB/month?) we will make available through the project, so the more accurate this number can be, the better.

Then we should agree with @cchatzikyriakou and @LukaszKubowicz on the figures to use for the new installation.

jdries commented 1 year ago

@enolfc To make estimates, we'll want to start from the list of Sentinel-2 L1C products that need to be processed by aquamonitor on INFN. For this, a catalog query can be done if someone knows the area and time range.
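As a rough illustration of such a catalog query, the sketch below builds the request body for a STAC API item search. This assumes the chosen catalog speaks STAC, which the thread does not confirm; the collection id `sentinel-2-l1c` and the bounding box are placeholders, and the function name is hypothetical.

```python
from datetime import date


def build_search_body(bbox, start: date, end: date,
                      collection: str, limit: int = 100) -> dict:
    """Build a STAC API item-search body for an area and a time range.

    bbox is (west, south, east, north) in degrees; boxes that cross the
    antimeridian are not handled by this sketch.
    """
    west, south, east, north = bbox
    if not (west < east and south < north):
        raise ValueError("bbox must be (west, south, east, north)")
    if start > end:
        raise ValueError("start must not be after end")
    return {
        "collections": [collection],
        "bbox": [west, south, east, north],
        # Closed interval in RFC 3339 format, as the STAC API expects.
        "datetime": f"{start.isoformat()}T00:00:00Z/{end.isoformat()}T23:59:59Z",
        "limit": limit,
    }


# Placeholder values: an approximate box and an arbitrary half year.
body = build_search_body((12.0, 45.2, 12.7, 45.7),
                         date(2022, 1, 1), date(2022, 6, 30),
                         "sentinel-2-l1c")
```

The body would be sent as JSON in a `POST` to the catalog's `/search` endpoint, and the number and size of the returned products then feed into the storage or transfer estimate.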

Of course, there seems to be an existing estimate of 50TB of storage at INFN. If remote access is used, then local storage is almost nothing, and we could for instance estimate 45TB of data transfer, assuming a single run of aquamonitor over this dataset.
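To turn that into the TB/month figure asked for above, a back-of-the-envelope sketch could look like this. The 45 TB and the single run come from the estimate above, and the 9 months come from the original allocation question; spreading the transfer evenly over that period is purely an assumption, and the function name is hypothetical.

```python
def transfer_estimate(dataset_tb: float, runs: int, months: int) -> tuple[float, float]:
    """Return (total TB transferred, average TB per month).

    Assumes every run reads the whole dataset remotely once, with no
    caching between runs, and that usage is spread evenly over `months`.
    """
    if dataset_tb < 0 or runs < 0:
        raise ValueError("dataset_tb and runs must be non-negative")
    if months <= 0:
        raise ValueError("months must be positive")
    total_tb = dataset_tb * runs
    return total_tb, total_tb / months


total_tb, tb_per_month = transfer_estimate(dataset_tb=45, runs=1, months=9)
print(f"{total_tb:.1f} TB in total, {tb_per_month:.1f} TB/month")
```

Under these assumptions a single run works out to 5 TB/month; each rerun over the same dataset adds another 45 TB to the total.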

Note that for me it doesn't really matter, but if the local storage option is preferred, then of course someone needs to get the data locally. With the remote option, we can just configure openEO to read from the remote S3.

backeb commented 1 year ago

I would suggest that INFN get the data for all of Italy, then they can start building up a Sentinel data archive for Italy - I'm sure there would be interested stakeholders!

For the Aquamonitor use case there are (I think) interesting things to look at near Venice.

cc @Jaapel @ArjenHaag

Jaapel commented 1 year ago

The GEE implementation of aquamonitor uses a dataset that is large in multiple dimensions:

a large temporal extent (1985 - present)
a large spatial extent (worldwide)
a large number of missions (Landsat 4, 5, 7, 8 and Sentinel 2)

Looking at what we currently have at the VITO endpoint for OpenEO, there is the Terrascope S2 collection, which covers a large spatial extent (Europe + selected areas, although I do not know what the selected areas are). This collection only contains data from the last 2 years. I have not been able to experiment with harmonisation using different satellite sources. Perhaps we can test scaling spatially on one of the providers and scaling temporally on the other? Just a suggestion, I am open to alternatives.

And indeed @backeb Venice is a great area of interest to show land-to-water changes and vice-versa.

Jaapel commented 1 year ago

I am not sure how difficult harmonization will be, but I'd rather not develop more code this year, and would just try running what I have for S2 on different backends using different S2 sources.

sustr4 commented 1 year ago

@backeb wrote: "I would suggest that INFN get the data for all of Italy"

As usual, I'm happy to offer our Relay as the source of fresh Sentinel data.

sebastian-luna-valero commented 1 year ago

I guess we cannot move forward until https://github.com/c-scale-community/use-case-aquamonitor/issues/26 is solved.
