backeb commented 2 years ago

Sprint 2: 15-19 November 2021

Data objectives for Aquamonitor:

Start with Spain and Portugal demo, need 2 years of data (e.g. 2000-2002) = MVP
Then download all Landsat data for Spain and Portugal so that others benefit from it

Sprint activities

[ ] Investigate and evaluate how to set up a local STAC catalogue (@zbenta ++)
[ ] Use openEO platform API together with current Notebook (❓) workflow for a larger area (@Jaapel) and report findings to @jdries ++
[ ] Question: Will the providers have to implement a local STAC catalogue for the MDQS anyway? @backeb contact @sustr4

Additional notes from sprint planning discussion: https://github.com/c-scale-community/use-case-aquamonitor/issues/19#issuecomment-954682408

cc @gdonvito @jdries @gena @jopina @sebastian-luna-valero @sustr4 @mariojmdavid @miguelviana95 @tiagofglip

backeb commented 2 years ago

Will the providers have to implement a local STAC catalogue for the MDQS anyway? @backeb contact @sustr4

As far as I understand, for the openEO backend to work, the data needs to be either in a specific format or linked to a local STAC catalogue to deal with the metadata (right @jdries?)

@sustr4 will the providers have to install a local instance of STAC for the Metadata Query Service?

We are trying to understand the relationship of STAC with the openEO backend and the Metadata Query Service.

Essentially the problem we are trying to overcome is the following:

We are working on INCD with the openEO backend installed there
We want to use openEO on INCD to process 2 years (2000-2002) of Landsat data for Spain and Portugal
The data are not (yet in the correct format) on INCD, and openEO assumes local caches (right @jdries?)

If creating local caches was not envisaged in the project, how would you do the processing on INCD?

jdries commented 2 years ago

Correct, a local data catalog is needed for two reasons:

performance/cost: reading from a remote catalog/object storage would mean that most of your processing time is spent on downloading data. (The CreoDIAS catalog is one such example, and also has a cost associated with doing S3 requests.)
The local (STAC) catalog allows openEO to efficiently discover the stored data. Workarounds without a catalog result in a degraded user experience.
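To make the discovery point concrete, here is a minimal sketch of the kind of query a local STAC catalogue would answer for this use case. It only builds the JSON body of a STAC API item search (`POST /search`) with the Python standard library. The collection id `landsat-demo`, the approximate bounding box for Spain and Portugal, and the reading of "2000-2002" as a two-year interval are all assumptions for illustration, not values agreed in this thread.

```python
import json

# Hypothetical values: collection id and bounding box are illustrative only.
COLLECTION_ID = "landsat-demo"
IBERIA_BBOX = [-10.0, 35.5, 4.5, 44.0]  # west, south, east, north (approximate)


def build_search_body(collection, bbox, start, end, limit=100):
    """Build the JSON body for a STAC API item search (POST /search)."""
    if len(bbox) != 4:
        raise ValueError("bbox must be [west, south, east, north]")
    return {
        "collections": [collection],
        "bbox": list(bbox),
        # STAC API expresses a time range as an RFC 3339 interval "start/end".
        "datetime": f"{start}T00:00:00Z/{end}T00:00:00Z",
        "limit": limit,
    }


# Assumed two-year window for the MVP (one reading of "2000-2002").
body = build_search_body(COLLECTION_ID, IBERIA_BBOX, "2000-01-01", "2002-01-01")
print(json.dumps(body, indent=2))
```

The resulting body could be posted to whichever STAC API endpoint ends up serving the catalogue; the endpoint itself is deliberately left out here.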

sustr4 commented 2 years ago

Hi, Bjorn!

@sustr4 will the providers have to install a local instance of STAC for the Metadata Query Service? We are trying to understand the relationship of STAC with the openEO backend and the Metadata Query Service.

MQS is planned as a remote service exposing STAC-API. The plan is to have a failover setup with redundant endpoints, but no local component is assumed, at least for my interpretation of "local".

Essentially the problem we are trying to overcome is the following: We are working on INCD with the openEO backend installed there. We want to use openEO on INCD to process 2 years (2000-2002) of Landsat data for Spain and Portugal. The data are not (yet in the correct format) on INCD, and openEO assumes local caches (right @jdries?)

My previous question applies: what does "local" stand for?

If creating local caches was not envisaged in the project, how would you do the processing on INCD?

Creating a limited, area-specific archive for a given use case (or user group) is not only envisioned but even welcome. Sadly, as I mentioned earlier, the use cases are starting quite early in the project, so no solution I can offer presently is ideal:

install DHuS, which supports at least Landsat 8, import data in there, and rely on its OpenSearch API for the time being. STAC interface will be implemented by WP2 later.
wait for data-on-disk solution to be provided by WP2, full with accessors and STAC-API.

I'll be happy to elaborate more later, if there are questions. I got to go now :-)

Cheers, Zdeněk

sustr4 commented 2 years ago

* performance/cost: reading from a remote catalog/object storage would mean that most of your processing time is spent on downloading data.

I don't understand. You need to download the data anyway. It's just whether you download all of it beforehand, or you download "just in time". Plus downloading beforehand incurs local storage cost.

* The local (STAC) catalog allows openEO to efficiently discover the stored data. Workarounds without a catalog result in a degraded user experience.

Please can you specify what's "Local"?

Zdeněk

jdries commented 2 years ago

@sustr4 there are a few ways in which a local catalog can reduce the number of downloads:

most use cases require processing the same dataset multiple times
a datacenter can choose to focus on certain regions and predownload the data, allowing the user to select a provider that already has his data of interest
multiple users may be working on the same datasets, so the download of one user can be reused by other users

Also note that a download can take hours to days rather than seconds/minutes. A lot depends on this, if someone can show a very fast and cheap download of EO data archives for countries and continents, we should perhaps reconsider.

With 'local', I mean that the data is stored close enough to the processing system to allow fast access, similar to what you get when reading from network drives or object storage in the same datacenter. The actual catalog can be anywhere, as long as it can return links to the 'local' data.
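As a rough illustration of that definition, the sketch below checks whether the asset links of a STAC item point at storage that a datacenter considers "local". The host name `objectstore.example.org` and the `/data/` path prefix are hypothetical placeholders; each provider would substitute its own.

```python
from urllib.parse import urlparse

# Hypothetical: hosts and path prefixes that count as "local" for one datacenter.
LOCAL_HOSTS = {"objectstore.example.org"}
LOCAL_PATH_PREFIXES = ("/data/",)


def is_local_href(href):
    """Return True if an asset link points at storage near the processing system."""
    parsed = urlparse(href)
    if parsed.scheme in ("", "file"):
        # Plain paths and file:// URLs: compare against local mount points.
        return parsed.path.startswith(LOCAL_PATH_PREFIXES)
    # Remote-style URLs: only hosts inside the same datacenter count as local.
    return parsed.hostname in LOCAL_HOSTS


def all_assets_local(item):
    """Check every asset of a STAC item (a dict with an "assets" mapping)."""
    assets = item.get("assets", {})
    return bool(assets) and all(is_local_href(a["href"]) for a in assets.values())


item = {
    "id": "example-scene",
    "assets": {
        "B4": {"href": "/data/landsat/example-scene/B4.tif"},
        "B5": {"href": "https://objectstore.example.org/landsat/example-scene/B5.tif"},
    },
}
print(all_assets_local(item))
```

The catalogue serving such items can run anywhere; only the `href` values need to resolve to nearby storage.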

Happy to hear that you're considering a 'data-on-disk' solution, I hope to join the next WP2 meeting so I can learn more about it!

sustr4 commented 2 years ago

1. most use cases require processing the same dataset multiple times

Agreed, but as far as I can tell we don't have those use cases in C-SCALE. It's an analytics platform that tends to run analysis once over and be done.

2. a datacenter can choose to focus on certain regions and predownload the data, allowing the user to select a provider that already has his data of interest

Perfect! Let them become a member of the C-SCALE Data federation, then. They're welcome.

3. multiple users may be working on the same datasets, so the download of one user can be reused by other users

True, but it does not apply to C-SCALE use cases, which have been assigned to different compute providers regardless of data reuse anyway. So not a problem for today. We know we will have a solution by the end of the project, why push it now?

Also note that a download can take hours to days rather than seconds/minutes. A lot depends on this, if someone can show a very fast and cheap download of EO data archives for countries and continents, we should perhaps reconsider.

Sorry, I consider this an artificial reason. If a product takes days to download, how are you even going to fill a meaningful local cache?

With 'local', I mean that the data is stored close enough to the processing system to allow fast access, similar to what you get when reading from network drives or object storage in the same datacenter. The actual catalog can be anywhere, as long as it can return links to the 'local' data.

Fine, so if the "local" solution is not provided until later, it may slow the use cases down but not block them altogether, right? That buys us time :-) Also most of us in the data federation do not think our networks are slow :-)

Happy to hear that you're considering a 'data-on-disk' solution, I hope to join the next WP2 meeting so I can learn more about it!

It's been part of the design from the beginning. Some partners in the data federation, e.g. EODC, have that setup at home.

Zdeněk

backeb commented 2 years ago

Thanks @sustr4 and @jdries for the input.

True, but it does not apply to C-SCALE use cases, which have been assigned to different compute providers regardless of data reuse anyway. So not a problem for today. We know we will have a solution by the end of the project, why push it now?

The use cases should report on whether or not solutions are fit for purpose, so if the solution is only ready at the end of the project, how are we going to test it?

To progress on this use case, I suggest that INCD deploy a local STAC catalogue for the data they have downloaded, so that we can test the openEO backend on INCD.

@mariojmdavid @zbenta @jopina @miguelviana95 @tiagofglip do you agree?

sustr4 commented 2 years ago

The use cases should report on whether or not solutions are fit for purpose, so if the solution is only ready at the end of the project, how are we going to test it?

I wrote "By the end of the project" since I considered it safe. Actually, end of 2022 is worst case and even that gives us months for evaluation.

Zdeněk

jdries commented 2 years ago

I think this is probably the key thing we disagree on:

we don't have those use cases in C-SCALE. It's an analytics platform that tends to run analysis once over and be done.

In my opinion, C-SCALE wants to support the full lifecycle from R&D to data production, and in that case, users really need to do quite a few iterations, on increasingly large datasets. Very often, a run on a full dataset (like a country, continent, or global) also reveals issues that were not visible in small-scale runs, which even triggers a rerun at the largest scale. The case of just running something once is not something I see happening a lot, but maybe I'm biased in some respect. So I really think we should somehow discuss this point, perhaps with other stakeholders.

Note that with respect to planning, from my side it's fine to wait a bit, as long as we have enough time left to integrate and provide feedback on what WP2 eventually provides.

mariojmdavid commented 2 years ago

Pragmatically, we will deploy the local STAC metadata catalog; this will allow the users and this use case to advance. We will discuss internally the possibility of becoming part of the data federation. This was not foreseen in the project, but we have come this far, so it may make sense, and besides, we are very interested in having this for national purposes.

This does not in any way hinder the developments in WP2 and, afaik, the use case will be prepared for it when it comes.

backeb commented 2 years ago

If we do the same with INFN:

deploy openEO backend
download Landsat data for Italy
pilot aquamonitor scaling from INCD processing Spain and Portugal data to INFN processing Italy data

We could, in this way, start building a distributed Landsat data archive in Europe attached to a processing backend (openEO) and FAIR via the Metadata Query Service. I expect a broader community beyond the project use cases would benefit from this. And it could pave the way for distributed archives. Maybe? @sustr4 @mariojmdavid @gdonvito

sustr4 commented 2 years ago

Maybe? @sustr4 @mariojmdavid @gdonvito

You know, all this is "very nice to have". And everybody wants to have nice things, I agree. But we have promised (and named our project after) Copernicus, so setting crucial resources aside to cater for Landsat in this very nice but somewhat underfunded project is to be seriously considered.

Just my $0,02.

Zdeněk

backeb commented 2 years ago

Maybe? @sustr4 @mariojmdavid @gdonvito You know, all this is "very nice to have". And everybody wants to have nice things, I agree. But we have promised (and named our project after) Copernicus, so setting crucial resources aside to cater for Landsat in this very nice but somewhat underfunded project is to be seriously considered. Just my $0,02. Zdeněk

Fair comment! The Landsat data is valuable though, especially for longer time-series analysis. And if we can combine Landsat data with Sentinel data, that would be really valuable.

We have other use cases as well, so Copernicus will be taken care of.

backeb commented 2 years ago

Sprint 2 retro

Top: What worked well?

Breakthroughs regarding STAC

Tip: What to improve?

...

Sprint progress

Investigate and evaluate how to set up a local STAC catalogue (@zbenta ++)
...
Use openEO platform API together with current Notebook (❓) workflow for a larger area (@Jaapel) and report findings to @jdries ++
Registered for openEO platform
Catching up next week on this task

Will the providers have to implement a local STAC catalogue for the MDQS anyway?

INCD will deploy a local STAC catalogue (https://github.com/c-scale-community/use-case-aquamonitor/issues/21#issuecomment-964263037)
Breakthroughs regarding STAC
Implemented EODAG STAC catalogue
Configured in Docker
Configured for CREODIAS provider
Can search and find collections and select period and area
Uses its own internal cache for queries
❓ How to tell the openEO backend that INCD has a local STAC catalogue?
- openEO has a config file where we can define endpoints
- openEO needs it unzipped (don't need zip files, most users use unzipped data)
- how will openEO map the collections in the docker image in the openEO endpoint? Action for @jdries
- rebuilding openEO kubernetes cluster - @zbenta will give @jdries the IP address where to connect and figure this out
Note: STAC metadata you get back from the different providers can sometimes be different
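
One possible way to cope with such differences is a small normalisation step before the metadata reaches the backend. The sketch below is only an assumption about what might differ: it handles items that give `start_datetime` instead of `datetime` (which the STAC item spec permits) and a cloud-cover field under a provider-specific name. The name `cloudCover` is a hypothetical example, not something reported in this sprint.

```python
def normalise_properties(properties):
    """Map provider-specific STAC item properties onto one small, common shape."""
    # STAC allows "datetime" to be null when start/end datetimes are given.
    acquired = properties.get("datetime") or properties.get("start_datetime")
    # "eo:cloud_cover" is the EO extension field; "cloudCover" is a hypothetical
    # provider-specific spelling used here only for illustration.
    cloud = properties.get("eo:cloud_cover")
    if cloud is None:
        cloud = properties.get("cloudCover")
    return {"acquired": acquired, "cloud_cover": cloud}


provider_a = {"datetime": "2001-06-01T10:30:00Z", "eo:cloud_cover": 12.5}
provider_b = {
    "datetime": None,
    "start_datetime": "2001-06-01T10:30:00Z",
    "end_datetime": "2001-06-01T10:30:30Z",
    "cloudCover": 12.5,
}

# Both providers describe the same scene, so the normalised views should match.
print(normalise_properties(provider_a) == normalise_properties(provider_b))
```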

Objectives for next sprint

@jdries and @zbenta ++ to meet and configure the layer based on the catalogue
Demonstrate that synchronous jobs work
Maybe set up a ZooKeeper that keeps track of db state (needed for batch jobs) -- not this sprint

Date of next sprint

6-10 December 2021

AOB

We are finding that compute providers are also becoming part of the data federation
13-14 December Copernicus meeting in Portugal, abstract submitted - hopefully get presentation!

zbenta commented 2 years ago

@jdries when can we schedule a meeting? We already have the STAC server and the openEO endpoint available. We just need to take care of the integration of the STAC catalog into the openEO service.

jdries commented 2 years ago

Great, would Thursday afternoon work? If you have a URL for the STAC service, could you forward it? Then I can start by having a look.

zbenta commented 2 years ago

Hi @jdries, Thursday is not a good day for us (national holiday). We are available on Friday 3rd (whole day), Monday 6th (morning), Tuesday 7th (whole day), Thursday 9th (morning) and Friday 10th (I'm in love).

jdries commented 2 years ago

Friday the 3rd, 2PM?

zbenta commented 2 years ago

Sure, just tell us what time zone you are using :-D

jdries commented 2 years ago

Brussels time, shall I create an invite, that should also help with time zones? Just let me know who to send it to.

v-miguel commented 2 years ago

Hello Jeroen,

Please use this Zoom meeting link: https://videoconf-colibri.zoom.us/j/81201778130?pwd=U1F0bS91QjZ4SzRBRm9XeGFydkhvZz09

We are already here.

Cumprimentos / Best Regards,

Miguel Viana INCD @ LIP - Universidade do Minho

zbenta commented 2 years ago

New meeting URL:

https://videoconf-colibri.zoom.us/j/81386195303?pwd=ajhDdG1VcmM0b2JDNlBJWTZ6S1Q2Zz09

Cumprimentos / Best Regards, Zacarias Benta

v-miguel commented 2 years ago

Hello all,

Regarding the local STAC catalogue: as long as it is available to everyone through stac.a.incd.pt or stac-browser.a.incd.pt, we should consider the following points:

Since stac.a.incd.pt is indexed by some search engines, we could get data requests from users who are not part of this project.
Downloads of irrelevant data by users who are not part of this project could evict project-relevant data that is already cached.
By making the data obtained from CREODIAS (for example) available to everyone, we may be going against the data usage licence imposed by the CREODIAS provider. (I'm not 100% sure if this point applies.)

We recognize the need for access to the stac catalog for debugging purposes by the project developers and therefore, a possible solution would be to restrict access to stac.a.incd.pt and stac-browser.a.incd.pt to only specific IPs.
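
A minimal sketch of the allowlist idea, using only Python's standard `ipaddress` module. The networks listed are documentation-range placeholders, not real project addresses, and in practice the check would more likely live in a firewall or reverse-proxy rule than in application code.

```python
import ipaddress

# Hypothetical allowlist: documentation-range addresses, not real project IPs.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(client_ip):
    """Return True if the client address falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # not a valid IP address at all
    return any(address in network for network in ALLOWED_NETWORKS)


for ip in ("192.0.2.17", "198.51.100.5", "2001:db8::1", "not-an-ip"):
    print(ip, is_allowed(ip))
```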
