c-scale-community / use-case-aquamonitor

Apache License 2.0

Sprint 2: 15-19 November 2021 #21

Closed backeb closed 2 years ago

backeb commented 2 years ago

Sprint 2: 15-19 November 2021

Data objectives for Aquamonitor:

Sprint activities

Additional notes from sprint planning discussion: https://github.com/c-scale-community/use-case-aquamonitor/issues/19#issuecomment-954682408

cc @gdonvito @jdries @gena @jopina @sebastian-luna-valero @sustr4 @mariojmdavid @miguelviana95 @tiagofglip

backeb commented 2 years ago

Will the providers have to implement a local STAC catalogue for the MDQS anyway? @backeb contact @sustr4

As far as I understand, for the openEO backend to work, the data needs to be either in a specific format or linked to a local STAC catalogue to deal with the metadata (right @jdries?)

@sustr4 will the providers have to install a local instance of STAC for the Metadata Query Service?

We are trying to understand the relationship of STAC with the openEO backend and the Metadata Query Service.

Essentially the problem we are trying to overcome is the following:

If creating local caches was not envisaged in the project, how would you do the processing on INCD?

jdries commented 2 years ago

Correct, a local data catalog is needed for two reasons:

sustr4 commented 2 years ago

Hi, Bjorn!

@sustr4 will the providers have to install a local instance of STAC for the Metadata Query Service? We are trying to understand the relationship of STAC with the openEO backend and the Metadata Query Service.

MQS is planned as a remote service exposing STAC-API. The plan is to have a failover setup with redundant endpoints, but no local component is assumed, at least for my interpretation of "local".
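A minimal sketch of how a client might use such a setup, assuming two redundant STAC-API endpoints. The endpoint URLs, the collection ID and the rough bounding box for mainland Spain and Portugal are placeholders, not values given in this thread; only the `POST /search` route and the query fields come from the STAC API item search specification.

```python
import json
import urllib.request

# Hypothetical endpoints: the thread does not give the MQS URLs.
MQS_ENDPOINTS = [
    "https://mqs-primary.example.org/stac/v1",
    "https://mqs-backup.example.org/stac/v1",
]

# Hypothetical query: collection ID and bounding box are assumptions.
EXAMPLE_QUERY = {
    "collections": ["landsat"],
    "bbox": [-9.6, 35.9, 3.4, 43.9],  # rough west, south, east, north
    "datetime": "2000-01-01T00:00:00Z/2002-12-31T23:59:59Z",
    "limit": 10,
}


def search_with_failover(endpoints, query, opener=urllib.request.urlopen, timeout=30):
    """POST a STAC-API search to each endpoint in turn and return the first reply."""
    errors = []
    for base in endpoints:
        request = urllib.request.Request(
            base.rstrip("/") + "/search",
            data=json.dumps(query).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with opener(request, timeout=timeout) as response:
                return json.load(response)
        except OSError as exc:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            errors.append(f"{base}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

Calling `search_with_failover(MQS_ENDPOINTS, EXAMPLE_QUERY)` would try the first endpoint and fall back to the second only if the first cannot be reached. The `opener` parameter exists only so the function can be exercised without network access.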

Essentially the problem we are trying to overcome is the following:

 * We are working on INCD with the openEO backend installed there
 * We want to use openEO on INCD to process 2 years (2000-2002) of Landsat data for Spain and Portugal
 * The data are not (yet in the correct format) on INCD, and openEO assumes local caches (right @jdries?)

My previous question applies: what does "local" stand for?

If creating local caches was not envisaged in the project, how would you do the processing on INCD?

Creating a limited, area-specific archive for a given use case (or user group) is not only envisioned but even welcome. Sadly, as I mentioned earlier, the use cases are starting quite early in the project, so no solution I can offer presently is ideal:

  1. install DHuS, which supports at least Landsat 8, import data in there, and rely on its OpenSearch API for the time being. STAC interface will be implemented by WP2 later.

  2. wait for the data-on-disk solution to be provided by WP2, complete with accessors and STAC-API.

I'll be happy to elaborate more later, if there are questions. I've got to go now :-)

Cheers, Zdeněk

sustr4 commented 2 years ago

 * performance/cost: reading from a remote catalog/object storage would mean that most of your processing time is spent on downloading data.

I don't understand. You need to download the data anyway. It's just whether you download all of it beforehand, or you download "just in time". Plus downloading beforehand incurs local storage cost.

 * The local (STAC) catalog allows openEO to efficiently discover the stored data. Workarounds without a catalog result in a degraded user experience.

Please can you specify what's "Local"?

Zdeněk

jdries commented 2 years ago

@sustr4 there are a few ways in which a local catalog can reduce the number of downloads:

  1. most use cases require processing the same dataset multiple times
  2. a datacenter can choose to focus on certain regions and predownload the data, allowing the user to select a provider that already has his data of interest
  3. multiple users may be working on the same datasets, so the download of one user can be reused by other users

Also note that a download can take hours to days rather than seconds/minutes. A lot depends on this, if someone can show a very fast and cheap download of EO data archives for countries and continents, we should perhaps reconsider.
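The effect of reuse on download volume can be put into numbers. The sketch below is only back-of-the-envelope arithmetic: the dataset size, number of runs and link speed are made-up inputs, and it ignores partial reads, protocol overhead and local storage cost.

```python
def total_transfer_gb(dataset_gb, runs, cached):
    """Gigabytes pulled over the network for `runs` passes over one dataset.

    With a local cache the dataset is downloaded once; without one it is
    downloaded again for every run.
    """
    if runs <= 0:
        return 0
    return dataset_gb if cached else dataset_gb * runs


def transfer_hours(gigabytes, gbit_per_s):
    """Hours needed to move `gigabytes` over a link of `gbit_per_s`."""
    return gigabytes * 8 / gbit_per_s / 3600


# Hypothetical inputs: 5 TB of scenes, 10 processing runs, a 1 Gbit/s link.
cached_hours = transfer_hours(total_transfer_gb(5000, 10, cached=True), 1)
uncached_hours = transfer_hours(total_transfer_gb(5000, 10, cached=False), 1)
print(f"cached: {cached_hours:.1f} h, uncached: {uncached_hours:.1f} h")
```

With these inputs the cached case needs about 11.1 hours of transfer and the uncached case about 111.1 hours. For a single run the two are identical, so the benefit depends entirely on how often the same data is processed again.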

With 'local', I mean that the data is stored close enough to the processing system to allow fast access, similar to what you get when reading from network drives or object storage in the same datacenter. The actual catalog can be anywhere, as long as it can return links to the 'local' data.
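To illustrate "a catalog that returns links to the 'local' data", here is a minimal sketch that rewrites the asset links of a STAC item so they point at a copy near the processing system. It assumes the data has been mirrored under a hypothetical mount point, `/data/landsat`, keeping the remote path layout; neither the mount point nor the layout comes from this thread.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

LOCAL_ROOT = "/data/landsat"  # hypothetical mount point, an assumption


def localize_assets(item, local_root=LOCAL_ROOT):
    """Return a copy of a STAC item whose asset hrefs point under `local_root`.

    The path part of each remote href is kept and re-rooted, so
    https://host/a/b.TIF becomes <local_root>/a/b.TIF.
    """
    localized = dict(item)
    localized["assets"] = {}
    for name, asset in item.get("assets", {}).items():
        remote_path = urlparse(asset["href"]).path.lstrip("/")
        new_asset = dict(asset)
        new_asset["href"] = str(PurePosixPath(local_root) / remote_path)
        localized["assets"][name] = new_asset
    return localized


# Hypothetical item with one band stored in a remote archive.
remote_item = {
    "id": "scene-1",
    "assets": {
        "B4": {
            "href": "https://archive.example.org/c1/L7/scene-1_B4.TIF",
            "type": "image/tiff",
        }
    },
}
local_item = localize_assets(remote_item)
```

The catalog service itself could run anywhere. What matters for processing speed is that the hrefs it hands out resolve to storage close to the backend.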

Happy to hear that you're considering a 'data-on-disk' solution, I hope to join the next WP2 meeting so I can learn more about it!

sustr4 commented 2 years ago

   1. most use cases require processing the same dataset multiple times

Agreed, but as far as I can tell we don't have those use cases in C-SCALE. It's an analytics platform that tends to run an analysis once and be done.

   2. a datacenter can choose to focus on certain regions and predownload the data, allowing the user to select a provider that already has his data of interest

Perfect! Let them become a member of the C-SCALE Data federation, then. They're welcome.

   3. multiple users may be working on the same datasets, so the download of one user can be reused by other users

True, but it does not apply to C-SCALE use cases, which have been assigned to different compute providers regardless of data reuse anyway. So not a problem for today. We know we will have a solution by the end of the project, why push it now?

Also note that a download can take hours to days rather than seconds/minutes. A lot depends on this, if someone can show a very fast and cheap download of EO data archives for countries and continents, we should perhaps reconsider.

Sorry, I consider this an artificial reason. If a product takes days to download, how are you even going to fill a meaningful local cache?

With 'local', I mean that the data is stored close enough to the processing system to allow fast access, similar to what you get when reading from network drives or object storage in the same datacenter. The actual catalog can be anywhere, as long as it can return links to the 'local' data.

Fine, so if the "local" solution is not provided until later, it may slow the use cases down but not block them altogether, right? That buys us time :-) Also most of us in the data federation do not think our networks are slow :-)

Happy to hear that you're considering a 'data-on-disk' solution, I hope to join the next WP2 meeting so I can learn more about it!

It's been part of the design from the beginning. Some partners in the data federation, e.g. EODC, have that setup at home.

Zdeněk

backeb commented 2 years ago

Thanks @sustr4 and @jdries for the input.

True, but it does not apply to C-SCALE use cases, which have been assigned to different compute providers regardless of data reuse anyway. So not a problem for today. We know we will have a solution by the end of the project, why push it now?

The use cases should report on whether or not solutions are fit for purpose, so if the solution is only ready at the end of the project, how are we going to test it?

To progress on this use case, I suggest that INCD deploy a local STAC catalogue for the data they have downloaded, so that we can test the openEO backend on INCD.

@mariojmdavid @zbenta @jopina @miguelviana95 @tiagofglip do you agree?

sustr4 commented 2 years ago

The use cases should report on whether or not solutions are fit for purpose, so if the solution is only ready at the end of the project, how are we going to test it?

I wrote "By the end of the project" since I considered it safe. Actually, end of 2022 is worst case and even that gives us months for evaluation.

Zdeněk

jdries commented 2 years ago

I think this is probably the key thing we disagree on:

we don't have those use cases in C-SCALE. It's an analytics platform that tends to run analysis once over and be done.

In my opinion, C-Scale wants to support the full lifecycle from R&D to data production, and in that case, users really need to do quite a few iterations, on increasingly large datasets. Very often, a run on a full dataset (like a country, continent, or global) also reveals issues that were not visible at small scale runs, which triggers even a rerun on the largest scale. The case of just running something once is not something I see happening a lot, but maybe I'm biased in some respect. So I really think we should somehow discuss this point, perhaps with other stakeholders.

Note that with respect to planning, from my side it's fine to wait a bit, as long as we have enough time left to integrate and provide feedback on what wp2 eventually provides.

mariojmdavid commented 2 years ago

Pragmatically, we will deploy the local STAC metadata catalog. This will allow the users and this use case to advance. We will discuss internally the possibility of being part of the data federation, which was not foreseen in the project. But we have come this far, so it may make sense, and besides, we are very interested in having this for national purposes.

This does not in any way hinder the developments in WP2 and, afaik, the use case will be prepared for it when it comes.

backeb commented 2 years ago

If we do the same with INFN:

We could, in this way, start building a distributed landsat data archive in Europe attached to a processing backend (openEO) and FAIR via the Metadata Query Service. I expect a broader community beyond the project use cases would benefit from this. And it could pave the way for distributed archives. Maybe? @sustr4 @mariojmdavid @gdonvito

sustr4 commented 2 years ago

   Maybe? @sustr4 @mariojmdavid @gdonvito

You know, all this is "very nice to have". And everybody wants to have nice things, I agree. But we have promised (and named our project after) Copernicus so setting crucial resources aside to cater for LandSat in this very nice but somewhat underfunded project is to be seriously considered.

Just my $0,02.

Zdeněk

backeb commented 2 years ago

   Maybe? @sustr4 @mariojmdavid @gdonvito

   You know, all this is "very nice to have". And everybody wants to have nice things, I agree. But we have promised (and named our project after) Copernicus so setting crucial resources aside to cater for LandSat in this very nice but somewhat underfunded project is to be seriously considered.

   Just my $0,02.

   Zdeněk

Fair comment! The Landsat data is valuable though, especially for longer time series analysis. And being able to combine Landsat data with Sentinel would be really valuable.

We have other use cases as well, so Copernicus will be taken care of.

backeb commented 2 years ago

Sprint 2 retro

Top: What worked well?

Will the providers have to implement a local STAC catalogue for the MDQS anyway?

Objectives for next sprint

Date of next sprint

AOB

zbenta commented 2 years ago

@jdries when can we schedule a meeting? We already have the stac server and the openeo endpoint available. We just need to take care of the integration of the stac catalog into the openeo service.

jdries commented 2 years ago

Great, would Thursday afternoon work? If you have a URL for the STAC service, could you forward it? Then I can start by having a look.

zbenta commented 2 years ago

Hi @jdries, Thursday is not a good day for us (national holiday). We are available on Friday 3rd (whole day), Monday 6th (morning), Tuesday 7th (whole day), Thursday 9th (morning) and Friday 10th (I'm in love).

jdries commented 2 years ago

Friday the 3rd, 2PM?

zbenta commented 2 years ago

Sure, just tell us which time zone you are using :-D

jdries commented 2 years ago

Brussels time, shall I create an invite, that should also help with time zones? Just let me know who to send it to.

v-miguel commented 2 years ago

Hello Jeroen,

Please use this Zoom meeting link: https://videoconf-colibri.zoom.us/j/81201778130?pwd=U1F0bS91QjZ4SzRBRm9XeGFydkhvZz09

We are already here.

Cumprimentos / Best Regards,

Miguel Viana INCD @ LIP - Universidade do Minho

zbenta commented 2 years ago

New meeting URL:

https://videoconf-colibri.zoom.us/j/81386195303?pwd=ajhDdG1VcmM0b2JDNlBJWTZ6S1Q2Zz09

Cumprimentos / Best Regards, Zacarias Benta

v-miguel commented 2 years ago

Hello all,

Regarding the local STAC catalogue, since it is available to everyone through stac.a.incd.pt and stac-browser.a.incd.pt, there are some points we should perhaps be concerned about:

We recognize that the project developers need access to the STAC catalog for debugging purposes. A possible solution would therefore be to restrict access to stac.a.incd.pt and stac-browser.a.incd.pt to specific IPs.
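A minimal sketch of such an allowlist check, assuming the permitted developers can be described by a short list of networks. The networks below are reserved documentation ranges used as placeholders, not real project addresses. In practice this kind of rule would more likely live in the web server or firewall configuration than in application code.

```python
import ipaddress

# Hypothetical allowlist: documentation ranges stand in for developer networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(client_ip, networks=ALLOWED_NETWORKS):
    """Return True if `client_ip` falls inside one of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address.version == network.version and address in network
        for network in networks
    )
```

A request handler in front of the catalog could call `is_allowed()` with the client address and reject everything else.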