Ingestion of common/shared data

sonjastndl commented 5 months ago

Hi everyone,

I am curious about the modalities of the shared s3 bucket on EOX Hub.

If I access data at the EOX Hub (example: requesting a .nc file on Surface Map for the year 2019 via the CDS API), I currently store that data in our UC-specific bucket. I can imagine the same happening for other Use Cases and datasets, and I am wondering whether we end up with a lot of redundant data across the UC-specific buckets.

Are there any recommendations from the EOX side on how to handle this, @Schpidi @eox-cs1? Just copying files to the common s3 bucket might cause some chaos due to naming and other reasons, and would not by itself avoid redundancy. I assume we will talk about this in next week's meeting, but it would still be good to know whether there are already foreseen "best practices" and/or restrictions that need to be considered.
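To make the naming concern concrete, here is a minimal sketch of one possible convention: derive the object key from the request parameters, so that two users asking for the same data end up with the same key and can detect an existing copy. The function name, the "cds" prefix, and the parameter names are hypothetical and only serve as illustration; nothing here is an agreed FAIRiCUBE convention.

```python
import hashlib
import json


def shared_object_key(source: str, request: dict, extension: str) -> str:
    """Build a deterministic key from a data source name and its request parameters."""
    # Serialise with sorted keys so that parameter order does not change the hash.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{source}/{digest}.{extension}"


# Hypothetical request parameters; the same request in a different order
# maps to the same key.
a = shared_object_key("cds", {"variable": "surface_map", "year": 2019}, "nc")
b = shared_object_key("cds", {"year": 2019, "variable": "surface_map"}, "nc")
print(a == b)
```

A hash alone is not human readable, so a real convention would probably combine it with a descriptive folder structure agreed on by the UC teams.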

@KathiSchleidt @Susannaioni

KathiSchleidt commented 5 months ago

@eox-cs1 @Schpidi we really need feedback on how to structure the S3 buckets. How can UC partners:

* make commonly accessible data available to all UCs
* keep sensitive data constrained to their UC
* access both common and sensitive data, and via which server?

Related to the issue on server configuration, #64

eox-cs1 commented 5 months ago

* make commonly accessible data available to all UC
All UCs have access to a folder located at /shared/fairicube or /home//.shared/fairicube. In and below this folder, all FAIRiCUBE UCs can store data that is accessible to all UCs.
* keep sensitive data constrained to their UC
Every UC has its own associated s3 bucket, which is accessible only by that UC.
* access both common and sensitive data via which server?
All UCs have access to a folder located at /shared/fairicube or /home//.shared/fairicube.

The respective kernel (to be selected at the beginning of a session) influences the software tools available (as requested by each UC), but has no influence on the availability of the shared/private folders (see above; see also https://github.com/FAIRiCUBE/FAIRiCUBE-Hub-issue-tracker/issues/64).
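As a quick check from inside a session, the following sketch looks for the shared folder at the two locations mentioned above and returns the first one that exists. It assumes the home-directory variant resolves to ~/.shared/fairicube for the current user; the function name is made up for this example.

```python
from pathlib import Path
from typing import Iterable, Optional

# The two locations mentioned in this thread; the second is resolved
# relative to the current user's home directory (an assumption).
CANDIDATES = (Path("/shared/fairicube"), Path.home() / ".shared" / "fairicube")


def find_shared_folder(candidates: Iterable[Path] = CANDIDATES) -> Optional[Path]:
    """Return the first candidate that exists as a directory, or None."""
    for path in candidates:
        if path.is_dir():
            return path
    return None


print(find_shared_folder())
```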

KathiSchleidt commented 5 months ago

@eox-cs1 where is this documented?

BachirNILU commented 5 months ago

Hi @eox-cs1,

Thank you for your responses! I have a very specific question. I have uploaded data to the s3 bucket under the UC4 server option, so I understand this bucket is specific to UC4 users. The resources (RAM) were not sufficient to run my code, so I plan to use another server option (UC1) that has 60 GB of RAM, which would be sufficient. To avoid duplicating the data, I want to run the code after connecting to the UC1 server option and access the data in UC4's s3 bucket. How can I do that?

Thanks in advance,

Best regards,

-Bachir.

eox-cs1 commented 5 months ago

Direct cross-UC access is not foreseen; it would somewhat contradict the separation.

However, you can either use the common /shared/fairicube folder for data exchange/access, OR take the respective secrets from the desired storage and manually inject them into your environment (in addition to the use-case-specific ones), which should also give you access to the UC4 s3 bucket from UC1, OR ask for a larger machine for UC4 (if you will need this more frequently).
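For the second option, a minimal sketch of what "injecting" the secrets could look like in Python with boto3 is shown below. It assumes you have already obtained the access key and secret key of the other UC's bucket; the helper names are hypothetical, and whether an explicit endpoint URL is needed depends on the storage setup, which this thread does not specify.

```python
from typing import Optional


def s3_client_kwargs(access_key: str, secret_key: str,
                     endpoint_url: Optional[str] = None) -> dict:
    """Collect the keyword arguments for an S3 client with explicit credentials."""
    kwargs = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    # Only pass an endpoint when one is given; otherwise boto3 uses its default.
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def make_s3_client(**kwargs):
    """Create the client; boto3 is imported lazily so the helper above has no dependency."""
    import boto3
    return boto3.client("s3", **kwargs)


# Usage (placeholders, not real values):
# client = make_s3_client(**s3_client_kwargs("<access key>", "<secret key>"))
# client.list_objects_v2(Bucket="<name of the UC4 bucket>")
```

Passing the credentials explicitly keeps them separate from the default ones of the current session, so both buckets stay usable side by side.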

sonjastndl commented 5 months ago

@mari-s4e Hey Maria, this is somehow contradictory to where we have been exchanging data, right? So for accessing the shared folder, none of the actions documented in the FAIRiCUBE Notebook are necessary?

Because data there is not located under shared... This is btw also the issue here. Can someone explain the difference?

BachirNILU commented 5 months ago

Thanks @eox-cs1! I think a direct cross-UC access makes sense with respect to users (limited access). For instance, I am involved in both UC1 and UC4. A cross-UC access to "only" UC1 and UC4 would make sense. This can be adapted to each user. For the suggested options, how can I use option 2 (access UC4 bucket from UC1)? What commands should I run?

Thanks in advance.

Schpidi commented 5 months ago

@BachirNILU I believe in your case the simplest solution would be to add another profile option to UC4. Note that this per se does not incur costs; costs only arise when you run a session.

If you want to use data from a use-case-specific bucket anywhere else, you can retrieve the required details, such as access keys, from the environment variables, for example with a command like printenv | grep S3_USER.
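The same lookup can be done from a notebook cell. The sketch below filters the environment for variable names containing S3_USER and masks the values before printing, so that secrets do not end up in saved notebook output. Note that it matches on variable names only, whereas grep on printenv output would also match values; the helper names are made up for this example.

```python
import os
from typing import Mapping, Optional


def matching_env(pattern: str, environ: Optional[Mapping[str, str]] = None) -> dict:
    """Return the environment variables whose name contains the given pattern."""
    env = os.environ if environ is None else environ
    return {name: value for name, value in env.items() if pattern in name}


def mask(value: str, visible: int = 4) -> str:
    """Show only the first few characters of a value and hide the rest."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


for name, value in sorted(matching_env("S3_USER").items()):
    print(f"{name}={mask(value)}")
```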

BachirNILU commented 5 months ago

Thank you @Schpidi for your response! This is great! If the additional UC4 profile option does not entail additional costs (unless used of course), that works perfectly for us (given that we do not use such memory frequently). A RAM of 60 GB will be great. Thanks again!

Schpidi commented 5 months ago

@sonjastndl sorry, there is a little misunderstanding that I might have caused in the last call.

We offer two types of storage, "File Storage" and "Object Storage", which have slightly different capabilities.

Per default in a JupyterLab session you get your personal workspace as well as a shared folder (/shared/fairicube/ or ~/.shared/fairicube/) which are both persisted on File Storage.

In addition we provide Object Storage to each Use Case separately for example to use with Sentinel Hub. This Object Storage is for convenience mounted to ~/s3 for each user but preferably used via the s3 protocol.
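To illustrate the relation between the mounted view and the s3 protocol view, the sketch below translates a path under the ~/s3 mount into an s3:// URI. It assumes the mount root corresponds to the root of the bucket, which is not confirmed in this thread, and the bucket name used is a placeholder.

```python
from pathlib import Path


def mounted_path_to_s3_uri(path: Path, bucket: str, mount_root: Path) -> str:
    """Translate a file path below the mount root into an s3:// URI."""
    # relative_to raises ValueError if the path is not below the mount root.
    relative = path.relative_to(mount_root)
    return f"s3://{bucket}/{relative.as_posix()}"


mount = Path.home() / "s3"
# "my-uc-bucket" and the file name are hypothetical.
print(mounted_path_to_s3_uri(mount / "data" / "example.nc", "my-uc-bucket", mount))
```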

On top of this we were asked to provide shared Object Storage accessible to all Use Cases. This is the fairicube bucket where access keys are shared for usage. This Object Storage is not automatically mounted in the JupyterLab session.

I hope this clarifies your questions.

Schpidi commented 5 months ago

@BachirNILU we'll roll out the required configuration tomorrow and enable a large profile for UC4.

Schpidi commented 5 months ago

@BachirNILU the big UC4 profile (Server Option) is now available

Schpidi commented 5 months ago

With this I believe we can come back to the original question 😉

From a technical point of view the questions to answer are:

* Do I need the data only locally, i.e., in JupyterLab? --> Use your local workspace
* Do I want to share the data with all users of FAIRiCUBE locally? --> Use the shared folder
* Do I want to share the data with all users in my Use Case or do I need external access like via Sentinel Hub services? --> Use the UC bucket
* Do I want to share the data with all users of FAIRiCUBE and need external access? --> Use the shared bucket
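The four questions above can be written down as a small decision function. This is only one reading of the list: the audience labels ("me", "use_case", "all") and the function name are invented for this sketch, and the case "only me, but external access needed" is mapped to the UC bucket based on the "or do I need external access" part of the third question.

```python
def choose_storage(audience: str, external_access: bool) -> str:
    """Map who needs the data and whether external access is needed to a storage option."""
    if audience not in ("me", "use_case", "all"):
        raise ValueError(f"unknown audience: {audience!r}")
    if audience == "all":
        # Shared with every FAIRiCUBE user: bucket if external access is needed.
        return "shared bucket" if external_access else "shared folder"
    if audience == "use_case" or external_access:
        # Shared within the Use Case, or external access such as Sentinel Hub.
        return "UC bucket"
    return "local workspace"


for audience in ("me", "use_case", "all"):
    for external in (False, True):
        print(audience, external, "->", choose_storage(audience, external))
```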

How to organize data on the bucket doesn't matter from a technical point of view but, I agree, it should be agreed on within a UC team or among all FAIRiCUBE users, and documented.

KathiSchleidt commented 5 months ago

@Schpidi many thanks for the clarification, but I fear now I understand exactly nothing :(

I read about "File Storage" and "Object Storage" which have slightly different capabilities, but find no indication of what these slight differences are.

I then find either 3 or 4 options of where to put data:

* Per default in a JupyterLab session you get your personal workspace as well as a shared folder (/shared/fairicube/ or ~/.shared/fairicube/) which are both persisted on File Storage.
* In addition we provide Object Storage to each Use Case separately for example to use with Sentinel Hub. This Object Storage is for convenience mounted to ~/s3 for each user but preferably used via the s3 protocol.
* On top of this we were asked to provide shared Object Storage accessible to all Use Cases. This is the fairicube bucket where access keys are shared for usage. This Object Storage is not automatically mounted in the JupyterLab session.

vs.

* Do I need the data only locally, i.e., in JupyterLab? --> Use your local workspace
* Do I want to share the data with all users of FAIRiCUBE locally? --> Use the shared folder
* Do I want to share the data with all users in my Use Case or do I need external access like via Sentinel Hub services? --> Use the UC bucket
* Do I want to share the data with all users of FAIRiCUBE and need external access? --> Use the shared bucket

On the common bucket, I've found some documentation in the FAIRiBOOK, which mentions a requirement to install the s3browser.

TL;DR: the more I read, the less I see. When can we expect clear documentation on this?

Schpidi commented 5 months ago

@KathiSchleidt there are always 4 options:

* Your workspace
* Shared folder in workspace
* UC bucket
* Shared bucket

The different capabilities specific to FiC are also mentioned: "... to use with Sentinel Hub." In general the difference is: "... used via the s3 protocol." from anywhere vs. normal file system only available in JupyterLab.

What are you missing?

KathiSchleidt commented 5 months ago

@Schpidi what I'm missing is a clean description of these various dimensions (objects vs. files, buckets vs. filesystem) and of the options for providing and using this data. I admit I'm exceptionally confused due to my being less active in FAIRiCUBE the last months, but based on discussions with UC partners, it seems I'm not the only one.

I have the impression that the applicability of APIs is also somehow related (still waiting on that answer, now close to 5 months :( ). Please clarify where we can apply APIs.

When can we expect this to be clearly explained in RTD?

BachirNILU commented 5 months ago

> @BachirNILU the big UC4 profile (Server Option) is now available

Thanks! It works.

KathiSchleidt commented 2 months ago

@Schpidi am I correct that there is no ambition to document this? I just checked the adding datasets section on RTD, nothing there.

Schpidi commented 2 months ago

Added a first guide on storage to RTD for review, either at https://fairicube--8.org.readthedocs.build/en/8/guide/storage/ or FAIRiCUBE/collaboration-platform#8. Happy to read your feedback.

FAIRiCUBE / FAIRiCUBE-Hub-issue-tracker

Ingestion of common/shared data #61