c-scale-community / workflow-coastal-hydrowaq

Porting and deploying the HiSea use case on C-SCALE
Apache License 2.0

Sprint 1: 4-8 October 2021 #12

Closed · backeb closed this issue 2 years ago

backeb commented 2 years ago

The below architecture options could be addressed in individual sprints:

  1. Fully cloud based + boundary data is downloaded (this is the current workflow)
  2. Fully cloud based + boundary data is accessed through the provider's datastore (easy access to the data, avoids downloading it)
  3. Pre- and post-processing in the cloud + model running on HPC (working on a Singularity container)
  4. Fully HPC based + boundary data downloaded / accessed

The objective at the end is to compare performance and scalability between the 4 options.

Sprint 1: test option 1 and develop option 2

Dates: 4 - 8 Oct 2021

Data requirements and specs:

Put info for above 3 tasks here: https://confluence.egi.eu/x/Xx8mBg

Automate data downloads on provider side

- [ ] Instantiate another VM and work on automation (@backeb @sandragaytan)

Documentation

Performance testing

Document results here: https://confluence.egi.eu/x/Xx8mBg

backeb commented 2 years ago

Update: 6-Oct

  1. Regarding installing Delft3D FM natively on GRNET's HPC, we have decided not to do so and to use only the Delft3D FM Singularity container. The reasons for this are:

    • We have already compared Delft3D FM native install vs Singularity on SURF's HPC in a separate project
    • The effort to install Delft3D FM natively on an HPC is not insignificant, and we do not see the added value in doing so for this use case.
    Activity for the providers: please arrange access for @avgils to your HPC infrastructure so that she can start working on setting up the Singularity container. (@ntellgrnet)
  2. We will work on creating a recipe for installing the CMEMS MOTU and CDS API clients and setting up a cron job to automatically download the necessary data (a minimal sketch of such a download script is shown after this list). Activity for the providers:

    • Take the installation recipe (including the automatic download cron job) and create a TOSCA (?) template or image for easy redeployment of the CMEMS MOTU and CDS API clients across C-SCALE providers.
    • Take the CMEMS MOTU and CDS API clients and the cron job and adjust them so that the necessary data are downloaded to the NFS server once per day. (@yan0s @nikosT)
  3. We will work on generalising the pre- and post-processing containers. To do so we will need workflow tooling, namely:

    • Argo (which we can install)
    • Kubernetes clusters (which we need the providers to set up).
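For reference, a minimal sketch of the download script behind item 2 is shown below. It is written in Python (cdsapi and motuclient are Python packages); the dataset/product IDs, the output path and the credential handling are placeholders rather than the actual HiSea configuration, and the cron entry that would run it daily is indicated in the docstring.

```python
"""Illustrative daily-download sketch for item 2 above.

Placeholders (not the real HiSea configuration): the dataset/product IDs,
the /data/boundary output directory and the credentials taken from the
environment.  A cron entry on the download VM could invoke it once per day:

    0 3 * * * /usr/bin/python3 /opt/hisea/download_boundary.py
"""
import datetime as dt
import os
import subprocess

import cdsapi  # pip install cdsapi (needs ~/.cdsapirc with CDS credentials)

OUT_DIR = "/data/boundary"  # placeholder: directory exported via NFS
yesterday = (dt.date.today() - dt.timedelta(days=1)).isoformat()

# --- CDS API: meteorological forcing (placeholder request) ----------------
cds = cdsapi.Client()
cds.retrieve(
    "reanalysis-era5-single-levels",  # placeholder dataset name
    {
        "product_type": "reanalysis",
        "variable": ["10m_u_component_of_wind", "10m_v_component_of_wind"],
        "date": yesterday,
        "time": [f"{h:02d}:00" for h in range(24)],
        "format": "netcdf",
    },
    os.path.join(OUT_DIR, f"era5_{yesterday}.nc"),
)

# --- CMEMS MOTU client: ocean boundary conditions (placeholder IDs) --------
subprocess.run(
    [
        "python", "-m", "motuclient",  # pip install motuclient
        "--motu", "https://nrt.cmems-du.eu/motu-web/Motu",
        "--service-id", "<CMEMS_SERVICE_ID>",  # placeholder
        "--product-id", "<CMEMS_PRODUCT_ID>",  # placeholder
        "--date-min", yesterday, "--date-max", yesterday,
        "--out-dir", OUT_DIR,
        "--out-name", f"cmems_{yesterday}.nc",
        "--user", os.environ["CMEMS_USER"],
        "--pwd", os.environ["CMEMS_PWD"],
    ],
    check=True,
)
```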

cc @kkoumantaros @cchatzikyriakou @sebastian-luna-valero @enolfc @sustr4

nikosT commented 2 years ago
  1. Sure, no problem for us to run it in Singularity. For single-node (one machine) runs the performance will probably be similar. However, keep in mind that this does not use the full HPC capabilities, since Delft3D FM can run across many nodes (MPI standard). If you run the application on many nodes through Singularity, you will not get good performance, since the container has no knowledge of the network layer and network technology (Ethernet, InfiniBand, etc.).

1b.
  i) @avgils, register at https://sram.surf.nl/ for the C-SCALE-test-co and also add your public SSH key there.
  ii) Then send an e-mail to support@hpc.grnet.gr with your IP pool, which will be used to access ARIS.

2b. The VM operators/users have root access to perform all these actions. Is there a particular activity that only providers are able to implement?

PS: Data resources (CPU and storage) per project / use case / user should be defined in SRAM-LDAP soon.

cc @yan0s @ntellgrnet @kkoumantaros

backeb commented 2 years ago
> 1. Sure, no problem for us to run it in Singularity. For single-node (one machine) runs the performance will probably be similar. However, keep in mind that this does not use the full HPC capabilities, since Delft3D FM can run across many nodes (MPI standard). If you run the application on many nodes through Singularity, you will not get good performance, since the container has no knowledge of the network layer and network technology (Ethernet, InfiniBand, etc.).

Thanks. In discussions yesterday with our team working on Singularity, I was told that there are some performance differences between using the MPI library inside the Singularity container and telling the Singularity container to use the MPI library installed on the HPC. The latter requires a configuration step telling the container where the MPI library is installed on the HPC.

Perhaps we can test those two scenarios and evaluate their impact on performance.
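To make those two scenarios concrete, here is a rough sketch of the two launch modes; the image name, entry point, rank count and bind path are hypothetical, and the exact setup depends on how the container and the host MPI are built.

```python
"""Sketch of the two MPI launch modes discussed above.

Hypothetical names: delft3dfm.sif, run_dflowfm.sh, model.mdu and the
/opt/hostmpi bind path are placeholders, not the actual setup.
"""

IMAGE = "delft3dfm.sif"                       # placeholder Singularity image
MODEL_CMD = ["run_dflowfm.sh", "model.mdu"]   # placeholder model entry point

# Scenario A: MPI stack shipped inside the container.
# mpirun runs *inside* the image, so all ranks share one container instance;
# in practice this confines the run to a single node and to whatever network
# support the container's own MPI was built with.
inside_container = ["singularity", "exec", IMAGE,
                    "mpirun", "-n", "4", *MODEL_CMD]

# Scenario B: host MPI launches the container (Singularity hybrid/bind model).
# mpirun runs on the host and starts one container per rank, so the host MPI
# (built against the InfiniBand fabric) handles inter-node communication.
# Depending on the setup, the host MPI libraries are bind-mounted into the
# container and the container is told where to find them.
hybrid = ["mpirun", "-n", "4",
          "singularity", "exec", "--bind", "/opt/hostmpi:/opt/hostmpi",
          IMAGE, *MODEL_CMD]

# Printed here for illustration; on the HPC these would be submitted through
# the batch system.
for cmd in (inside_container, hybrid):
    print(" ".join(cmd))
```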

> 2b. The VM operators/users have root access to perform all these actions. Is there a particular activity that only providers are able to implement?

  1. We don't know how to create TOSCA templates or images from the VMs for further redeployment - this is for the providers to do.
  2. And we also don't know how to set up the NFS server.

So if the objective here is to have a TOSCA template / image of the CMEMS MOTU client that downloads to an NFS server accessible by multiple VMs / the login node of an HPC: we can install the CMEMS MOTU client, set up the cron job for automatic downloads and provide that as an example, but you will have to work out how you want to offer that as a service to other users.

> PS: Data resources (CPU and storage) per project / use case / user should be defined in SRAM-LDAP soon.

I expect the providers will inform us of the workflow once this is ready?

nikosT commented 2 years ago

One more comment on the Cloud - HPC interconnection: the storage facilities and policies at GRNET mean that the two infrastructures (Cloud & HPC) have separate storage. For this reason, the data produced by the pre-processing that needs to be passed to Delft3D FM, and vice versa, must be transferred via the SSH protocol. Thus, SSH-based tools should be used (scp, rsync over SSH, etc.).
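For illustration, such a transfer could look like the sketch below; the hostname, username and paths are placeholders, with rsync over SSH used as one of the suggested tools.

```python
"""Sketch of moving pre-processing output from the cloud VM to the HPC over
SSH.  Placeholders: the /data/preprocessed source, the user@host and the
remote /work/hisea/input path are not the actual configuration.
"""
import subprocess

SRC = "/data/preprocessed/"                                 # on the cloud VM
DST = "user@hpc-login.example.grnet.gr:/work/hisea/input/"  # placeholder HPC target

# -a archive mode, -v verbose, -z compress; "-e ssh" forces transport over SSH.
subprocess.run(["rsync", "-avz", "-e", "ssh", SRC, DST], check=True)

# The reverse direction (model output back to the VM for post-processing)
# is the same call with SRC and DST swapped.
```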

backeb commented 2 years ago

Sprint 1: Retro

Top: What worked well?

Tip: What to improve...

Sprint progress

https://confluence.egi.eu/pages/viewpage.action?pageId=103161695

Data requirements and specs

Automate data downloads on provider side

Documentation

Performance testing

Objective for next sprint

HPC

K8s service and workflow

General

sebastian-luna-valero commented 2 years ago

Hi,

I don't have permissions to edit the post, but here is the link to the "how to get access" for the GRNET HPC so far: https://confluence.egi.eu/display/CSCALE/Use+case%3A+HiSea#Usecase:HiSea-Howtogetaccess

I have also resent the invites to @avgils and @sandragaytan to join the HiSea CO in SRAM.

lorincmeszaros commented 2 years ago

@sebastian-luna-valero: @avgils and I followed the steps, uploaded the public SSH key to the SRAM profile and emailed the IP range (Deltares) to support@hpc.grnet.gr. Awaiting reply and instructions.

sebastian-luna-valero commented 2 years ago

Thanks @lorincmeszaros

Next step is to wait for support@hpc.grnet.gr to confirm access and follow their instructions.

Happy to help over here if you find issues.