c-scale-community / workflow-coastal-hydrowaq

Porting and deploying the HiSea use case on C-SCALE
Apache License 2.0

Sprint 1: 4-8 October 2021 #12

Closed · backeb closed this issue 2 years ago

backeb commented 2 years ago

The below architecture options could be addressed in individual sprints:

  1. Fully cloud based + boundary data is downloaded (this is the current workflow)
  2. Fully cloud based + boundary data is accessed through the provider's datastore (easy access to the data, avoids downloading it)
  3. Pre- and post-processing in the cloud + model running on HPC (working on a Singularity container)
  4. Fully HPC based + boundary data downloaded / accessed

The objective at the end is to compare performance and scalability between the 4 options.

Sprint 1: test option 1 and develop option 2

Dates: 4 - 8 Oct 2021

Data requirements and specs:

Put info for above 3 tasks here: https://confluence.egi.eu/x/Xx8mBg

Automate data downloads on provider side

- [ ] Instantiate another VM and work on automation (@backeb @sandragaytan)

Documentation

Performance testing

Document results here: https://confluence.egi.eu/x/Xx8mBg

backeb commented 2 years ago

Update: 6-Oct

  1. Regarding installing Delft3D FM natively on GRNET's HPC, we have decided not to do so and to use only the Delft3D FM Singularity container. The reasons for this are:

    • We have already compared Delft3D FM native install vs Singularity on SURF's HPC in a separate project
    • The effort to install Delft3D FM natively on an HPC is not insignificant, and we do not see the added value in doing so for this use case.
    Activity for the providers: please arrange access for @avgils to your HPC infrastructure so that she can start working on setting up the Singularity container. (@ntellgrnet)
  2. We will work on creating a recipe for installing the CMEMS MOTU and CDS API clients and setting up a cron job to automatically download the necessary data (a minimal sketch of such a download script is shown after this list). Activity for the providers:

    • Take the installation recipe (including the automatic download cron job) and create a TOSCA (?) template or image for easy redeployment of the CMEMS MOTU and CDS API clients across C-SCALE providers.
    • Take the CMEMS MOTU and CDS API clients and the cron job and adjust them so that the necessary data are downloaded to the NFS server once per day. (@yan0s @nikosT)
  3. We will work on generalising the pre- and post-processing containers. To do so we will need workflow tooling, namely:

    • Argo (which we can install)
    • Kubernetes clusters (which we need the providers to set up).
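For reference, a minimal sketch of the download script behind item 2 is shown below. It is written in Python (cdsapi and motuclient are Python packages); the dataset/product IDs, the output path and the credential handling are placeholders rather than the actual HiSea configuration, and the cron entry that would run it daily is indicated in the docstring.

```python
"""Illustrative daily-download sketch for item 2 above.

Placeholders (not the real HiSea configuration): the dataset/product IDs,
the /data/boundary output directory and the credentials taken from the
environment.  A cron entry on the download VM could invoke it once per day:

    0 3 * * * /usr/bin/python3 /opt/hisea/download_boundary.py
"""
import datetime as dt
import os
import subprocess

import cdsapi  # pip install cdsapi (needs ~/.cdsapirc with CDS credentials)

OUT_DIR = "/data/boundary"  # placeholder: directory exported via NFS
yesterday = (dt.date.today() - dt.timedelta(days=1)).isoformat()

# --- CDS API: meteorological forcing (placeholder request) ----------------
cds = cdsapi.Client()
cds.retrieve(
    "reanalysis-era5-single-levels",  # placeholder dataset name
    {
        "product_type": "reanalysis",
        "variable": ["10m_u_component_of_wind", "10m_v_component_of_wind"],
        "date": yesterday,
        "time": [f"{h:02d}:00" for h in range(24)],
        "format": "netcdf",
    },
    os.path.join(OUT_DIR, f"era5_{yesterday}.nc"),
)

# --- CMEMS MOTU client: ocean boundary conditions (placeholder IDs) --------
subprocess.run(
    [
        "python", "-m", "motuclient",  # pip install motuclient
        "--motu", "https://nrt.cmems-du.eu/motu-web/Motu",
        "--service-id", "<CMEMS_SERVICE_ID>",  # placeholder
        "--product-id", "<CMEMS_PRODUCT_ID>",  # placeholder
        "--date-min", yesterday, "--date-max", yesterday,
        "--out-dir", OUT_DIR,
        "--out-name", f"cmems_{yesterday}.nc",
        "--user", os.environ["CMEMS_USER"],
        "--pwd", os.environ["CMEMS_PWD"],
    ],
    check=True,
)
```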

cc @kkoumantaros @cchatzikyriakou @sebastian-luna-valero @enolfc @sustr4

nikosT commented 2 years ago
  1. Sure, no problem for us to run it in Singularity. For single-node (one machine) runs the performance will probably be similar. However, keep in mind that this does not use the full HPC capabilities, since Delft3D FM can run across many nodes (MPI standard). If you run the application on many nodes through Singularity, you will not get good performance, since the container has no knowledge of the network layer and network technology (Ethernet, InfiniBand, etc.).

1b.
  i) @avgils, register at https://sram.surf.nl/ for the C-SCALE-test-co and also add your public SSH key there.
  ii) Then send an e-mail to support@hpc.grnet.gr with your IP pool, which will be used to access ARIS.

2b. The VM operators/users have root access to perform all these actions. Is there a particular activity that only providers are able to implement?

PS: Data resources (CPU and storage) per project / use case / user should be defined in SRAM-LDAP soon.

cc @yan0s @ntellgrnet @kkoumantaros

backeb commented 2 years ago
> 1. Sure, no problem for us to run it in Singularity. For single-node (one machine) runs the performance will probably be similar. However, keep in mind that this does not use the full HPC capabilities, since Delft3D FM can run across many nodes (MPI standard). If you run the application on many nodes through Singularity, you will not get good performance, since the container has no knowledge of the network layer and network technology (Ethernet, InfiniBand, etc.).

Thanks. In discussions yesterday with our team working on Singularity, I was told that there are some performance differences between using the MPI library inside the Singularity container and telling the Singularity container to use the MPI library installed on the HPC. The latter requires a configuration step telling the container where the MPI library is installed on the HPC.

Perhaps we can test those two scenarios and evaluate their impact on performance.
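To make those two scenarios concrete, here is a rough sketch of the two launch modes; the image name, entry point, rank count and bind path are hypothetical, and the exact setup depends on how the container and the host MPI are built.

```python
"""Sketch of the two MPI launch modes discussed above.

Hypothetical names: delft3dfm.sif, run_dflowfm.sh, model.mdu and the
/opt/hostmpi bind path are placeholders, not the actual setup.
"""

IMAGE = "delft3dfm.sif"                       # placeholder Singularity image
MODEL_CMD = ["run_dflowfm.sh", "model.mdu"]   # placeholder model entry point

# Scenario A: MPI stack shipped inside the container.
# mpirun runs *inside* the image, so all ranks share one container instance;
# in practice this confines the run to a single node and to whatever network
# support the container's own MPI was built with.
inside_container = ["singularity", "exec", IMAGE,
                    "mpirun", "-n", "4", *MODEL_CMD]

# Scenario B: host MPI launches the container (Singularity hybrid/bind model).
# mpirun runs on the host and starts one container per rank, so the host MPI
# (built against the InfiniBand fabric) handles inter-node communication.
# Depending on the setup, the host MPI libraries are bind-mounted into the
# container and the container is told where to find them.
hybrid = ["mpirun", "-n", "4",
          "singularity", "exec", "--bind", "/opt/hostmpi:/opt/hostmpi",
          IMAGE, *MODEL_CMD]

# Printed here for illustration; on the HPC these would be submitted through
# the batch system.
for cmd in (inside_container, hybrid):
    print(" ".join(cmd))
```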

> 2b. The VM operators/users have root access to perform all these actions. Is there a particular activity that only providers are able to implement?

  1. We don't know how to create TOSCA templates or images from the VMs for further redeployment - this is for the providers to do.
  2. And we also don't know how to set up the NFS server.

So if the objective here is to have a TOSCA template / image of the CMEMS MOTU client that downloads to an NFS server accessible by multiple VMs / the login node of an HPC: we can install the CMEMS MOTU client, set up the cron job for automatic downloads and provide that as an example, but you will have to work out how you want to offer that as a service to other users.

> PS: Data resources (CPU and storage) per project / use case / user should be defined in SRAM-LDAP soon.

I expect the providers will inform us of the workflow once this is ready?

nikosT commented 2 years ago

One more comment on the Cloud - HPC interconnection: the storage facilities and policies at GRNET mean that the two infrastructures (Cloud & HPC) have separate storage. For this reason, the data produced by the pre-processing that needs to be passed to Delft3D FM, and vice versa, must be transferred via the SSH protocol. Thus, SSH-based tools should be used (scp, rsync over SSH, etc.).
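For illustration, such a transfer could look like the sketch below; the hostname, username and paths are placeholders, with rsync over SSH used as one of the suggested tools.

```python
"""Sketch of moving pre-processing output from the cloud VM to the HPC over
SSH.  Placeholders: the /data/preprocessed source, the user@host and the
remote /work/hisea/input path are not the actual configuration.
"""
import subprocess

SRC = "/data/preprocessed/"                                 # on the cloud VM
DST = "user@hpc-login.example.grnet.gr:/work/hisea/input/"  # placeholder HPC target

# -a archive mode, -v verbose, -z compress; "-e ssh" forces transport over SSH.
subprocess.run(["rsync", "-avz", "-e", "ssh", SRC, DST], check=True)

# The reverse direction (model output back to the VM for post-processing)
# is the same call with SRC and DST swapped.
```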

backeb commented 2 years ago

Sprint 1: Retro

Top: What worked well?

Tip: What to improve...

Sprint progress

https://confluence.egi.eu/pages/viewpage.action?pageId=103161695

Data requirements and specs

Automate data downloads on provider side

Documentation

Performance testing

Objective for next sprint

HPC

K8s service and workflow

General

sebastian-luna-valero commented 2 years ago

Hi,

I don't have permissions to edit the post, but here is the link to the "how to get access" for the GRNET HPC so far: https://confluence.egi.eu/display/CSCALE/Use+case%3A+HiSea#Usecase:HiSea-Howtogetaccess

I have also resent the invites to @avgils and @sandragaytan to join the HiSea CO in SRAM.

lorincmeszaros commented 2 years ago

@sebastian-luna-valero: @avgils and I followed the steps, uploaded the public SSH key to the SRAM profile and emailed the IP range (Deltares) to support@hpc.grnet.gr. Awaiting reply and instructions.

sebastian-luna-valero commented 2 years ago

Thanks @lorincmeszaros

Next step is to wait for support@hpc.grnet.gr to confirm access and follow their instructions.

Happy to help over here if you find issues.