eth-cscs / manta

Another CLI for Alps
https://eth-cscs.github.io/manta/
BSD 3-Clause "New" or "Revised" License
FEATURE: Add support for jinja2 templating on SAT file #31

Closed Masber closed 8 months ago

Masber commented 8 months ago

The SAT file used to deploy clusters is currently static. We would like to add support for jinja2 templating, so that values in the SAT file can be supplied from a separate variables file. An example would be something like:

manta apply cluster -f <SAT file> --session-vars <session vars file>

With the SAT file being:

# (C) Copyright 2022-2023 Hewlett Packard Enterprise Development LP
---
schema_version: 1.0.2
configurations:
- name: "{{default.note}}-compute-config-{{default.suffix}}"
  layers:
# The gpu_customize_driver_playbook.yml playbook will install GPU driver and
# SDK/toolkit software into the compute boot image if GPU content is available
# in the expected Nexus repo targets. If GPU content has not been uploaded to
# Nexus this play will be skipped automatically. If GPU content is available in
# Nexus but a non-gpu image is wanted this layer can be commented out.
#BEGIN_GPU_SUPPORT
  - name: uss-gpu-customize-driver-playbook-{{uss.working_branch}}
    playbook: gpu_customize_driver_playbook.yml
    product:
      name: uss
      version: "{{uss.version}}"
      branch: "{{uss.working_branch}}"
    special_parameters:
      ims_require_dkms: true
#END_GPU_SUPPORT
  - name: shs-{{default.network_type}}_install-{{slingshot_host_software.working_branch}}
    playbook: shs_{{default.network_type}}_install.yml
    product:
      name: slingshot-host-software
      version: "{{slingshot_host_software.version}}"
      branch: "{{slingshot_host_software.working_branch}}"
    special_parameters:
      ims_require_dkms: true
  - name: cscs-interfaces
    playbook: cscs-interfaces.yml
    git: 
      url: https://api-gw-service-nmn.local/vcs/cray/cscs-config-management.git
      branch: cscs-23.07.0
  - name: cos-compute-{{uss.working_branch}}
    playbook: cos-compute.yml
    product:
      name: uss
      version: "{{uss.version}}"
      branch: "{{uss.working_branch}}"
    special_parameters:
      ims_require_dkms: true
# The gpu_customize_net_playbook.yml playbook installs GPU network-dependent
# software and any additional GPU packages needed. The playbook will run by
# default if GPU content is available in Nexus, and will be skipped if not. If
# a non-gpu compute-only image is required this layer can be commented out.
#BEGIN_GPU_SUPPORT
  - name: uss-gpu-customize-net-playbook-{{uss.working_branch}}
    playbook: gpu_customize_net_playbook.yml
    product:
      name: uss
      version: "{{uss.version}}"
      branch: "{{uss.working_branch}}"
    special_parameters:
      ims_require_dkms: true
#END_GPU_SUPPORT
  - name: csm-packages-{{csm.version}}
    playbook: csm_packages.yml
    product:
      name: csm
      version: "{{csm.version}}"
  - name: csm-diags-compute-{{csm_diags.version}}
    playbook: csm-diags-compute.yml
    product:
      name: csm-diags
      version: "{{csm_diags.version}}"
  - name: sma-ldms-compute-{{sma.version}}
    playbook: sma-ldms-compute.yml
    product:
      name: sma
      version: "{{sma.version}}"
#  - name: cpe-pe_deploy-{{cpe.working_branch}}
#    playbook: pe_deploy.yml
#    product:
#      name: cpe
#      version: "{{cpe.version}}"
#      branch: "cscs-23.07.0"
##BEGIN_SLURM_SUPPORT
#  - name: slurm-site-{{slurm.working_branch}}
#    playbook: site.yml
#    product:
#      name: slurm
#      version: "{{slurm.version}}"
#      branch: "{{slurm.working_branch}}"
##END_SLURM_SUPPORT
  - name: cscs
    playbook: site.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/cscs-config-management.git
      branch: cscs-23.07.0
  - name: nomad
    playbook: site-client.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/nomad_orchestrator.git
      branch: main
  - name: cos-compute-last-{{uss.working_branch}}
    playbook: cos-compute-last.yml
    product:
      name: uss
      version: "{{uss.version}}"
      branch: "{{uss.working_branch}}"
    special_parameters:
      ims_require_dkms: true

images:
# Uncomment the lines below if ARM images are needed.
#BEGIN_AARCH64_SUPPORT
- name: "{{default.note}}-compute-{{default.suffix}}"
  ref_name: compute_image.aarch64
  base:
    ims: 
      name: "gracehopper-uss-1.0.0-58-csm-1.5.aarch64-1"
      type: image
  configuration: "{{default.note}}-compute-config-{{default.suffix}}"
  configuration_group_names:
  - Compute
  - prealps
  - santis
#END_AARCH64_SUPPORT

session_templates:
# Uncomment the lines below if ARM session templates are needed.
#BEGIN_AARCH64_SUPPORT
- name: "{{default.note}}-compute-template-{{default.suffix}}"
  image:
    image_ref: compute_image.aarch64
  configuration: "{{default.note}}-compute-config-{{default.suffix}}"
  bos_parameters:
    boot_sets:
      compute:
        arch: ARM
        kernel_parameters: ip=dhcp quiet ksocklnd.skip_mr_route_setup=1 cxi_core.disable_default_svc=0 spire_join_token=${SPIRE_JOIN_TOKEN}
        node_roles_groups:
        - Compute
        - prealps
        - santis
        rootfs_provider_passthrough: "dvs:api-gw-service-nmn.local:300:hsn0,nmn0:0"
- name: "{{default.note}}-compute-template-{{default.suffix}}-ramdisk"
  image:
    image_ref: compute_image.aarch64
  configuration: "{{default.note}}-compute-config-{{default.suffix}}"
  bos_parameters:
    boot_sets:
      compute:
        arch: ARM
        kernel_parameters: ip=dhcp quiet ksocklnd.skip_mr_route_setup=1 cxi_core.disable_default_svc=0 spire_join_token=${SPIRE_JOIN_TOKEN}
        node_roles_groups:
        - Compute
        - prealps
        - santis
        rootfs_provider_passthrough: "dvs:api-gw-service-nmn.local:300:hsn0,nmn0:1"
#END_AARCH64_SUPPORT

And the session vars file being:

---

base_image: "gracehopper-base-cscs-uss-1.0.0-58-csm-1.5.aarch64-shs-2.1.1-64-cos-3.0-aarch64-compute-image-20"

default:
  network_type: cassini
  note: 'santis'
  suffix: 23.11.0-beta.5-9
  wlm: slurm
  working_branch: "cscs-23.07.0"

slingshot:
  version: 2.1.1-894

slingshot-host-software:
  version: 2.1.1-64-cos-3.0-aarch64
  working_branch: cscs-23.07.0

sma:
  version: 1.9.5

uan:
  version: 2.7.1
  working_branch: cscs-23.07.0

uss:
  version: 1.0.0-58-csm-1.5
  working_branch: cscs-23.07.0-no-nvhpc

Masber commented 8 months ago

@miguelgila it is not clear to me how the session vars file is generated or where it comes from. Is the schema of this file fixed? The template references the field csm_diags.version, which does not exist in the session vars file.
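A mismatch like this can be caught mechanically before rendering. The following is a minimal sketch, not manta's implementation: it uses a small regular expression as a stand-in for a real Jinja2 parser, handles only plain `{{ dotted.path }}` placeholders, and hard-codes a subset of the session vars as a Python dict (loading the YAML file would need a third-party parser, which is left out here). The names `lookup` and `missing_variables` are made up for illustration.

```python
import re

# Matches simple placeholders such as {{ uss.version }} or {{default.note}}.
# This is a simplification: real Jinja2 expressions can be far richer.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(variables, dotted_path):
    """Return (found, value) for a dotted path in a nested dict."""
    current = variables
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False, None
        current = current[part]
    return True, current


def missing_variables(template_text, variables):
    """List placeholder paths used in the template but absent from variables."""
    missing = []
    for path in PLACEHOLDER.findall(template_text):
        found, _ = lookup(variables, path)
        if not found and path not in missing:
            missing.append(path)
    return missing


# Lines excerpted from the SAT template in this issue (treated as plain text).
template = """
- name: "{{default.note}}-compute-config-{{default.suffix}}"
  - name: uss-gpu-customize-driver-playbook-{{uss.working_branch}}
      version: "{{slingshot_host_software.version}}"
      version: "{{csm_diags.version}}"
"""

# Subset of the session vars file from this issue, written as a Python dict.
session_vars = {
    "default": {"note": "santis", "suffix": "23.11.0-beta.5-9"},
    "slingshot-host-software": {"version": "2.1.1-64-cos-3.0-aarch64"},
    "uss": {"version": "1.0.0-58-csm-1.5", "working_branch": "cscs-23.07.0-no-nvhpc"},
}

print(missing_variables(template, session_vars))
# ['slingshot_host_software.version', 'csm_diags.version']
```

Besides csm_diags.version, the check also flags slingshot_host_software.version: the vars file spells that key with hyphens (slingshot-host-software), while the template uses underscores, so the two names do not match.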

miguelgila commented 8 months ago

This particular vars file seems to have been taken from another place in CSM; it lists all the components of what I think is one of their recipes. We can simplify or clean it up as much as we want: the only vars needed are the ones referenced in the SAT YAML file.

Please note that not all the fields in the SAT file are templatable, for example the IMS recipe or image name. This is sub-optimal, as one would want to keep the SAT file clean and use vars everywhere. Maybe this is something we can do in manta?
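One way a tool could make every field templatable is to render the SAT file as plain text before it is parsed as YAML, so a placeholder can sit in any value, including the IMS image name. The sketch below illustrates that idea under stated assumptions: it is not how manta or SAT is known to work, the `render` helper is hypothetical, and a regular expression stands in for a real Jinja2 engine (only plain `{{ dotted.path }}` placeholders are handled). The image entry with a templated `ims.name` is an invented variation on the SAT file above.

```python
import re

# Simplified stand-in for Jinja2: only {{ dotted.path }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render(template_text, variables):
    """Replace each {{ dotted.path }} with its value; fail on unknown names."""

    def substitute(match):
        value = variables
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                raise KeyError(f"undefined template variable: {match.group(1)}")
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(substitute, template_text)


# Hypothetical image entry where the IMS base image name is a placeholder too.
template = """images:
- name: "{{ default.note }}-compute-{{ default.suffix }}"
  base:
    ims:
      name: "{{ base_image }}"
      type: image
"""

# Values taken from the session vars file in this issue.
variables = {
    "base_image": "gracehopper-base-cscs-uss-1.0.0-58-csm-1.5.aarch64-shs-2.1.1-64-cos-3.0-aarch64-compute-image-20",
    "default": {"note": "santis", "suffix": "23.11.0-beta.5-9"},
}

print(render(template, variables))
```

Because substitution happens on the raw text, the renderer does not need to know which YAML keys exist. The trade-off is that a value containing YAML-significant characters could change the structure of the rendered file, so the output still has to be parsed and validated afterwards.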

miguelgila commented 8 months ago

@Masber this is a more realistic variables file:

---

base_image: "gracehopper-base-cscs-uss-1.0.0-58-csm-1.5.aarch64-shs-2.1.1-64-cos-3.0-aarch64-compute-image-20"

default:
  network_type: cassini
  note: 'santis'
  suffix: 23.11.0-beta.5-9
  wlm: slurm
  working_branch: "cscs-23.07.0"

slingshot:
  version: 2.1.1-894

slingshot-host-software:
  version: 2.1.1-64-cos-3.0-aarch64
  working_branch: cscs-23.07.0

sma:
  version: 1.9.5

uan:
  version: 2.7.1
  working_branch: cscs-23.07.0

uss:
  version: 1.0.0-58-csm-1.5
  working_branch: cscs-23.07.0-no-nvhpc

As you can see, some of those fields have been copied from the same location as the previous one, but others, like base_image, are completely arbitrary and were created by us.

Masber commented 8 months ago

Example of a SAT template file:

❯ cat sat-file/sat_file-zinal-cta-client-template.yaml
configurations:
- name: "{{ config.name }}-{{ config.version }}"
  layers:
  - name: ss11
    playbook: shs_cassini_install.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/slingshot-host-software-config-management.git
      branch: integration
  - name: cos
    playbook: site.yml
    product:
      name: cos
      version: 2.3.101
      branch: integration
  - name: cscs
    playbook: site.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/cscs-config-management.git
      branch: cscs-23.06.0
  - name: nomad-orchestrator
    playbook: site-client.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/nomad_orchestrator.git
      branch: main
images:
- name: zinal-nomad-{{ image.version }}
  ims:
    is_recipe: false
    id: 4bf91021-8d99-4adf-945f-46de2ff50a3d
  configuration: "{{ config.name }}-{{ config.version }}"
  configuration_group_names:
  - Compute
  - "{{ hsm.group_name }}"

session_templates:
- name: "{{ bos_st.name }}"
  image: zinal-image-v0.5
  configuration: "{{ config.name }}-{{ config.version }}"
  bos_parameters:
    boot_sets:
      compute:
        kernel_parameters: ip=dhcp quiet spire_join_token=${SPIRE_JOIN_TOKEN}
        node_groups:
        - "{{ hsm.group_name }}"

And the values file:

❯ cat sat-file/sat_file-zinal-cta-client-values.yaml
---
hsm:
  group_name: "zinal_cta"
config:
  name: "test-config"
  version: "v1.0.0"
image:
  version: "v1.0.5"
bos_st:
  name: "deploy-cluster-action"
  version: "v1.0"

And the resulting rendered file:

manta a cluster -f sat-file/sat_file-zinal-cta-client-template.yaml -V sat-file/sat_file-zinal-cta-client-values.yaml
DEBUG SAT file rendered:
configurations:
- name: "test-config-v1.0.0"
  layers:
  - name: ss11
    playbook: shs_cassini_install.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/slingshot-host-software-config-management.git
      branch: integration
  - name: cos
    playbook: site.yml
    product:
      name: cos
      version: 2.3.101
      branch: integration
  - name: cscs
    playbook: site.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/cscs-config-management.git
      branch: cscs-23.06.0
  - name: nomad-orchestrator
    playbook: site-client.yml
    git:
      url: https://api-gw-service-nmn.local/vcs/cray/nomad_orchestrator.git
      branch: main
images:
- name: zinal-nomad-v1.0.5
  ims:
    is_recipe: false
    id: 4bf91021-8d99-4adf-945f-46de2ff50a3d
  configuration: "test-config-v1.0.0"
  configuration_group_names:
  - Compute
  - "zinal_cta"

session_templates:
- name: "deploy-cluster-action"
  image: zinal-image-v0.5
  configuration: "test-config-v1.0.0"
  bos_parameters:
    boot_sets:
      compute:
        kernel_parameters: ip=dhcp quiet spire_join_token=${SPIRE_JOIN_TOKEN}
        node_groups:
        - "zinal_cta"
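
A rendered file like the one above can be sanity-checked for template syntax that was left behind. This is a small sketch with a hypothetical helper name, `unrendered_placeholders`, and it is independent of manta: it scans for the default Jinja2 delimiters (`{{ }}`, `{% %}`, `{# #}`). The shell-style `${SPIRE_JOIN_TOKEN}` uses a different syntax, so it is not reported, which is consistent with it passing through the rendering unchanged in the output above.

```python
import re

# Default Jinja2 delimiters: {{ expr }}, {% statement %}, {# comment #}.
UNRENDERED = re.compile(r"\{\{.*?\}\}|\{%.*?%\}|\{#.*?#\}")


def unrendered_placeholders(rendered_text):
    """Return (line_number, text) pairs for template syntax left in the output."""
    hits = []
    for number, line in enumerate(rendered_text.splitlines(), start=1):
        for match in UNRENDERED.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Excerpt of the rendered SAT file shown in this issue.
rendered = """session_templates:
- name: "deploy-cluster-action"
  configuration: "test-config-v1.0.0"
  bos_parameters:
    boot_sets:
      compute:
        kernel_parameters: ip=dhcp quiet spire_join_token=${SPIRE_JOIN_TOKEN}
        node_groups:
        - "zinal_cta"
"""

print(unrendered_placeholders(rendered))
# []
print(unrendered_placeholders('- name: zinal-nomad-{{ image.version }}'))
# [(1, '{{ image.version }}')]
```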

Masber commented 8 months ago

Implemented in version v1.22.9.