Closed: ek-nag closed this 5 months ago
Hi,
This seems to be a limitation of Terraform/GCP, and there might not be an easy way to make the connection unique. This will need more investigation into whether moving `google_service_networking_connection` out of the SQL module and into the VPC module would be a better approach. Since the SQL module is a community/experimental module, we cannot make any specific commitment about a timeline.
This issue is stale because it has been open for 30 days with no activity.
Hi Eimantas,
In PR 2397 I propose decoupling the configuration of Private Service Access from the creation of the Cloud SQL instance. This would allow you to configure it once (for the lifetime of the VPC) and then use it across multiple clusters in that VPC, which I believe addresses your key concern here.
With this change, the following blueprint is valid (as an example):
```yaml
blueprint_name: two-clusters

vars:
  project_id: hpc-toolkit-demo
  deployment_name: clstr2
  region: us-east1
  zone: us-east1-b
  enable_reconfigure: True
  enable_cleanup_compute: False
  enable_cleanup_subscriptions: True
  enable_bigquery_load: True
  instance_image_custom: True

deployment_groups:
- group: net
  modules:
  - source: modules/network/vpc
    id: hpc_network
  - source: ./community/modules/network/private-service-access
    id: ps-connect
    use: [hpc_network]

- group: cluster1
  modules:
  - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    kind: terraform
    id: c1partition_0
    use:
    - c1partition_0-group
    - hpc_network
    settings:
      partition_name: c1batch
      enable_placement: False
      exclusive: False

  - source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    id: c1partition_0-group
    settings:
      enable_smt: False
      machine_type: c2-standard-60
      node_count_dynamic_max: 4
      node_count_static: 0
      disk_size_gb: 50
      disk_type: pd-standard

  - source: ./community/modules/database/slurm-cloudsql-federation
    kind: terraform
    id: c1slurm-sql
    use: [hpc_network, ps-connect]
    settings:
      sql_instance_name: sql-cluster-1
      tier: "db-g1-small"

  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    kind: terraform
    id: c1slurm_controller
    settings:
      cloud_parameters:
        resume_rate: 0
        resume_timeout: 500
        suspend_rate: 0
        suspend_timeout: 300
        no_comma_params: false
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 50
      slurm_cluster_name: cluster1
    use:
    - hpc_network
    - c1partition_0
    - c1slurm-sql

  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    kind: terraform
    id: c1slurm_login
    use: [c1slurm_controller, hpc_network]
    settings:
      num_instances: 1
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 50

- group: cluster2
  modules:
  - source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    kind: terraform
    id: c2partition_0
    use: [c2partition_0-group, hpc_network]
    settings:
      partition_name: c2batch
      enable_placement: False
      exclusive: False

  - source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
    id: c2partition_0-group
    settings:
      enable_smt: False
      machine_type: c2-standard-60
      node_count_dynamic_max: 4
      node_count_static: 0
      disk_size_gb: 50
      disk_type: pd-standard

  - source: ./community/modules/database/slurm-cloudsql-federation
    kind: terraform
    id: c2slurm-sql
    use: [hpc_network, ps-connect]
    settings:
      sql_instance_name: sql-cluster-2
      tier: "db-g1-small"

  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
    kind: terraform
    id: c2slurm_controller
    use: [hpc_network, c2partition_0, c2slurm-sql]
    settings:
      cloud_parameters:
        resume_rate: 0
        resume_timeout: 500
        suspend_rate: 0
        suspend_timeout: 300
        no_comma_params: false
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 50
      slurm_cluster_name: cluster2

  - source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
    kind: terraform
    id: c2slurm_login
    use: [c2slurm_controller, hpc_network]
    settings:
      num_instances: 1
      machine_type: n2-standard-2
      disk_type: pd-standard
      disk_size_gb: 50
```
Please note that the CloudSQL Federation module does not actually need to receive any value from `ps-connect`; the `use` entry only informs the Terraform dependency graph and prevents an attempt to create the Cloud SQL instance before Private Service Access is configured.
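For anyone curious what that dependency edge amounts to in the generated Terraform, here is a minimal sketch; the module paths are real, but the module, variable, and output names are illustrative assumptions, not the exact code ghpc emits:

```hcl
# Minimal sketch of the ordering relationship. Names such as project_id,
# deployment_name, and network_id are assumptions for illustration.

variable "project_id" { type = string }
variable "deployment_name" { type = string }

module "hpc_network" {
  source          = "./modules/network/vpc"
  project_id      = var.project_id
  deployment_name = var.deployment_name
}

module "private_service_access" {
  source     = "./community/modules/network/private-service-access"
  project_id = var.project_id
  network_id = module.hpc_network.network_id # assumed output name
}

module "slurm_sql" {
  source          = "./community/modules/database/slurm-cloudsql-federation"
  project_id      = var.project_id
  deployment_name = var.deployment_name
  network_id      = module.hpc_network.network_id # assumed input name

  # No value from private_service_access is consumed here; this explicit edge
  # alone makes Terraform finish Private Service Access before it creates the
  # Cloud SQL instance.
  depends_on = [module.private_service_access]
}
```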
I would invite you to try this out and tell me whether it fixes the problem.
Folks, this has now been merged to develop. I will close this for now.
Describe the bug
`module.slurm-sql.google_service_networking_connection.private_vpc_connection` is not unique per cluster. When two clusters are created in the same VPC and subnet, both of them rely on the same `google_service_networking_connection`. When one cluster is destroyed, the `google_service_networking_connection` is removed, breaking `slurm-sql` operation for the remaining cluster. The easiest solution would be adding a unique `name` argument to the `google_service_networking_connection` resource, the same as we do for `google_compute_global_address.private_ip_address`. However, the resource in the Terraform provider does not have a `name` argument.
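For reference, here is a hedged sketch of the two resources involved, using their documented provider arguments; the resource wiring and variable values are illustrative, not the module's exact code:

```hcl
variable "network_id" { type = string }        # self link of the shared VPC
variable "sql_instance_name" { type = string }

# The reserved address CAN be made unique per cluster through its required
# name argument (value here is illustrative).
resource "google_compute_global_address" "private_ip_address" {
  name          = "${var.sql_instance_name}-psa-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = var.network_id
}

# The connection exposes no name argument: it is effectively identified by
# (network, service), so every cluster in the VPC resolves to one shared
# connection, and destroying any one cluster tears it down for the rest.
resource "google_service_networking_connection" "private_vpc_connection" {
  network                 = var.network_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_address.name]
}
```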
Steps to reproduce
Steps to reproduce the behavior:
Create two or more clusters in the same VPC and subnet using the `community/modules/database/slurm-cloudsql-federation` module. For example:
Expected behavior
Make `module.slurm-sql.google_service_networking_connection.private_vpc_connection` unique per cluster.
Actual behavior
If two or more clusters reside in the same VPC and subnet, they all use the same `module.slurm-sql.google_service_networking_connection.private_vpc_connection`, and when one of the clusters is deleted the `module.slurm-sql.google_service_networking_connection.private_vpc_connection` is also removed, breaking the remaining clusters' access to the Cloud SQL Slurm accounting database.
Version (`ghpc --version`)
ghpc version v1.26.1
Blueprint
If applicable, attach or paste the blueprint YAML used to produce the bug.