aws / aws-parallelcluster

AWS ParallelCluster is an AWS-supported, open-source cluster management tool for deploying and managing HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

New FSx DNS name does not automatically update in /etc/fstab #1766

Open · tennex-jack opened this issue 4 years ago

tennex-jack commented 4 years ago

Environment:

[aws]
aws_region_name = us-east-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = xxx
base_os = ubuntu1804
scheduler = slurm
max_queue_size = 20
maintain_initial_size = true
vpc_settings = slurm
compute_instance_type = c4.8xlarge
master_instance_type = r5a.4xlarge
#extra_json = {"cluster": {"ganglia_enabled": "yes"}}
fsx_settings = slurmfsx
initial_queue_size = 0
scaling_settings = slurm

[fsx slurmfsx]
# comment out storage_capacity and set fsx_fs_id when attaching an existing filesystem (see the sketch after this config)
shared_dir = /fsx
storage_capacity = 1200
#fsx_fs_id = <input fsx id>

[vpc slurm]
vpc_id = vpc-xxx
master_subnet_id = subnet-xxx
compute_subnet_id = subnet-xxx
use_public_ips = false

[scaling slurm]
scaledown_idletime = 15
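
For reference, a sketch of the attach-existing form of the [fsx] section that the comment above refers to (fs-xxxxxxxxxxxxxxxxx is a placeholder ID; note that shared_dir remains required in the [fsx] section, while storage_capacity applies only when creating a new filesystem):

[fsx slurmfsx]
# Attach an existing FSx Lustre filesystem instead of creating one.
# fs-xxxxxxxxxxxxxxxxx is a placeholder for the real filesystem ID.
shared_dir = /fsx
fsx_fs_id = fs-xxxxxxxxxxxxxxxxx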

Bug description and how to reproduce: The FSx Lustre file system mount entry is automatically added to /etc/fstab during initial cluster launch.

When making a change to an FSx Lustre file system that results in a replacement (e.g. altering the deployment_type), the file system itself is properly deleted and redeployed, but the corresponding mount entry in /etc/fstab on the cluster nodes is not updated to match.

In this example, the previous FSx file system ID was fs-0c72f1afba7504877, while the newly created one received a different ID (fs-054a3ce927a2ac593).

ubuntu@ip-10-0-30-157:~$ cat /etc/fstab
LABEL=cloudimg-rootfs   /        ext4   defaults,discard        0 0
UUID=8ec824a0-5eec-42a0-80be-9ecd13d6f38d /shared ext4 _netdev 0 0
fs-0c72f1afba7504877.fsx.us-east-1.amazonaws.com@tcp:/fsx /fsx lustre defaults,_netdev,flock,user_xattr,noatime,noauto,x-systemd.automount 0 0


I was not able to test whether newly launched compute nodes pick up the correct file system ID, but I did confirm that existing nodes were not updated when the FSx file system redeployed.

Let me know if there's any other information or test cases I can provide, thanks!

rexcsn commented 4 years ago

Hi @tennex-jack,

Thanks for reporting this issue. I assume you are changing the deployment_type of your FSx file system via pcluster update?

If so, unfortunately you are hitting some existing limitations of pcluster update. In general, pcluster update does not currently support updating filesystems. As you observed, the filesystem resources may be replaced, but the code that performs the filesystem mounting is not re-executed during pcluster update, so updating the cluster's filesystem does not work at the moment.

We understand that this behavior is not straightforward. We are working on better public documentation of pcluster update behavior and thinking about how to enhance pcluster update in the future.

Thank you!
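
To confirm that an existing node is still pointing at the old filesystem, a quick check along these lines works (a sketch; standard Linux tools, nothing ParallelCluster-specific):

# Compare what /etc/fstab records against what is actually mounted.
grep lustre /etc/fstab    # the DNS name written at cluster launch
mount -t lustre           # the Lustre filesystem currently mounted, if any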

tennex-jack commented 4 years ago

@rexcsn thanks for your reply!

Yes, that's exactly right. No problem at all; I can get around it with sed and some scripting, as sketched below.

I'm glad the team's aware; thanks for all the hard work that goes into maintaining such a great platform!
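
A minimal sketch of such a workaround, using the filesystem IDs from the report above and assuming the Lustre mount name (the path after @tcp:/) is unchanged; it would need to run on every affected node:

#!/bin/bash
# Hypothetical workaround: rewrite the stale FSx DNS name in /etc/fstab
# and remount. The IDs below are the ones from this report.
OLD_FS=fs-0c72f1afba7504877
NEW_FS=fs-054a3ce927a2ac593

sudo umount /fsx 2>/dev/null || true   # drop the stale mount if still attached
sudo sed -i "s/${OLD_FS}/${NEW_FS}/" /etc/fstab
sudo mount /fsx                        # remount via the rewritten entry

Caveat: changing deployment_type can also change the Lustre mount name embedded after @tcp:/ in the fstab entry, so it would be prudent to verify it with aws fsx describe-file-systems before relying on the sed alone.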

enrico-usai commented 4 years ago

I'm marking this as an enhancement. Thanks!

tennex-jack commented 4 years ago

Thanks so much!