aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

HeadNodeWaitCondition failure when FSx takes too long creating #4211

Closed by jsmedmar 2 years ago

jsmedmar commented 2 years ago

HeadNodeWaitCondition times out because the FSx file system takes too long to create when ImportPath points to an S3 bucket with a very large number of files (>4 million objects; FSx create time >1h). The HeadNodeWaitCondition does not time out when ImportPath points to a small S3 bucket (FSx create time <20 mins).

This works:

SharedStorage:
  - Name: my-fsx
    StorageType: FsxLustre
    MountDir: /my-dir
    FsxLustreSettings:
      StorageCapacity: 2400
      DeploymentType: SCRATCH_2
      StorageType: SSD
      DataCompressionType: LZ4
      ExportPath: s3://small-s3-bucket
      ImportPath: s3://small-s3-bucket

This doesn't (ImportPath points to a very large S3 bucket):

SharedStorage:
  - Name: my-fsx
    StorageType: FsxLustre
    MountDir: /mydir
    FsxLustreSettings:
      StorageCapacity: 2400
      DeploymentType: SCRATCH_2
      StorageType: SSD
      DataCompressionType: LZ4
      ExportPath: s3://very-large-s3-bucket-with-millions-of-objects
      ImportPath: s3://very-large-s3-bucket-with-millions-of-objects

Bug description and how to reproduce:

Create a cluster with an FSx file system whose ImportPath points to a very large S3 bucket.

If you are reporting issues about cluster creation failure or node failure:

fsx-24T-new-ami-logs-202207221356.tar.gz

hanwen-pcluste commented 2 years ago

Hi jsmedmar,

Cluster creation times out after 30 minutes. To work around this, you could create the FSx file system in the AWS console and use FileSystemId under SharedStorage to mount it to the cluster.
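As a rough sketch of that workaround, the SharedStorage entry could reference the pre-created file system instead of describing a new one. The ID `fs-0123456789abcdef0` is a hypothetical placeholder, and the Name and MountDir values are simply carried over from the configuration above; this assumes the file system already exists and is reachable from the cluster's network.

```yaml
SharedStorage:
  - Name: my-fsx
    StorageType: FsxLustre
    MountDir: /my-dir
    FsxLustreSettings:
      # Hypothetical ID of an FSx for Lustre file system created beforehand
      # (for example in the AWS console); replace with the real one.
      FileSystemId: fs-0123456789abcdef0
```

Because the file system is created outside the cluster stack, its long S3 import should no longer count against the cluster creation timeout.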

In a future release (pcluster 3.2), the node bootstrap timeout will be configurable.

Thank you, Hanwen

jsmedmar commented 2 years ago

Hanwen, thanks so much for the quick response.

For now what I'm doing is disabling the rollback. Even though the CloudFormation stack fails, the cluster works OK.

Looking forward to 3.2

lukeseawalker commented 2 years ago

Hello @jsmedmar, with ParallelCluster 3.2.0 we introduced the ability to customize the node bootstrap timeout as an experimental feature.

If you want to customize the timeouts, you can do so using the DevSettings section in your cluster configuration file, as follows:

DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 1234  # timeout in seconds
    ComputeNodeBootstrapTimeout: 1234  # timeout in seconds
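
For the case reported in this issue, where FSx creation takes more than an hour, a sketch might raise only the head node timeout. The value 7200 (two hours) is an illustrative guess rather than a recommended setting, and this assumes the two timeout keys can be set independently.

```yaml
DevSettings:
  Timeouts:
    # Illustrative value only: 2 hours, chosen to exceed the >1h FSx
    # create time described in this issue.
    HeadNodeBootstrapTimeout: 7200  # timeout in seconds
```

The right value depends on how long the S3 import actually takes for a given bucket, so it may need adjusting after a first attempt.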

jsmedmar commented 2 years ago

Perfect, will give it a try!

github-actions[bot] commented 2 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.