aws / aws-parallelcluster

AWS ParallelCluster is an AWS-supported, open-source cluster management tool for deploying and managing HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0

CloudWatch: costs more than compute. More fine grained config/pre-filtering needed #2162

Closed: brainstorm closed this issue 3 years ago

brainstorm commented 3 years ago

Required Info:

Bug description and how to reproduce:

The issue is simple: using the config details above, we instantiated a cluster that spends more on logging than on actual compute:

[Screenshot: Cost Explorer cost breakdown, 22 Oct 2020, showing CloudWatch costs exceeding compute]

Furthermore, the LogGroup containing those ~$25 worth of logs is gone, since the cluster's CloudFormation stack was deleted as part of the cluster shutdown.

So a couple of improvements that could help a lot:

1) Have a `cw_logs` section in the pcluster config that allows keeping the LogGroups across cluster lifecycles (we usually tear down the cluster when done with the job(s)).
2) Have some sort of CloudWatch-agent-related section/directives that let us choose which specific logfiles to watch and (perhaps) filter or rate-limit some contents before they get sent to AWS CW.

I guess that the latter could be done via some bootstrap fine-tuning, but still, it'd be useful to have it exposed in the config, I reckon.
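
For context on what such pre-filtering could look like today: ParallelCluster ships its logs via the unified CloudWatch agent, so a custom bootstrap script could in principle install a pared-down `collect_list` along the lines of the sketch below. This is only an illustration; the file paths, log group name and stream names are assumptions, not ParallelCluster defaults.

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/parallelcluster/clustermgtd",
            "log_group_name": "/parallelcluster/my-cluster",
            "log_stream_name": "{instance_id}.clustermgtd"
          },
          {
            "file_path": "/var/log/slurmctld.log",
            "log_group_name": "/parallelcluster/my-cluster",
            "log_stream_name": "{instance_id}.slurmctld"
          }
        ]
      }
    }
  }
}

Each `collect_list` entry maps one local file to a CloudWatch log group and stream, so removing an entry stops that file from being shipped at all.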

We'll have to disable CW logs for the time being, which is not ideal in terms of security/auditability of ParallelCluster, but log ingestion costing more than the actual compute is not good either :-S

/cc @reisingerf @alexiswl @victorskl

enrico-usai commented 3 years ago

Hi Roman,

thanks for your suggestions and for the self-explanatory screenshot.

Thanks

Enrico

brainstorm commented 3 years ago
> * In the shared configuration file you have multiple `[cluster]` sections so it's not clear to me if these costs are coming from a single cluster and which is the used configuration.

tothill cluster in this case, as mentioned above.

> * It would be useful to know the kind of job submission you're doing to understand if there is a corner case that is producing tons of logs (e.g. from slurm, /var/log/messages or syslog).

SLURM, just sbatch of either of the submit.sh scripts under https://github.com/UMCCR-RADIO-Lab/aws-parallelcluster-pilot

> * Could you create and share an archive of CloudWatch logs by following [this guide](https://github.com/aws/aws-parallelcluster/wiki/Creating-an-Archive-of-a-Cluster's-Logs) or share the larger files in your head node from [this list](https://docs.aws.amazon.com/parallelcluster/latest/ug/cloudwatch-logs.html)?

We'll consider doing that next time we instantiate a cluster, thanks for the reference!

> * How did you produce this screenshot?

Cost Explorer, filtered by @reisingerf

> Regarding the suggested improvements:
>
> * keep logs across cluster lifecycle:
>
>   * We decided to not preserve logGroups across cluster lifecycles because you can create a new cluster with the same name but with a totally different configuration (different IP ranges, different version of ParallelCluster, different OS or scheduler, different uploaded log files) so the number of log streams would keep growing, increasing costs but losing its meaning and usefulness.
>   * In any case it's already possible to retain the logs of a running cluster by executing [`pcluster delete --keep-logs ...`](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.delete.html) command and you can configure the retention time by using the `retention_days` configuration parameter in [`cw_logs`](https://docs.aws.amazon.com/parallelcluster/latest/ug/cw-log-section.html) section.
>
> * log file filters:
>
>   * I think it's a good idea, so I'm marking the issue as feature enhancement, thanks for the suggestion.

Thanks a lot for the pointers, I didn't know about `--keep-logs`. What you mention makes sense: logs cannot be correlated easily between instantiations, so they lose their analytical value. In any case I hope we can hit the issue (unless it's a simple Cost Explorer filtering hiccup; I just thought about that now, but I doubt it). /cc @reisingerf
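
For other readers, a minimal sketch of the two knobs mentioned here, with section and parameter names taken from the ParallelCluster 2.x docs linked above (the cluster name and retention value are just illustrative):

# pcluster config excerpt: reference a cw_log section and keep ingested logs for 14 days
[cluster tothill]
cw_log_settings = custom-cw

[cw_log custom-cw]
enable = true
retention_days = 14

And when tearing the cluster down, `pcluster delete --keep-logs tothill` retains the CloudWatch log group.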

enrico-usai commented 3 years ago

Thanks Roman,

> tothill cluster in this case, as mentioned above.

I asked for the [cluster] section used because sometimes the Cluster Name doesn't correspond to the [cluster] section. Thanks for the clarification.

> just sbatch of either of the submit.sh scripts under https://github.com/UMCCR-RADIO-Lab/aws-parallelcluster-pilot

Interesting, thanks for sharing.

> In any case I hope we can hit the issue...

If you're going to re-create the cluster (even without the CloudWatch feature enabled), please try to double-check the log files on your head node from [this list](https://docs.aws.amazon.com/parallelcluster/latest/ug/cloudwatch-logs.html). It would be very useful to see which files are growing too much. I'd pay attention to /var/log/messages, /var/log/syslog and /var/log/slurm* files.

Let us know if you have more information, we're very interested in understanding the root cause of this issue.

Thanks!

brainstorm commented 3 years ago

Thanks for the feedback. We just had another instantiation and found that SLURM generated 3.68 GB worth of CloudWatch logs in 3 days. Here is a snippet of what's going on, reporting status several times per minute:

2020-11-02 17:10:13,718 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf -- 2020-11-02 17:10:13,720 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:10:13,726 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:13,726 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:18,775 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:18,848 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:19,065 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:19,065 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:19,065 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:19,066 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:19,068 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:10:19,074 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:19,075 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:24,119 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:24,191 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:24,403 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:24,404 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:24,404 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:24,404 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:24,406 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:10:24,412 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:24,413 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:29,458 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:29,583 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:29,800 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:29,800 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:29,801 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:29,802 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:29,803 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:10:29,810 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:29,810 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:34,859 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:34,921 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:35,113 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:35,113 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:35,113 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:35,114 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:35,116 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:10:35,123 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:35,123 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:40,171 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:40,290 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:40,481 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:40,482 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:40,482 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:40,482 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:40,484 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:10:40,491 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:40,491 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:45,539 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:45,602 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:45,791 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:45,791 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:45,792 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:45,793 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:45,795 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:10:45,801 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:45,802 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:50,849 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:50,911 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:51,113 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:51,113 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:51,113 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:51,114 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:51,116 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:10:51,122 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:51,122 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:10:56,167 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:10:56,234 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:10:56,429 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:10:56,430 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:10:56,430 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:10:56,430 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:10:56,432 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:10:56,439 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:10:56,439 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:01,486 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:01,551 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:01,801 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:01,801 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:01,801 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:01,802 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:01,804 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:01,811 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:01,811 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:06,860 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:06,923 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:07,122 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:07,122 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:07,122 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:07,123 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:07,125 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:07,131 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:07,132 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:12,180 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:12,321 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:12,517 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:12,518 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:12,518 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:12,518 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:12,520 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:11:12,527 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:12,527 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:17,572 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:17,637 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:17,833 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:17,833 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:17,834 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:17,835 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:17,837 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:17,843 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:17,843 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:22,891 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:22,960 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:23,215 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:23,216 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:23,216 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:23,216 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:23,218 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:23,225 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:23,225 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:28,267 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:28,333 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:28,571 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:28,572 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:28,572 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:28,572 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:28,574 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:11:28,581 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:28,581 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:33,628 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:33,702 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:33,890 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:33,891 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:33,891 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:33,892 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:33,894 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:33,900 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:33,901 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:38,942 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:39,011 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:39,207 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:39,207 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:39,207 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:39,208 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:39,210 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:39,216 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:39,217 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:44,259 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:44,322 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:44,537 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:44,538 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:44,538 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:44,538 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:44,540 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:11:44,548 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:44,549 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:49,596 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:49,659 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:49,897 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:49,898 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:49,898 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:49,899 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:49,901 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:49,907 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:49,907 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:11:54,951 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:11:55,013 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:11:55,208 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:11:55,209 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:11:55,209 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:11:55,209 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:11:55,211 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:11:55,218 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:11:55,218 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:12:00,262 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:12:00,328 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:12:00,523 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:12:00,524 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:12:00,524 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:12:00,524 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:12:00,526 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:12:00,533 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:12:00,533 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:12:05,579 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:12:05,656 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:12:05,886 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:12:05,886 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:12:05,887 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:12:05,888 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:12:05,890 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:12:05,897 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:12:05,897 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:12:10,946 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:12:11,025 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:12:11,250 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:12:11,250 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:12:11,250 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:12:11,251 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:12:11,253 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 2020-11-02 17:12:11,259 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING 2020-11-02 17:12:11,259 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler 2020-11-02 17:12:16,306 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster 2020-11-02 17:12:16,378 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions 2020-11-02 17:12:16,633 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions 2020-11-02 17:12:16,633 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) [] 2020-11-02 17:12:16,633 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance 2020-11-02 17:12:16,634 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf 2020-11-02 17:12:16,636 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster... 
2020-11-02 17:12:16,642 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
demartinofra commented 3 years ago

Hi Roman,

this should not be happening; clustermgtd is configured by default to run once per minute.

If you still have it (you'll probably need to retrieve it from CloudWatch), can you paste the initial lines of the clustermgtd log file where the daemon config is reported (see the example below)? I'd like to double-check what value is being used for loop_time. Also, can you report the content of /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf?

Here is an example from a test cluster:

2020-10-26 17:04:06,940 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-10-26 17:04:06,941 INFO [clustermgtd:set_config] Applying new clustermgtd config: ClustermgtdConfig(_config=<configparser.ConfigParser object at 0x7f4fb36cc6d8>, region='us-east-2', cluster_name='centos7', dynamodb_table='parallelcluster-centos7', master_private_ip='10.0.0.196', master_hostname='ip-10-0-0-196.us-east-2.compute.internal', instance_name_type_mapping={'c4xlarge': 'c4.xlarge', 'c5xlarge': 'c5.xlarge'}, _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, loop_time=60, disable_all_cluster_management=False, heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', boto3_config=<botocore.config.Config object at 0x7f4fb1da4f60>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_clustermgtd_logging.conf', disable_ec2_health_check=False, disable_scheduled_event_health_check=False, health_check_timeout=180, disable_all_health_checks=False, launch_max_batch_size=500, update_node_address=True, terminate_max_batch_size=1000, node_replacement_timeout=3600, terminate_drain_nodes=True, terminate_down_nodes=True, orphaned_instance_timeout=120, hosted_zone='Z023123...', dns_domain='centos7.pcluster', use_private_hostname=False)
alexiswl commented 3 years ago

Cluster MGTD Configuration File

(base) [ec2-user@ip-10-2-1-15 ~]$ cat /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
[clustermgtd]
cluster_name = check-log-watcher-tothill
region = ap-southeast-2
proxy = NONE
heartbeat_file_path = /opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat
dynamodb_table = parallelcluster-check-log-watcher-tothill
hosted_zone = Z05042051WV9DSJ2UI80W
dns_domain = check-log-watcher-tothill.pcluster
use_private_hostname = false
master_private_ip = 10.2.1.15
master_hostname = ip-10-2-1-15.ap-southeast-2.compute.internal

Logs

Tail

(base) [ec2-user@ip-10-2-1-15 ~]$ cat /var/log/parallelcluster/clustermgtd | tail
2020-11-03 13:02:44,096 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2020-11-03 13:02:49,192 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2020-11-03 13:02:49,248 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2020-11-03 13:02:49,433 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2020-11-03 13:02:49,434 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2020-11-03 13:02:49,434 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2020-11-03 13:02:49,434 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-11-03 13:02:49,436 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2020-11-03 13:02:49,443 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2020-11-03 13:02:49,443 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler

Finding the bug:

I placed a suite of logging.info() commands inside the sleep_remaining_loop_time in /opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/common/utils.py.

This function is used by the clustermgtd service at /opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/clustermgtd.py.

Here's the new set of logging commands I've placed and their respective outputs:

Sleep Time Original

def sleep_remaining_loop_time(total_loop_time, loop_start_time=None):
    end_time = datetime.now()
    if not loop_start_time:
        loop_start_time = end_time
    # Always try to localize datetime objects to UTC if not previously localized
    # For example, datetime in tradition daemons are not localized(sqswatcher, jobwatcher, ...)
    # Daemons in HIT integration are required to use localized datetime
    # This operation DOES NOT convert datetime from 1 timezone into UTC
    # It simply replaces the timezone info if datetime is not localized
    if not end_time.tzinfo:
        end_time = end_time.replace(tzinfo=timezone.utc)
    if not loop_start_time.tzinfo:
        loop_start_time = loop_start_time.replace(tzinfo=timezone.utc)
    time_delta = (end_time - loop_start_time).total_seconds()
    if time_delta < total_loop_time:
        time.sleep(total_loop_time - time_delta)

Sleep Time With Logging interspersed

def sleep_remaining_loop_time(total_loop_time, loop_start_time=None):
    end_time = datetime.now()
    if not loop_start_time:
        loop_start_time = end_time
    # Always try to localize datetime objects to UTC if not previously localized
    # For example, datetime in tradition daemons are not localized(sqswatcher, jobwatcher, ...)
    # Daemons in HIT integration are required to use localized datetime
    # This operation DOES NOT convert datetime from 1 timezone into UTC
    # It simply replaces the timezone info if datetime is not localized
    if not end_time.tzinfo:
        end_time = end_time.replace(tzinfo=timezone.utc)
        ## DEBUG ##
        logging.info("1A: Could not find tzinfo on end time")
    if not loop_start_time.tzinfo:
        loop_start_time = loop_start_time.replace(tzinfo=timezone.utc)
        ## DEBUG ##
        logging.info("1B: Could not find tzinfo on loop start time")

    ## DEBUG ##
    logging.info("2A: end_time is: {}".format(end_time))
    logging.info("2B: loop_start_time is: {}".format(loop_start_time))

    time_delta = (end_time - loop_start_time).total_seconds()
    ## DEBUG ##
    logging.info("3: This is the time diff: {}".format(time_delta))
    ## DEBUG ##
    logging.info("4: This is the total loop time: {}".format(total_loop_time))
    if time_delta < total_loop_time:
        time.sleep(total_loop_time - time_delta)

I then restarted the supervisord daemon and watched the logs.

$ sudo systemctl restart supervisord
$ cat /var/log/parallelcluster/clustermgtd | tail -n20
2020-11-03 13:57:16,771 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-11-03 13:57:16,772 INFO [clustermgtd:set_config] Applying new clustermgtd config: ClustermgtdConfig(_config=<configparser.ConfigParser object at 0x7ff692a51d30>, region='ap-southeast-2', cluster_name='check-log-watcher', dynamodb_table='parallelcluster-check-log-watcher', master_private_ip='10.2.1.113', master_hostname='ip-10-2-1-113.ap-southeast-2.compute.internal', instance_name_type_mapping={'c54xlarge': 'c5.4xlarge', 'm54xlarge': 'm5.4xlarge', 'm5large': 'm5.large'}, _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, loop_time=60, disable_all_cluster_management=False, heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', boto3_config=<botocore.config.Config object at 0x7ff6911556d8>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_clustermgtd_logging.conf', disable_ec2_health_check=False, disable_scheduled_event_health_check=False, health_check_timeout=180, disable_all_health_checks=False, launch_max_batch_size=500, update_node_address=True, terminate_max_batch_size=1000, node_replacement_timeout=3600, terminate_drain_nodes=True, terminate_down_nodes=True, orphaned_instance_timeout=120, hosted_zone='Z016910132AE5GRPHOGCV', dns_domain='check-log-watcher.pcluster', use_private_hostname=False)
2020-11-03 13:57:16,859 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-11-03 13:57:16,861 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2020-11-03 13:57:16,911 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2020-11-03 13:57:16,912 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2020-11-03 13:57:22,011 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2020-11-03 13:57:22,161 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2020-11-03 13:57:22,363 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2020-11-03 13:57:22,363 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2020-11-03 13:57:22,364 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2020-11-03 13:57:22,364 - [root:sleep_remaining_loop_time] - INFO - 1A: Could not find tzinfo on end time
2020-11-03 13:57:22,364 - [root:sleep_remaining_loop_time] - INFO - 2A: end_time is: 2020-11-03 13:57:22.364456+00:00
2020-11-03 13:57:22,364 - [root:sleep_remaining_loop_time] - INFO - 2B: loop_start_time is: 2020-11-03 02:57:16.859671+00:00
2020-11-03 13:57:22,364 - [root:sleep_remaining_loop_time] - INFO - 3: This is the time diff: 39605.504785
2020-11-03 13:57:22,364 - [root:sleep_remaining_loop_time] - INFO - 4: This is the total loop time: 60
2020-11-03 13:57:22,364 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-11-03 13:57:22,366 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2020-11-03 13:57:22,374 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2020-11-03 13:57:22,374 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler

It appears that end_time and loop_start_time are defined with different time-zones.

loop_start_time is defined in /opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/clustermgtd.py as start_time = datetime.now(tz=timezone.utc) but end_time is defined as datetime.now() at the top of the sleep_remaining_loop_time function.
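
To make the mismatch concrete, here is a minimal, self-contained sketch (not ParallelCluster code) that redoes the arithmetic with the two timestamps from the debug output above. The naive local time (AEDT, UTC+11) gets relabelled as UTC, so the computed delta is roughly the 11-hour offset; it is always larger than the 60-second loop time, so the sleep is skipped entirely:

from datetime import datetime, timezone

# loop_start_time as clustermgtd passes it in: a real, timezone-aware UTC timestamp
loop_start_time = datetime(2020, 11, 3, 2, 57, 16, 859671, tzinfo=timezone.utc)

# end_time as sleep_remaining_loop_time builds it: naive local wall-clock time (13:57 AEDT)...
end_time = datetime(2020, 11, 3, 13, 57, 22, 364456)
# ...which is then stamped as UTC without any conversion
end_time = end_time.replace(tzinfo=timezone.utc)

time_delta = (end_time - loop_start_time).total_seconds()
print(time_delta)        # 39605.504785 -- matches the "time diff" in the log above
print(time_delta < 60)   # False, so time.sleep() never runs and the loop re-runs immediately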

Setting end_time to datetime.now(tz=timezone.utc) at the top of the sleep_remaining_loop_time function appears to have resolved the issue.

$ sudo systemctl restart supervisord
$ cat /var/log/parallelcluster/clustermgtd | tail -n20
2020-11-03 14:01:52,436 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2020-11-03 14:01:52,436 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2020-11-03 14:01:52,437 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2020-11-03 14:01:52,437 - [root:sleep_remaining_loop_time] - INFO - 2A: end_time is: 2020-11-03 03:01:52.437473+00:00
2020-11-03 14:01:52,437 - [root:sleep_remaining_loop_time] - INFO - 2B: loop_start_time is: 2020-11-03 03:01:46.954041+00:00
2020-11-03 14:01:52,437 - [root:sleep_remaining_loop_time] - INFO - 3: This is the time diff: 5.483432
2020-11-03 14:01:52,437 - [root:sleep_remaining_loop_time] - INFO - 4: This is the total loop time: 60
2020-11-03 14:02:47,000 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-11-03 14:02:47,002 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2020-11-03 14:02:47,009 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2020-11-03 14:02:47,009 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2020-11-03 14:02:52,109 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2020-11-03 14:02:52,225 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2020-11-03 14:02:52,467 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2020-11-03 14:02:52,467 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2020-11-03 14:02:52,468 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2020-11-03 14:02:52,468 - [root:sleep_remaining_loop_time] - INFO - 2A: end_time is: 2020-11-03 03:02:52.468372+00:00
2020-11-03 14:02:52,468 - [root:sleep_remaining_loop_time] - INFO - 2B: loop_start_time is: 2020-11-03 03:02:47.000190+00:00
2020-11-03 14:02:52,468 - [root:sleep_remaining_loop_time] - INFO - 3: This is the time diff: 5.468182
2020-11-03 14:02:52,468 - [root:sleep_remaining_loop_time] - INFO - 4: This is the total loop time: 60
alexiswl commented 3 years ago

Update

Having a look at the other scripts that use sleep_remaining_loop_time, there appear to be some inconsistencies.

The following scripts define their loop_start_time input parameter with just datetime.now():

/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/jobwatcher/jobwatcher.py
/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/nodewatcher/nodewatcher.py
/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/sqswatcher/sqswatcher.py

While these ones define their loop_start_time input parameter with datetime.now(tz=timezone.utc):

/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/computemgtd.py
/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/clustermgtd.py

Hence, updating sleep_remaining_loop_time itself may have unintended side effects. Instead, I propose using just datetime.now() rather than datetime.now(tz=timezone.utc) inside the /opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/computemgtd.py and /opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/clustermgtd.py scripts.
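
For completeness, a minimal sketch (an illustration only, not the fix that was eventually merged) of how sleep_remaining_loop_time itself could be made safe for both naive and aware callers, by converting naive local timestamps instead of relabelling them as UTC:

import time
from datetime import datetime, timezone

def sleep_remaining_loop_time(total_loop_time, loop_start_time=None):
    # Work entirely with timezone-aware UTC datetimes.
    end_time = datetime.now(tz=timezone.utc)
    if loop_start_time is None:
        loop_start_time = end_time
    elif loop_start_time.tzinfo is None:
        # Naive callers (jobwatcher, nodewatcher, sqswatcher) pass local wall-clock time:
        # interpret it through the local UTC offset rather than stamping it as UTC.
        loop_start_time = loop_start_time.astimezone(timezone.utc)
    time_delta = (end_time - loop_start_time).total_seconds()
    if 0 <= time_delta < total_loop_time:
        time.sleep(total_loop_time - time_delta)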

demartinofra commented 3 years ago

Alexis, thanks for the very thorough and accurate analysis!

I can confirm that I was able to reproduce the issue on our side. As you pointed out, the culprit is the inconsistency in the timezones used to read the start and end time of the loop, which had been masked by the fact that all our AMIs have the timezone set to UTC by default. Since you are setting the timezone to a different one ("Australia/Melbourne"), which is a totally legitimate operation, the issue is triggered.

I will mark this as a bug and we will soon merge a fix that will be part of the upcoming release.

Thanks again for reporting this and doing most of the work required to identify and fix the root cause!

Francesco

enrico-usai commented 3 years ago

The 2.10.0 release is out. Thanks for reporting this and for the detailed analysis.