sharif-cameco closed 1 year ago
I believe the log message you reported is a spurious error. A user named packer is created while building the Slurm image, and it is mostly, but not completely, removed before the image itself is created. The message comes from the GCE Guest Agent finishing the removal of that user and finding that certain directories are already missing. It should not affect Slurm boot.
That said, your machine is not joining the pool, so I suggest taking a more extensive look at its logs. This command will show you the startup script logs:
gcloud logging --project prj-n-005-cloudops-618d read 'logName="projects/prj-n-005-cloudops-618d/logs/GCEMetadataScripts" AND resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 48h | tac
This will show all logs associated with the VM:
gcloud logging --project prj-n-005-cloudops-618d read 'resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 48h | tac
Consider changing 48h to a value appropriate to when the machine was active.
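If you end up re-running these queries with different time windows, a small helper can assemble the command for you. This is only a sketch: the function name `build_log_command` and its parameters are hypothetical, and it does nothing more than reproduce the two gcloud invocations shown above with the project, instance ID, and freshness made configurable.

```python
import shlex


def build_log_command(project, instance_id, freshness="48h", startup_only=False):
    """Return the gcloud argv list for reading a VM's logs.

    Hypothetical helper: it mirrors the two commands quoted above and
    does not call gcloud itself.
    """
    log_filter = f'resource.labels.instance_id="{instance_id}"'
    if startup_only:
        # Restrict the query to the startup script log stream.
        log_name = f"projects/{project}/logs/GCEMetadataScripts"
        log_filter = f'logName="{log_name}" AND {log_filter}'
    return [
        "gcloud", "logging", "--project", project, "read", log_filter,
        "--format=table(timestamp, jsonPayload.message)",
        "--freshness", freshness,
    ]


if __name__ == "__main__":
    argv = build_log_command(
        "prj-n-005-cloudops-618d",
        "3405041953608146457",
        freshness="24h",
        startup_only=True,
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv) + " | tac")
```

Adjust `freshness` to cover the period when the machine was active.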
The packer user error has previously been reported to SchedMD (who publish the Slurm image used by your tutorial) and they anticipate resolving it in a near-term release.
issharif_c@cloudshell:~ (prj-n-005-cloudops-618d)$ gcloud logging --project prj-n-005-cloudops-618d read 'logName="projects/prj-n-005-cloudops-618d/logs/GCEMetadataScripts" AND resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 24h | tac
MESSAGE: Starting startup scripts (version 20220713.00). TIMESTAMP: 2023-07-13T16:06:36.505847771Z
MESSAGE: Found startup-script in metadata. TIMESTAMP: 2023-07-13T16:06:36.527522434Z
MESSAGE: startup-script: ping -q -w1 -c1 metadata.google.internal TIMESTAMP: 2023-07-13T16:06:36.542862802Z
MESSAGE: startup-script: Successfully contacted metadata server TIMESTAMP: 2023-07-13T16:06:36.583101311Z
MESSAGE: startup-script: ping -q -w1 -c1 8.8.8.8 TIMESTAMP: 2023-07-13T16:06:36.583551646Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:37.588137365Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:40.624263840Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:43.627927353Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:46.631542202Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:49.635413099Z
MESSAGE: startup-script: No internet access detected TIMESTAMP: 2023-07-13T16:06:49.635466895Z
MESSAGE: startup-script: curl: (22) The requested URL returned error: 404 Not Found TIMESTAMP: 2023-07-13T16:06:49.707748850Z
MESSAGE: startup-script: hpcsmall-slurm-devel not found in project metadata, skipping script update TIMESTAMP: 2023-07-13T16:06:49.708746575Z
MESSAGE: startup-script: running python cluster setup script TIMESTAMP: 2023-07-13T16:06:49.710040510Z
MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0 TIMESTAMP: 2023-07-13T16:06:52.116983239Z
MESSAGE: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml TIMESTAMP: 2023-07-13T16:06:52.206827654Z
MESSAGE: startup-script: WARNING:main:/slurm/scripts/config.yaml not found TIMESTAMP: 2023-07-13T16:06:52.207785960Z
MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0 TIMESTAMP: 2023-07-13T16:06:52.628300456Z
MESSAGE: startup-script: ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/hpcsmall-slurm-devel TIMESTAMP: 2023-07-13T16:06:52.853972964Z
MESSAGE: startup-script: INFO: Setting up compute TIMESTAMP: 2023-07-13T16:06:52.857948832Z
MESSAGE: startup-script: INFO: installing custom scripts: TIMESTAMP: 2023-07-13T16:06:52.861560091Z
MESSAGE: startup-script: INFO: Set up network storage TIMESTAMP: 2023-07-13T16:06:52.897747198Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) 10.165.1.2:/nfsshare to /home TIMESTAMP: 2023-07-13T16:06:52.897797956Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/usr/local/etc/slurm to /usr/local/etc/slurm TIMESTAMP: 2023-07-13T16:06:52.897816387Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/etc/munge to /etc/munge TIMESTAMP: 2023-07-13T16:06:52.897829486Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/opt/apps to /opt/apps TIMESTAMP: 2023-07-13T16:06:52.897856954Z
MESSAGE: startup-script: DEBUG:
MESSAGE: startup-script: Traceback (most recent call last): TIMESTAMP: 2023-07-13T16:06:52.946422935Z
MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/__init__.py", line 15, in <module>
MESSAGE: startup-script: from .prometheus import PrometheusMetrics TIMESTAMP: 2023-07-13T16:06:52.946450616Z
MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
MESSAGE: startup-script: import prometheus_client # pylint: disable=import-error TIMESTAMP: 2023-07-13T16:06:52.946475497Z
MESSAGE: startup-script: ModuleNotFoundError: No module named 'prometheus_client' TIMESTAMP: 2023-07-13T16:06:52.946488674Z
MESSAGE: startup-script: INFO: Waiting for '/home' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.053302125Z
MESSAGE: startup-script: INFO: Waiting for '/usr/local/etc/slurm' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.056804451Z
MESSAGE: startup-script: INFO: Waiting for '/etc/munge' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.063098357Z
MESSAGE: startup-script: INFO: Waiting for '/opt/apps' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.067950750Z
MESSAGE: startup-script: INFO: Mount point '/opt/apps' was mounted. TIMESTAMP: 2023-07-13T16:06:53.351928114Z
MESSAGE: startup-script: INFO: Mount point '/etc/munge' was mounted. TIMESTAMP: 2023-07-13T16:06:53.356839746Z
MESSAGE: startup-script: INFO: Mount point '/usr/local/etc/slurm' was mounted. TIMESTAMP: 2023-07-13T16:06:53.362189366Z
MESSAGE: startup-script: INFO: Mount point '/home' was mounted. TIMESTAMP: 2023-07-13T16:06:56.358536552Z
issharif_c@cloudshell:~ (prj-n-005-cloudops-618d)$
I also ran the following command:
gcloud logging --project prj-n-005-cloudops-618d read 'resource.labels.instance_id="3405041953608146457"' --format="table(timestamp, jsonPayload.message)" --freshness 24h | tac
The output is too long to post in full, so I am including the parts that I think may contain a clue for you.
NB: the VM has internet access, but ICMP is blocked.
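Because the startup script decides "No internet access detected" from a ping to 8.8.8.8, that message is expected on a network that blocks ICMP. To confirm connectivity without ICMP, a TCP-based check is one option. The sketch below is my own illustration, not what the Slurm startup script does; the function name `has_tcp_connectivity` is hypothetical, and it assumes outbound TCP to 8.8.8.8 port 53 is permitted by your firewall rules.

```python
import socket


def has_tcp_connectivity(host="8.8.8.8", port=53, timeout=1.0,
                         connect=socket.create_connection):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Assumption: the target accepts TCP on this port and egress to it is
    allowed. `connect` is injectable so the check can be exercised
    without touching the network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
    conn.close()
    return True
```

Calling `has_tcp_connectivity()` on the compute node would then report reachability over TCP rather than ICMP.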
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 systemd-hostnamed: Changed static host name to 'hpcsmall-debug-ghpc-0' TIMESTAMP: 2023-07-13T16:06:33.900240788Z
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 NetworkManager[497]:
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 NetworkManager[497]:
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:5 'hostname': new request (4 scripts) TIMESTAMP: 2023-07-13T16:06:33.900241269Z
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 systemd-hostnamed: Changed host name to 'hpcsmall-debug-ghpc-0' TIMESTAMP: 2023-07-13T16:06:33.900241388Z
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:6 'hostname': new request (4 scripts) TIMESTAMP: 2023-07-13T16:06:33.900241541Z
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:4 'connectivity-change': start running ordered scripts... TIMESTAMP: 2023-07-13T16:06:33.900241622Z
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:5 'hostname': start running ordered scripts... TIMESTAMP: 2023-07-13T16:06:33.900241739Z
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 nm-dispatcher: req:6 'hostname': start running ordered scripts... TIMESTAMP: 2023-07-13T16:06:33.900241849Z
MESSAGE: Jul 13 16:06:30 hpcsmall-debug-ghpc-0 network: Bringing up loopback interface: [ OK ] TIMESTAMP: 2023-07-13T16:06:33.900241959Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 network: Bringing up interface eth0: [ OK ] TIMESTAMP: 2023-07-13T16:06:33.900242069Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started LSB: Bring up/down networking. TIMESTAMP: 2023-07-13T16:06:33.900242176Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Reached target Network. TIMESTAMP: 2023-07-13T16:06:33.900242301Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Dynamic System Tuning Daemon... TIMESTAMP: 2023-07-13T16:06:33.900242407Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Reached target Network is Online. TIMESTAMP: 2023-07-13T16:06:33.900242519Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting System Logging Service... TIMESTAMP: 2023-07-13T16:06:33.900242630Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Cloud Ops Agent - Logging Agent... TIMESTAMP: 2023-07-13T16:06:33.900242748Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Cloud Ops Agent - Metrics Agent... TIMESTAMP: 2023-07-13T16:06:33.900242874Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting NFS Mount Daemon... TIMESTAMP: 2023-07-13T16:06:33.900242981Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting NFS status monitor for NFSv2/3 locking.... TIMESTAMP: 2023-07-13T16:06:33.900243099Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started Google OSConfig Agent. TIMESTAMP: 2023-07-13T16:06:33.900243272Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Reached target Login Prompts. TIMESTAMP: 2023-07-13T16:06:33.900243429Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Postfix Mail Transport Agent... TIMESTAMP: 2023-07-13T16:06:33.900243536Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting MUNGE authentication service... TIMESTAMP: 2023-07-13T16:06:33.900243656Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Compute Engine Guest Agent... TIMESTAMP: 2023-07-13T16:06:33.900243863Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 rpc.statd[887]: Version 1.3.0 starting TIMESTAMP: 2023-07-13T16:06:33.900243985Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 rpc.statd[887]: Flags: TI-RPC TIMESTAMP: 2023-07-13T16:06:33.900244156Z
MESSAGE: GCE Agent Started (version 20220713.00) TIMESTAMP: 2023-07-13T16:06:34.298349809Z
MESSAGE: Instance ID changed, running first-boot actions TIMESTAMP: 2023-07-13T16:06:34.557871751Z
MESSAGE: OSConfig Agent (version 20220824.00-g1.el7) started. TIMESTAMP: 2023-07-13T16:06:34.728212732Z
MESSAGE: Enabling OS Login TIMESTAMP: 2023-07-13T16:06:34.946793779Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0-57.el7_9.3" x-pid="872" x-info="http://www.rsyslog.com"] start TIMESTAMP: 2023-07-13T16:06:35.131890351Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started System Logging Service. TIMESTAMP: 2023-07-13T16:06:35.131896822Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Starting Google Compute Engine Shutdown Scripts... TIMESTAMP: 2023-07-13T16:06:35.131897135Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 munged: munged: Error: Failed to check keyfile "/etc/munge/munge.key": No such file or directory TIMESTAMP: 2023-07-13T16:06:35.131897340Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: munge.service: control process exited, code=exited status=1 TIMESTAMP: 2023-07-13T16:06:35.131897513Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Failed to start MUNGE authentication service. TIMESTAMP: 2023-07-13T16:06:35.131897708Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Unit munge.service entered failed state. TIMESTAMP: 2023-07-13T16:06:35.131897871Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: munge.service failed. TIMESTAMP: 2023-07-13T16:06:35.131898049Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 systemd: Started Google Compute Engine Shutdown Scripts. TIMESTAMP: 2023-07-13T16:06:35.131898183Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: 2023/07/13 16:06:31 Built-in config: TIMESTAMP: 2023-07-13T16:06:35.131898414Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: logging: TIMESTAMP: 2023-07-13T16:06:35.131898605Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: TIMESTAMP: 2023-07-13T16:06:35.131898779Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: syslog: TIMESTAMP: 2023-07-13T16:06:35.131898951Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files TIMESTAMP: 2023-07-13T16:06:35.131899127Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths: TIMESTAMP: 2023-07-13T16:06:35.131899259Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/messages TIMESTAMP: 2023-07-13T16:06:35.131899385Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/syslog TIMESTAMP: 2023-07-13T16:06:35.131899534Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service: TIMESTAMP: 2023-07-13T16:06:35.131899641Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines: TIMESTAMP: 2023-07-13T16:06:35.131899738Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline: TIMESTAMP: 2023-07-13T16:06:35.131899879Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [syslog] TIMESTAMP: 2023-07-13T16:06:35.131899996Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics: TIMESTAMP: 2023-07-13T16:06:35.131900137Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: TIMESTAMP: 2023-07-13T16:06:35.131900281Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: hostmetrics: TIMESTAMP: 2023-07-13T16:06:35.131900404Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: hostmetrics TIMESTAMP: 2023-07-13T16:06:35.131900530Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: collection_interval: 60s TIMESTAMP: 2023-07-13T16:06:35.131900706Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors: TIMESTAMP: 2023-07-13T16:06:35.131900819Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_filter: TIMESTAMP: 2023-07-13T16:06:35.131900986Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: exclude_metrics TIMESTAMP: 2023-07-13T16:06:35.131901164Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_pattern: [] TIMESTAMP: 2023-07-13T16:06:35.131901276Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service: TIMESTAMP: 2023-07-13T16:06:35.131901424Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines: TIMESTAMP: 2023-07-13T16:06:35.131901611Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline: TIMESTAMP: 2023-07-13T16:06:35.131901724Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [hostmetrics] TIMESTAMP: 2023-07-13T16:06:35.131901839Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors: [metrics_filter] TIMESTAMP: 2023-07-13T16:06:35.131901936Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: 2023/07/13 16:06:31 Built-in config: TIMESTAMP: 2023-07-13T16:06:35.131902059Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: logging: TIMESTAMP: 2023-07-13T16:06:35.131902184Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: TIMESTAMP: 2023-07-13T16:06:35.131902272Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: syslog: TIMESTAMP: 2023-07-13T16:06:35.131902379Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files TIMESTAMP: 2023-07-13T16:06:35.131902481Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths: TIMESTAMP: 2023-07-13T16:06:35.131902591Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/messages TIMESTAMP: 2023-07-13T16:06:35.131902706Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/syslog TIMESTAMP: 2023-07-13T16:06:35.131902820Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service: TIMESTAMP: 2023-07-13T16:06:35.131902940Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines: TIMESTAMP: 2023-07-13T16:06:35.131903045Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline: TIMESTAMP: 2023-07-13T16:06:35.131903160Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [syslog] TIMESTAMP: 2023-07-13T16:06:35.131903267Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics: TIMESTAMP: 2023-07-13T16:06:35.131903369Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: TIMESTAMP: 2023-07-13T16:06:35.131903479Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: hostmetrics: TIMESTAMP: 2023-07-13T16:06:35.131903574Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: hostmetrics TIMESTAMP: 2023-07-13T16:06:35.131903706Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: collection_interval: 60s TIMESTAMP: 2023-07-13T16:06:35.131903826Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors: TIMESTAMP: 2023-07-13T16:06:35.131903969Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_filter: TIMESTAMP: 2023-07-13T16:06:35.131904082Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: exclude_metrics TIMESTAMP: 2023-07-13T16:06:35.131904193Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: metrics_pattern: [] TIMESTAMP: 2023-07-13T16:06:35.131904289Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: service: TIMESTAMP: 2023-07-13T16:06:35.131904386Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: pipelines: TIMESTAMP: 2023-07-13T16:06:35.131904489Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: default_pipeline: TIMESTAMP: 2023-07-13T16:06:35.131904596Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: [hostmetrics] TIMESTAMP: 2023-07-13T16:06:35.131904715Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: processors: [metrics_filter] TIMESTAMP: 2023-07-13T16:06:35.131904851Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: 2023/07/13 16:06:31 Merged config: TIMESTAMP: 2023-07-13T16:06:35.131904943Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: logging: TIMESTAMP: 2023-07-13T16:06:35.131905047Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: receivers: TIMESTAMP: 2023-07-13T16:06:35.131905155Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: mysql_error: TIMESTAMP: 2023-07-13T16:06:35.131905308Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: mysql_error TIMESTAMP: 2023-07-13T16:06:35.131905440Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: mysql_general: TIMESTAMP: 2023-07-13T16:06:35.131905587Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: mysql_general TIMESTAMP: 2023-07-13T16:06:35.131906112Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: mysql_slow: TIMESTAMP: 2023-07-13T16:06:35.131906212Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: mysql_slow TIMESTAMP: 2023-07-13T16:06:35.131906361Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: slurm_resume: TIMESTAMP: 2023-07-13T16:06:35.131906524Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files TIMESTAMP: 2023-07-13T16:06:35.131906624Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths: TIMESTAMP: 2023-07-13T16:06:35.131906751Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/slurm/resume.log TIMESTAMP: 2023-07-13T16:06:35.131906892Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: slurm_suspend: TIMESTAMP: 2023-07-13T16:06:35.131907046Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files TIMESTAMP: 2023-07-13T16:06:35.131907193Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: include_paths: TIMESTAMP: 2023-07-13T16:06:35.131907618Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: - /var/log/slurm/suspend.log TIMESTAMP: 2023-07-13T16:06:35.131907726Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: slurm_sync: TIMESTAMP: 2023-07-13T16:06:35.131907876Z
MESSAGE: Jul 13 16:06:31 hpcsmall-debug-ghpc-0 google_cloud_ops_agent_engine: type: files TIMESTAMP: 2023-07-13T16:06:35.131908001Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:40.624263840Z
MESSAGE: Jul 13 16:06:40 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:40.624930279Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:43.627927353Z
MESSAGE: Jul 13 16:06:43 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:43.628501787Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:46.631542202Z
MESSAGE: Jul 13 16:06:46 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:46.632214565Z
MESSAGE: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:49.635413099Z
MESSAGE: startup-script: No internet access detected TIMESTAMP: 2023-07-13T16:06:49.635466895Z
MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: failed to ping Google DNS, will retry TIMESTAMP: 2023-07-13T16:06:49.636191831Z
MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: No internet access detected TIMESTAMP: 2023-07-13T16:06:49.636194602Z
MESSAGE: startup-script: curl: (22) The requested URL returned error: 404 Not Found TIMESTAMP: 2023-07-13T16:06:49.707748850Z
MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: curl: (22) The requested URL returned error: 404 Not Found TIMESTAMP: 2023-07-13T16:06:49.708601135Z
MESSAGE: startup-script: hpcsmall-slurm-devel not found in project metadata, skipping script update TIMESTAMP: 2023-07-13T16:06:49.708746575Z
MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: hpcsmall-slurm-devel not found in project metadata, skipping script update TIMESTAMP: 2023-07-13T16:06:49.709074158Z
MESSAGE: startup-script: running python cluster setup script TIMESTAMP: 2023-07-13T16:06:49.710040510Z
MESSAGE: Jul 13 16:06:49 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: running python cluster setup script TIMESTAMP: 2023-07-13T16:06:49.710338610Z
MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0 TIMESTAMP: 2023-07-13T16:06:52.116983239Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0 TIMESTAMP: 2023-07-13T16:06:52.117566982Z
MESSAGE: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml TIMESTAMP: 2023-07-13T16:06:52.206827654Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml TIMESTAMP: 2023-07-13T16:06:52.207516329Z
MESSAGE: startup-script: WARNING:main:/slurm/scripts/config.yaml not found TIMESTAMP: 2023-07-13T16:06:52.207785960Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: WARNING:main:/slurm/scripts/config.yaml not found TIMESTAMP: 2023-07-13T16:06:52.208123517Z
MESSAGE: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0 TIMESTAMP: 2023-07-13T16:06:52.628300456Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0 TIMESTAMP: 2023-07-13T16:06:52.628872672Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 wall[1596]: wall: user root broadcasted 1 lines (64 chars) TIMESTAMP: 2023-07-13T16:06:52.834119224Z
MESSAGE: startup-script: ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/hpcsmall-slurm-devel TIMESTAMP: 2023-07-13T16:06:52.853972964Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR: Error while getting metadata from http://metadata.google.internal/computeMetadata/v1/project/attributes/hpcsmall-slurm-devel TIMESTAMP: 2023-07-13T16:06:52.854476176Z
MESSAGE: startup-script: INFO: Setting up compute TIMESTAMP: 2023-07-13T16:06:52.857948832Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up compute TIMESTAMP: 2023-07-13T16:06:52.858322163Z
MESSAGE: startup-script: INFO: installing custom scripts: TIMESTAMP: 2023-07-13T16:06:52.861560091Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: installing custom scripts: TIMESTAMP: 2023-07-13T16:06:52.861886268Z
MESSAGE: startup-script: INFO: Set up network storage TIMESTAMP: 2023-07-13T16:06:52.897747198Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) 10.165.1.2:/nfsshare to /home TIMESTAMP: 2023-07-13T16:06:52.897797956Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/usr/local/etc/slurm to /usr/local/etc/slurm TIMESTAMP: 2023-07-13T16:06:52.897816387Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/etc/munge to /etc/munge TIMESTAMP: 2023-07-13T16:06:52.897829486Z
MESSAGE: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/opt/apps to /opt/apps TIMESTAMP: 2023-07-13T16:06:52.897856954Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Set up network storage TIMESTAMP: 2023-07-13T16:06:52.898968986Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) 10.165.1.2:/nfsshare to /home TIMESTAMP: 2023-07-13T16:06:52.898971584Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/usr/local/etc/slurm to /usr/local/etc/slurm TIMESTAMP: 2023-07-13T16:06:52.898971871Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/etc/munge to /etc/munge TIMESTAMP: 2023-07-13T16:06:52.898972108Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Setting up mount (nfs) hpcsmall-controller:/opt/apps to /opt/apps TIMESTAMP: 2023-07-13T16:06:52.898972441Z
MESSAGE: startup-script: DEBUG:
MESSAGE: startup-script: Traceback (most recent call last): TIMESTAMP: 2023-07-13T16:06:52.946422935Z
MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/__init__.py", line 15, in <module>
MESSAGE: startup-script: from .prometheus import PrometheusMetrics TIMESTAMP: 2023-07-13T16:06:52.946450616Z
MESSAGE: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
MESSAGE: startup-script: import prometheus_client # pylint: disable=import-error TIMESTAMP: 2023-07-13T16:06:52.946475497Z
MESSAGE: startup-script: ModuleNotFoundError: No module named 'prometheus_client' TIMESTAMP: 2023-07-13T16:06:52.946488674Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: DEBUG:
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: Traceback (most recent call last): TIMESTAMP: 2023-07-13T16:06:52.947804992Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/__init__.py", line 15, in <module>
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: from .prometheus import PrometheusMetrics TIMESTAMP: 2023-07-13T16:06:52.947805470Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: File "/usr/local/lib/python3.6/site-packages/more_executors/_impl/metrics/prometheus.py", line 3, in <module>
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: import prometheus_client # pylint: disable=import-error TIMESTAMP: 2023-07-13T16:06:52.947805920Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ModuleNotFoundError: No module named 'prometheus_client' TIMESTAMP: 2023-07-13T16:06:52.947806110Z
MESSAGE: startup-script: INFO: Waiting for '/home' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.053302125Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/home' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.053821404Z
MESSAGE: startup-script: INFO: Waiting for '/usr/local/etc/slurm' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.056804451Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/usr/local/etc/slurm' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.057254745Z
MESSAGE: startup-script: INFO: Waiting for '/etc/munge' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.063098357Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/etc/munge' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.063471884Z
MESSAGE: startup-script: INFO: Waiting for '/opt/apps' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.067950750Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Waiting for '/opt/apps' to be mounted... TIMESTAMP: 2023-07-13T16:06:53.068295350Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: FS-Cache: Loaded TIMESTAMP: 2023-07-13T16:06:53.141143386Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: FS-Cache: Netfs 'nfs' registered for caching TIMESTAMP: 2023-07-13T16:06:53.220549264Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: Key type dns_resolver registered TIMESTAMP: 2023-07-13T16:06:53.259145310Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: NFS: Registering the id_resolver key type TIMESTAMP: 2023-07-13T16:06:53.292165653Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: Key type id_resolver registered TIMESTAMP: 2023-07-13T16:06:53.292169528Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 kernel: Key type id_legacy registered TIMESTAMP: 2023-07-13T16:06:53.297774099Z
MESSAGE: startup-script: INFO: Mount point '/opt/apps' was mounted. TIMESTAMP: 2023-07-13T16:06:53.351928114Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/opt/apps' was mounted. TIMESTAMP: 2023-07-13T16:06:53.352571723Z
MESSAGE: startup-script: INFO: Mount point '/etc/munge' was mounted. TIMESTAMP: 2023-07-13T16:06:53.356839746Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/etc/munge' was mounted. TIMESTAMP: 2023-07-13T16:06:53.357466454Z
MESSAGE: startup-script: INFO: Mount point '/usr/local/etc/slurm' was mounted. TIMESTAMP: 2023-07-13T16:06:53.362189366Z
MESSAGE: Jul 13 16:06:53 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/usr/local/etc/slurm' was mounted. TIMESTAMP: 2023-07-13T16:06:53.362728110Z
MESSAGE: startup-script: INFO: Mount point '/home' was mounted. TIMESTAMP: 2023-07-13T16:06:56.358536552Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Mount point '/home' was mounted. TIMESTAMP: 2023-07-13T16:06:56.359568384Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: DEBUG: run_custom_scripts: custom scripts to run: /slurm/custom_scripts/() TIMESTAMP: 2023-07-13T16:06:56.599696464Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Starting MUNGE authentication service... TIMESTAMP: 2023-07-13T16:06:56.633712278Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Started MUNGE authentication service. TIMESTAMP: 2023-07-13T16:06:56.681955497Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Reloading. TIMESTAMP: 2023-07-13T16:06:56.699265986Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:23] Unknown lvalue 'StateDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.736840989Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:24] Unknown lvalue 'LogsDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.737233032Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.737609049Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:23] Unknown lvalue 'StateDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.738059961Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:24] Unknown lvalue 'LogsDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.738410537Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.738700889Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Started Slurm node daemon. TIMESTAMP: 2023-07-13T16:06:56.769598605Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Reloading. TIMESTAMP: 2023-07-13T16:06:56.819583430Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:23] Unknown lvalue 'StateDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.851831316Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:24] Unknown lvalue 'LogsDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.852293452Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.852662558Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:23] Unknown lvalue 'StateDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.853097143Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:24] Unknown lvalue 'LogsDirectory' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.853522462Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: [/usr/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service:30] Unknown lvalue 'RuntimeDirectoryPreserve' in section 'Service' TIMESTAMP: 2023-07-13T16:06:56.853828043Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 systemd: Started Slurm Cluster Event Daemon. TIMESTAMP: 2023-07-13T16:06:56.884289205Z
MESSAGE: Jul 13 16:06:56 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Check status of cluster services TIMESTAMP: 2023-07-13T16:06:56.916557223Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: INFO: Done setting up compute TIMESTAMP: 2023-07-13T16:06:57.082362712Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 wall[1712]: wall: user root broadcasted 1 lines (38 chars) TIMESTAMP: 2023-07-13T16:06:57.087527857Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 wall[1714]: wall: user root broadcasted 4 lines (118 chars) TIMESTAMP: 2023-07-13T16:06:57.097393110Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 slurmd: slurmd: slurmd version 22.05.4 started TIMESTAMP: 2023-07-13T16:06:57.179327190Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script exit status 0 TIMESTAMP: 2023-07-13T16:06:57.264507543Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 google_metadata_script_runner: Finished running startup scripts. TIMESTAMP: 2023-07-13T16:06:57.264983531Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Started Google Compute Engine Startup Scripts. TIMESTAMP: 2023-07-13T16:06:57.269071781Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Reached target Multi-User System. TIMESTAMP: 2023-07-13T16:06:57.269073565Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Starting Update UTMP about System Runlevel Changes... TIMESTAMP: 2023-07-13T16:06:57.269073886Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Started Update UTMP about System Runlevel Changes. TIMESTAMP: 2023-07-13T16:06:57.284062159Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 systemd: Startup finished in 642ms (kernel) + 2.765s (initrd) + 33.008s (userspace) = 36.416s. TIMESTAMP: 2023-07-13T16:06:57.284064657Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 slurmeventd.py: INFO:googleapiclient.discovery_cache:file_cache is only supported with oauth2client<4.0.0 TIMESTAMP: 2023-07-13T16:06:57.367946810Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 slurmd: slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=7818 TmpDisk=50988 Uptime=37 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null) TIMESTAMP: 2023-07-13T16:06:57.378436736Z
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: launch task StepId=25.0 request from UID:2099065396 GID:2099065396 HOST:10.161.0.60 PORT:50428 TIMESTAMP: 2023-07-13T16:06:58.491820935Z
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: lllp_distribution: JobId=25 implicit auto binding: sockets,one_thread, dist 8192 TIMESTAMP: 2023-07-13T16:06:58.491826766Z
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic TIMESTAMP: 2023-07-13T16:06:58.491827112Z
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x1 TIMESTAMP: 2023-07-13T16:06:58.491827397Z
MESSAGE: TIMESTAMP: 2023-07-13T16:06:59.540723Z
MESSAGE: Jul 13 16:07:00 hpcsmall-debug-ghpc-0 systemd-logind: Power key pressed. TIMESTAMP: 2023-07-13T16:07:00.302885280Z
MESSAGE: Jul 13 16:07:00 hpcsmall-debug-ghpc-0 systemd-logind: Powering Off... TIMESTAMP: 2023-07-13T16:07:00.302888277Z
MESSAGE: Jul 13 16:07:00 hpcsmall-debug-ghpc-0 systemd-logind: System is powering down. TIMESTAMP: 2023-07-13T16:07:00.302888589Z
MESSAGE: TIMESTAMP: 2023-07-13T16:07:00.465454Z
MESSAGE: TIMESTAMP: 2023-07-13T16:07:04.248951070Z
MESSAGE: TIMESTAMP: 2023-07-13T16:07:14.815270Z
MESSAGE: TIMESTAMP: 2023-07-13T16:07:14.815774Z
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
Is the location wrong, or should the config file be there but is missing?
I realize there are a lot of warnings in there but many of them are retries.
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: launch task StepId=25.0 request from UID:2099065396 GID:2099065396 HOST:10.161.0.60 PORT:50428
TIMESTAMP: 2023-07-13T16:06:58.491820935Z
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: lllp_distribution: JobId=25 implicit auto binding: sockets,one_thread, dist 8192
TIMESTAMP: 2023-07-13T16:06:58.491826766Z
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
TIMESTAMP: 2023-07-13T16:06:58.491827112Z
MESSAGE: Jul 13 16:06:58 hpcsmall-debug-ghpc-0 slurmd: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x1
TIMESTAMP: 2023-07-13T16:06:58.491827397Z
This reads as though job 25 may have been matched, executed, and finished. I will confirm by looking at other logs on a Slurm cluster I provisioned.
What is odd is that you are using the debug partition (configured with "exclusive: false"), which should cause the node to remain powered on for several minutes after a job completes. Did you alter any settings in hpc-slurm.yaml?
This error:
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
and the Slurm version "22.05.4" are both leaping out at me. If you are running the tutorial from the most recent commit on main, you should have version v1.20.0, which would provision a node with 22.05.9. git log would begin with this:
commit 252694acbe160611948341ba24f6e010539cfa52 (HEAD -> main, tag: v1.20.0, upstream/main, origin/main, origin/HEAD)
Did you run this tutorial a while back and are coming back to it? You might start with a git pull while on the main branch.
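As a rough sketch of the check described above: after updating, you could confirm which release tag the checked-out commit corresponds to. The helper name toolkit_release is hypothetical and not part of the toolkit; it only wraps git describe, and it assumes the directory you pass is your local clone.

```shell
# Hypothetical helper (not part of the toolkit): print the release tag that
# the checked-out commit of a repository corresponds to.
# Prints nothing if the current commit is not exactly at a tag.
toolkit_release() {
  git -C "$1" describe --tags --exact-match 2>/dev/null
}

# Assumed usage, from the directory where the tutorial repository was cloned:
#   git checkout main && git pull
#   toolkit_release .
```

If the tag printed is older than v1.20.0, or nothing is printed, the clone is probably not at the release commit shown in the git log line above.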
Hi Tom
Yes, I deployed it 3 months ago and now the team wants to use it. OK, I am getting the latest version. Thank you very much.
Regards Sharif
The crux of the matter is this error, which would be fatal (the Slurm machine boots up but can't configure itself):
google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
I think there may have been a quickly-fixed bug that would result in this error a couple months ago. If you run srun -N3 hostname on the latest release, you should observe:
hostname
job runs quickly
Please open a new issue if you do not see that. Thanks!
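If the job still fails, one way to check a node's startup log for the two signals discussed in this thread (the missing config file and the slurmd version) is sketched below. It assumes you have saved the output of the gcloud logging command to a file; here a temporary file is filled with three lines copied from the log above so the snippet runs on its own.

```shell
# Sample lines copied from the startup log in this thread. In practice,
# redirect the gcloud logging output to a file and point log_file at it.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
MESSAGE: Jul 13 16:06:52 hpcsmall-debug-ghpc-0 google_metadata_script_runner: startup-script: ERROR:main:config file not found: /slurm/scripts/config.yaml
MESSAGE: startup-script: INFO: Mount point '/home' was mounted. TIMESTAMP: 2023-07-13T16:06:56.358536552Z
MESSAGE: Jul 13 16:06:57 hpcsmall-debug-ghpc-0 slurmd: slurmd: slurmd version 22.05.4 started TIMESTAMP: 2023-07-13T16:06:57.179327190Z
EOF

# Count the lines reporting the missing config file (fatal for node setup).
config_errors="$(grep -c 'config file not found' "$log_file")"

# Extract the Slurm version that slurmd reports when it starts.
slurmd_version="$(grep -o 'slurmd version [0-9.]*' "$log_file" | awk '{print $3}')"

echo "config errors: $config_errors"
echo "slurmd version: $slurmd_version"
rm -f "$log_file"
```

On a healthy node from the latest release, I would expect the error count to be 0 and the version to match the release (22.05.9 for v1.20.0, per the comment above).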
Hi
I deployed version 1.20 and am getting the following error.
[issharif_c_cameco_com@hpcsmall-login-vicyomx9-001 ~]$ srun -N3 hostname
srun: error: Node failure on hpcsmall-debug-ghpc-0
srun: error: Nodes hpcsmall-debug-ghpc-[0-2] are still not ready
srun: error: Something is wrong with the boot of the nodes.
[issharif_c_cameco_com@hpcsmall-login-vicyomx9-001 ~]$
Regards
Hi
I ran the command "srun -N 1 hostname" on the login node. It tries to spin up the VM "hpcsmall-debug-ghpc-0" but fails to create it. I found an error in the GCP logs with the message: "Error removing user: mkdir /home/packer/.ssh: no such file or directory."
Full error details are provided hereafter.
{
  insertId: "1kw5o1beyxbxu"
  jsonPayload: {
    localTimestamp: "2023-07-13T16:06:35.1660Z"
    message: "Error removing user: mkdir /home/packer/.ssh: no such file or directory."
    omitempty: null
  }
  labels: {
    instance_name: "hpcsmall-debug-ghpc-0"
  }
  logName: "projects/prj-n-005-cloudops-618d/logs/GCEGuestAgent"
  receiveTimestamp: "2023-07-13T16:06:35.710523191Z"
  resource: {
    labels: {
      instance_id: "3405041953608146457" (instance_name: hpcsmall-debug-ghpc-0)
      project_id: "prj-n-005-cloudops-618d"
      zone: "northamerica-northeast1-c"
    }
    type: "gce_instance"
  }
  severity: "ERROR"
  sourceLocation: {
    file: "non_windows_accounts.go"
    function: "main.(*accountsMgr).set"
    line: "161"
  }
  timestamp: "2023-07-13T16:06:35.166029824Z"
}
We deployed a Slurm cluster by following this quickstart: https://cloud.google.com/hpc-toolkit/docs/quickstarts/slurm-cluster