GoogleCloudPlatform / google-cloud-ops-agents-ansible

Ansible Role for Google Cloud Ops
https://cloud.google.com/products/operations
Apache License 2.0

Support Ops Agent 2.x.x versions #64

Closed jsirianni closed 3 years ago

jsirianni commented 3 years ago

Testing the role against Ops Agent 2.0.0 fails with the following error:

fatal: [10.33.104.160]: FAILED! => {"changed": false, "msg": "Could not find the requested service google-cloud-ops-agent.target: host"}

When using an older version, 1.0.5, the role completes without error.

qingling128 commented 3 years ago

One of the breaking changes with 2.0.0 is that we replaced google-cloud-ops-agent.target with google-cloud-ops-agent.service: https://github.com/GoogleCloudPlatform/ops-agent/pull/119. See https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/transition#commands for the new commands to use.

qingling128 commented 3 years ago

Added a compatibility matrix: https://github.com/GoogleCloudPlatform/google-cloud-ops-agents-ansible/pull/65. We'll need to prioritize the work to support Ops Agent 2.0.0 with this role next.

rmoriar1 commented 3 years ago

Is conditionally setting the service name based on version the only change that's needed to be compatible with 2.x.x?

jsirianni commented 3 years ago

I have a fork I am working with right now. The only change I have made is in vars/main.yml:

-ops-agent_service_name: google-cloud-ops-agent.target
+ops-agent_service_name: google-cloud-ops-agent
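
Instead of hard-coding one name, the service name could be derived from the requested agent version, which is what the question above suggests. A minimal sketch, assuming a hypothetical `ops_agent_version` variable that holds a concrete version string (the role's real variable names may differ) and using Ansible's built-in `version` test:

```yaml
# Sketch only: `ops_agent_version` and `ops_agent_service_name` are
# hypothetical names, not necessarily the ones this role uses.
- name: Select the Ops Agent unit name for the requested version
  ansible.builtin.set_fact:
    ops_agent_service_name: >-
      {{ 'google-cloud-ops-agent'
      if ops_agent_version is version('2.0.0', '>=')
      else 'google-cloud-ops-agent.target' }}
```

This assumes the version is given as a number such as `1.0.5` or `2.0.0`; a value like `latest` would likely need separate handling before the comparison.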

It should also be noted that the Ops Agent service no longer stays running. Its purpose is to start the Fluent Bit and OpenTelemetry Collector services and then exit.

root@agent2:~# systemctl status google-cloud-ops-agent
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; ve
   Active: inactive (dead) since Thu 2021-07-01 17:02:49 UTC; 36min ago
  Process: 8890 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 8890 (code=exited, status=0/SUCCESS)

Jul 01 17:02:49 agent2 systemd[1]: Starting Google Cloud Ops Agent...
Jul 01 17:02:49 agent2 systemd[1]: google-cloud-ops-agent.service: Succeeded.
Jul 01 17:02:49 agent2 systemd[1]: Started Google Cloud Ops Agent.

The service file looks like this:

# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[Unit]
Description=Google Cloud Ops Agent
Requires=google-cloud-ops-agent-fluent-bit.service google-cloud-ops-agent-opentelemetry-collector.service

[Service]
Type=oneshot
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

I suspect the test cases will need to consider this.

qingling128 commented 3 years ago

> It should also be noted that the Ops Agent service no longer stays running. Its purpose is to start the Fluent Bit and OpenTelemetry Collector services and then exit.

Yes, that's the current behavior. We are still discussing internally how to improve that UX. The fact that the root service shows as inactive (dead) could be confusing to users.

BTW, do Ansible and Puppet have to know the internals of the agents? We added install / upgrade / uninstall / versioning features to the installation scripts, hoping that in cases like this we would only need to update the installation script and could leave the Ansible / Puppet / Chef / SaltStack implementations untouched. Restarting the agent is not supported yet, but we could definitely add it to the installation script, so that the script picks the right restart command based on the agent version.

jsirianni commented 3 years ago

So far, the only change required for Ansible and Puppet has been changing the service name (removing .target). The test cases fail because they expect the service to be running.

I am working on modifying the Ansible test cases to consider the ops-agent major version when deciding which service(s) should be running.
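
A version-aware check along those lines could look roughly like the following. This is a sketch, not the role's actual test code: `ops_agent_version` is a hypothetical variable, the 2.x sub-service names are taken from the `Requires=` line of the unit file shown earlier, and the 1.x branch is left out because this thread does not list the units behind `google-cloud-ops-agent.target`.

```yaml
# Sketch only: `ops_agent_version` is a hypothetical variable name.
- name: Collect the state of all services on the host
  ansible.builtin.service_facts:

# For 2.x, google-cloud-ops-agent.service itself is a oneshot unit that
# exits right away, so only the two sub-services are checked here.
- name: Check that the Ops Agent 2.x sub-services are running
  ansible.builtin.assert:
    that:
      - item in ansible_facts.services
      - ansible_facts.services[item].state == 'running'
    fail_msg: "{{ item }} is not running"
  loop:
    - google-cloud-ops-agent-fluent-bit.service
    - google-cloud-ops-agent-opentelemetry-collector.service
  when: ops_agent_version is version('2.0.0', '>=')
```

Checking the sub-services rather than the root unit avoids treating the expected inactive (dead) status as a failure.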