Stackdriver / collectd

Stackdriver's monitoring agent based on collectd (http://collectd.org).
https://cloud.google.com/monitoring/agent/
Other
51 stars 15 forks source link

Bring the Stackdriver fork up-to-date with collectd-5.8.1 #150

Closed igorpeshansky closed 5 years ago

igorpeshansky commented 5 years ago

This PR will not be merged — it's just there to examine the changes from stackdriver-agent-5.5.2 that were rebased on top of the upstream collectd-5.8.1 tag. Once we're satisfied with the changes, we will rename the PR branch to stackdriver-agent-5.8.1 and close the PR.

I've tried to keep the commit attributions and dates (for both author and committer) where feasible. I've also squashed some fixups into the original commits and dropped applied/reverted pairs of commits.

This builds on and supersedes #143.

googlebot commented 5 years ago

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

googlebot commented 5 years ago

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

igorpeshansky commented 5 years ago

All commits in this PR were made by folks who were at Google at the time of the contribution.

googlebot commented 5 years ago

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

googlebot commented 5 years ago

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

kkm000 commented 5 years ago

@igorpeshansky, I've built the agent from this PR's branch for Debian 10 (it's not yet supported, but I am running it), and, running on a machine without swap file in the default configuration, I am seeing this on agent's startup

Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: Asking metadata server for auth token
Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:20:36 xa-test-ag collectd[1658]: write_gcm: Server response (CollectdTimeseriesRequest) contains errors:
                                           {
                                             "payloadErrors": [
                                               {
                                                 "valueErrors": [
                                                   {
                                                     "error": {
                                                       "code": 3,
                                                       "message": "Points must be written in order. One or more of the points specified had an older end time than the most rec
                                                     }
                                                   }
                                                 ]
                                               },
                                               {
                                                 "index": 1,
                                                 "valueErrors": [
                                                   {
                                                     "error": {
                                                       "code": 3,
                                                       "message": "Points must be written in order. One or more of the points specified had an older end time than the most rec
                                                     }
                                                   }
                                                 ]
                                               },
                                               {
                                                 "index": 2,
                                                 "valueErrors": [
                                                   {
                                                     "error": {
                                                       "code": 3,
                                                       "message": "Points must be written in order. One or more of the points specified had an older end time than the most rec
                                                     }
                                                   }
                                                 ]
                                               },
                                               {
                                                 "index": 3,
                                                 "valueErrors": [
                                                   {
                                                     "error": {
                                                       "code": 3,

(the error is truncated in the log as quoted), and the first few error lines repeat every minute.

Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:22:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:23:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:23:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:23:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:23:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.
Jul 31 19:23:36 xa-test-ag collectd[1658]: write_gcm: can not take infinite value
Jul 31 19:23:36 xa-test-ag collectd[1658]: write_gcm: wg_typed_value_create_from_value_t_inline failed for swap/percent/value! Continuing.

Disabling the swap plugin eliminates the error, naturally.

Perhaps the plugin should not be loaded by default?

Another possibly notable warning is

Jul 31 19:32:48 xa-test-ag collectd[2162]: set_thread_name("write_gcm consumer"): name too long

I did not test yet if metrics are logged; just posting this immediately as I discovered the problem. I'll let you know how it goes when I complete the testing.

There were a couple of surprises during the build; most notable is the failed configure check for the libssl if using OpenSSL v1.1 (and v1.0 is not available in Debian 10). A fix is to remove the following test, as SSL_library_init is defined as a backward-compat macro now: https://github.com/Stackdriver/collectd/blob/9fecb8fd956c41cebeb4cb36385011b041cdd54c/configure.ac#L5098-L5111

(the same test is present on the current version branch).

Should I report is as a separate issue? Here or in https://github.com/Stackdriver/agent-packaging? I don't think a PR against this branch is a good idea, since you are force-pushing it.

The libssl non-detection is relatively harmless; the write_gcm includes <openssl/...> headers without checking for the HAVE_LIBSSL which remains undefined, so the agent package builds fine in the end even without the fix, and the package depends on libssl via libcurl[34] package indirectly anyway. But it might have been cleaned up a bit.

I'll be happy to share my Debian build notes, too, in case you are planning to create a package for it.

kkm000 commented 5 years ago

It is probable (not 100% sure yet) that the apparent zero-division in the swap computation is a regression introduced upstream.

But realistically, if the user has no swap, they are not interested in the swap metric (which is the trivial "0 out of 0 used" sample), so my take is still just exclude the swap plugin from the default .conf file. It's easier to add plugins to the default configuration by dropping files into the .d directory than to exclude one.

googlebot commented 5 years ago

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

googlebot commented 5 years ago

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

igorpeshansky commented 5 years ago

@kkm000 Thanks for testing. The "name too long" error set_thread_name is a real error, and is fixed in the latest push. We don't normally test on machines with no swap partition (this is the first report we saw of using the agent on a machine without one), so if that's your setup, you would have to disable the swap plugin by editing the main config. As a matter of policy, we don't normally touch the user's config.d directory, to avoid accidentally overwriting something. As for the build errors on Debian 10, note that the agent doesn't yet support that distro. Once we do, we will have a separate control file in Stackdriver/agent-packaging with all the right configure options. I know we added support for OpenSSL v1.1 as part of the changes for Ubuntu Bionic, but they may not have been as clean as we'd wanted them to be. Once that support is added, we'd be happy to review a PR.

kkm000 commented 5 years ago

@igorpeshansky, thanks for your reply. Yup, I know Debian 10 is not supported yet, but the Stretch distro is quite dated, so I had little options to make all our stuff churning. This is why I'm building agent from source in the first place. :) Buster works on GCE quite well, in fact, when using the full (as opposed to the "cloud") kernel.

you would have to disable the swap plugin by editing the main config. As a matter of policy, we don't normally touch the user's config.d directory, to avoid accidentally overwriting something.

Yes, of course; and this is precisely why I mentioned possibly not including the swap plugin into the baseline configuration. AFAIK, collectd does not have a way to disable a plugin loaded earlier in the config file, so, if the swap plugin is loaded in the top-level collectd.conf, the user has no option of dropping any config snippet into the collectd.d directory to undo that. The only option is to edit the top-level, package-managed config, and the change must then be maintained through package upgrades.

This does not affect me personally at all, since I am rolling my own build, and can, and do deploy a modified collectd.conf file. This is rather a general usability point: from the future-proofing perspective, it seems to be less troublesome for the user to add a plugin to the baseline than remove one.

Anyway, this is likely quite a minor point, as the package is old, stable, and updated rarely. Just sharing my thoughts.

Thanks for the fix, by the way! I'll pull the updated source for my next build.

googlebot commented 5 years ago

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

igorpeshansky commented 5 years ago

Accidental push. Reverted now.

googlebot commented 5 years ago

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

igorpeshansky commented 5 years ago

This rebased branch was pushed to stackdriver-agent-5.8.1.