Unable to update python-based check without restart

jar349 commented 6 years ago

Output of the info page (if this is a bug)

The link in the template 404s.  Send me a doc and I'll provide the info page.

Describe what happened: We have a python check that uses the built-in pwd module to look up a passwd entry by username (the pwd.getpwnam() method). It is set to look up a user that does not exist locally in the /etc/passwd database but rather in sssd (linked to LDAP). This tests for us that our sssd is working with our LDAP.

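# imports were omitted from the original paste; the AgentCheck path is the
# Agent 6 one and is an assumption here
import pwd
import time

from datadog_checks.checks import AgentCheck

class Sssd(AgentCheck):
    USERNAME = "not-in-etc-password"

    def check_sssd(self):
        try:
            user = pwd.getpwnam(Sssd.USERNAME)
            self.log.info("Successfully looked up user=%s" % Sssd.USERNAME)
            return AgentCheck.OK
        except KeyError as err:
            self.log.error("Failed to look up user=%s, exception=%s" % (Sssd.USERNAME, err))
            return AgentCheck.CRITICAL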

    def check(self, instance):
        start_time = time.time()
        sssd_status = self.check_sssd()
        end_time = time.time()

        self.service_check("sssd.up", sssd_status)
        self.gauge("sssd.response_time", end_time - start_time)

With this check in place, it has caught a few occasions when sssd went down. The problem is that the check keeps failing even after sssd is back up.

ERROR | (datadog_agent.go:133 in LogMessage) | (sssd.py) | Failed to look up user=not-in-etc-password, exception='getpwnam(): name not found: not-in-etc-password'

even while I use datadog-agent's embedded python to essentially do the same thing:

/opt/datadog-agent/embedded $ bin/python
Python 2.7.14 (default, Apr 13 2018, 09:11:01)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pwd
>>> print pwd.getpwnam("not-in-etc-password")
pwd.struct_passwd(pw_name='not-in-etc-password', pw_passwd='*', pw_uid=1530043775, pw_gid=100, pw_gecos='not-in-etc-password', pw_dir='/home/not-in-etc-password', pw_shell='/bin/false')

Remember, this check used to pass until our sssd went down for a few minutes. So, let's see what's in the entire passwd database, ok? We updated our check to have the following:

    def check_sssd(self):
        try:
            # named all_entries to avoid shadowing the built-in all()
            all_entries = pwd.getpwall()
            self.log.info("Entire database: %s" % all_entries)

            user = pwd.getpwnam(Sssd.USERNAME)
            self.log.info("Successfully looked up user=%s" % Sssd.USERNAME)
            return AgentCheck.OK
        except KeyError as err:
            self.log.error("Failed to look up user=%s, exception=%s" % (Sssd.USERNAME, err))
            return AgentCheck.CRITICAL

This check runs once a minute and we let it go for 3 minutes and all the log entries remain exactly the same as before - no entry in the log for the entire database. We notice there's a *.pyc file and think it may be using that instead, so we delete it. No effect.

Of course, we can restart the datadog-agent and not only do we see the new log entry, but the alert clears because the check starts working again. datadog-agent is supposed to be the watcher... we don't want to add another watcher to watch the watcher. Nor are we enthused about a cron job to periodically restart the agent.

Thus we come to you for advice. We don't know how datadog-agent's internals work, nor any python runner you're executing. Perhaps something is being cached? We found this wiki article, which says: "The function returns a list of Check instances, can be invoked multiple times and can run on a separate goroutine." Perhaps you could run this periodically and expose the schedule as a configuration option?

Describe what you expected: We expect our sssd alert to clear when sssd restarts. We expect changes in the python checks to take effect without a restart of the datadog agent because python is interpreted.

Steps to reproduce the issue: See above

Additional environment details (Operating System, Cloud provider, etc):

Debian Jessie instance on AWS
datadog agent 6
sssd 1.16.1

olivielpeau commented 6 years ago

Hi @jar349, thanks for opening this issue.

When the Agent starts, it initializes an embedded python interpreter that's used to run all the python checks. This "instance" of the python interpreter remains loaded in the Agent process until the Agent is stopped.

There are some subtle differences from a "regular" python runtime because the embedded interpreter runtime runs checks on different OS threads, but essentially the embedded python interpreter behaves the same as running /opt/datadog-agent/embedded/bin/python directly.

So, unless the pwd module (and/or the underlying system calls) are sensitive to the OS thread they're called from, the logic that runs your check can be approximately simplified to:

# needs an import of your `Sssd` class
import time

CHECK_INTERVAL = 60 # in seconds
sssd = Sssd()
while True:
    sssd.check({})
    time.sleep(CHECK_INTERVAL)

When directly using one long-lived process of the python interpreter outside the Agent (/opt/datadog-agent/embedded/bin/python), does the pwd module allow you to actually detect when sssd is back up after it went down?

jar349 commented 6 years ago

Thanks for your response. We originally had a similar thought: that pwd might be caching or something similar.

So we went to the source and found that python's pwd module implementation just calls out to glibc's pwd.h, which declares the getpwnam function (no caching). What does glibc's implementation do? It goes to NSS and uses NSS's getXXbyYY.c. From top to bottom, there's nothing fancy in any of those implementations: no caching, just straight lookups.

So what's our /etc/nsswitch.conf you might ask?

passwd:          compat sss

compat just emulates the old-school NIS ability to reference external NIS maps, but we don't even do that in our /etc/passwd. So we temporarily removed compat and there was no change in behavior.

Our best guess right now is that there's some unintended side-effect of how the datadog-agent works. Our best evidence is that you're loading certain things in once and not re-loading them. Also, we know that datadog-agent is a Go program that's calling out to python somehow. Nothing wrong with that, but it's the kind of thing that can lead to unintended side-effects because we're dealing with the magic of child processes, you know?

But unless there's some strange pre-load of the linux passwd database being stored with the process's environment... we also don't see how loading things in once and not re-loading could be the problem. We're grasping at straws and hoping you can lend some insight that might lead to an ah-hah moment.

Separately, this question of not being able to update python checks is easier to understand. You might investigate python's imp.load_module():

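# needs an import of your `Sssd` class
import imp
import time

CHECK_INTERVAL = 60 # in seconds
RELOAD_INTERVAL = 300 # in seconds (from configuration)
last_reload = time.time()
sssd = Sssd()
while True:
    if time.time() - last_reload > RELOAD_INTERVAL:
        # imp.load_module() needs the (file, pathname, description) triple
        # returned by imp.find_module(); 'checks' stands in for whichever
        # module defines `Sssd`
        module_file, pathname, description = imp.find_module('checks')
        try:
            checks = imp.load_module('checks', module_file, pathname, description)
        finally:
            if module_file:
                module_file.close()
        sssd = checks.Sssd()  # instantiate the freshly loaded class
        last_reload = time.time()
    sssd.check({})
    time.sleep(CHECK_INTERVAL)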

That's not very pretty, but you don't have to redesign things to get reloads. A cleaner way would be to have your python runner call out to each check in a separate interpreter via os.system(). That way, each check gets reloaded every time it's run. Either way, I think that the capacity to update the checks without having to bounce the datadog-agent is a feature request worthy of consideration. But it's not the main problem that I'm trying to solve.

olivielpeau commented 6 years ago

Thanks for the additional details @jar349.

Just to clarify, I don't think simply making the python interpreter re-load the python module will fix the issue you're experiencing. Re-loading the python module only means that the python interpreter will parse the related .py source files again and update its internal references, but it has no effect on the process state or the libraries loaded by the process.
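A minimal sketch of that distinction, assuming Python 3's importlib (on the Agent's embedded Python 2.7, the built-in reload() plays a similar role). The module name reload_demo_check and its MESSAGE constant are made up for illustration and are not part of the Agent. The sketch rewrites a module on disk, reloads it, and checks that the code changed while the process and an already-loaded C module did not:

```python
import importlib
import os
import pwd
import sys
import tempfile

# Skip .pyc files so the reload always recompiles the edited source.
sys.dont_write_bytecode = True


def write_module(directory, body):
    """Write the hypothetical reload_demo_check.py module into `directory`."""
    path = os.path.join(directory, "reload_demo_check.py")
    with open(path, "w") as handle:
        handle.write(body)
    return path


workdir = tempfile.mkdtemp()
write_module(workdir, "MESSAGE = 'old code'\n")
sys.path.insert(0, workdir)
importlib.invalidate_caches()

import reload_demo_check  # noqa: E402 (imported after the file is created)

pid_before = os.getpid()
pwd_before = sys.modules["pwd"]
first = reload_demo_check.MESSAGE

# Edit the source on disk, then ask the interpreter to re-read it.
write_module(workdir, "MESSAGE = 'new code, longer'\n")
importlib.reload(reload_demo_check)

second = reload_demo_check.MESSAGE
pid_after = os.getpid()
pwd_after = sys.modules["pwd"]
```

After the reload, `second` holds the new string, but the process ID and the loaded pwd module object are the same as before, so any state held below the Python layer (for example inside a C library loaded by the process) is untouched.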

To troubleshoot this further, I'd recommend running a snippet of code close to the following, with /opt/datadog-agent/embedded/bin/python:

import pwd
import time

USERNAME = "username"

while True:
    try:
        user = pwd.getpwnam(USERNAME)
        print user
    except Exception as e:
        print "Failed to look up user={}, exception={}".format(USERNAME, e)
    time.sleep(10) # in seconds

and trying to reproduce the issue when the sssd daemon goes down and back up. strace-ing the python process could help, along with debug logs from sssd to understand if the client (our python process) correctly communicates with sssd, including after sssd has had issues.

Let me know if this helps.

jar349 commented 6 years ago

Yes, there are two separate things going on with this one issue and I'm sorry for that. I agree that making the python interpreter re-load will not fix my issue. Still, I believe it's a good feature.

We've already run your experiment. It runs correctly and returns the user - all while the erroneously executing datadog-agent-spawned check continues to fail. This is why we believe there's something environmental going on. If we restart the datadog-agent, it too begins working correctly.

olivielpeau commented 6 years ago

Ok, thanks for clarifying!

Are you able to reliably reproduce the issue? If so, when the issue happens with the Agent, can you see anything in sssd's logs that would indicate that your check is actually connecting to sssd? For reference, see items 3 and 4 in sssd's troubleshooting docs: https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html#getent-passwd-or-id-doesn-t-print-the-user-or-getent-group-doesn-t-print-the-group-at-all

This issue may be related to the multi-threaded nature of the check runs on Agent v6: 2 consecutive runs of the same check won't necessarily happen on the same OS thread. This could cause some thread-related bug in the sss client, which in this case may not be able to properly re-connect to sssd after it restarts. Unfortunately python's pwd doesn't give much info on the actual error that's returned by getpwnam, so I'd advise trying to see if the check can at least connect to sssd after a restart.
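One way to get more detail than pwd's KeyError is to call the reentrant libc function getpwnam_r(3) directly through ctypes, since it returns an error number separately from the "no entry" case. The sketch below is an assumption-laden illustration, not something the Agent provides: the helper names lookup_with_errno and describe are hypothetical, and whether the sss client actually reports a distinct error code when it cannot reach sssd is not confirmed in this thread.

```python
import ctypes
import os

# Load the symbols already linked into this process (includes libc on Linux).
_libc = ctypes.CDLL(None)
_libc.getpwnam_r.restype = ctypes.c_int
_libc.getpwnam_r.argtypes = [
    ctypes.c_char_p,                  # const char *name
    ctypes.c_void_p,                  # struct passwd *pwd (treated as opaque)
    ctypes.c_char_p,                  # char *buf
    ctypes.c_size_t,                  # size_t buflen
    ctypes.POINTER(ctypes.c_void_p),  # struct passwd **result
]


def lookup_with_errno(username):
    """Return (code, found) from getpwnam_r(3).

    code is 0 or an errno value; found is True when an entry was returned.
    """
    name = username if isinstance(username, bytes) else username.encode("utf-8")
    # Oversized scratch space for struct passwd; its fields are never parsed,
    # so the sketch does not depend on the platform's struct layout.
    passwd_struct = ctypes.create_string_buffer(512)
    string_buffer = ctypes.create_string_buffer(16384)
    result = ctypes.c_void_p()
    code = _libc.getpwnam_r(
        name, passwd_struct, string_buffer, len(string_buffer),
        ctypes.byref(result),
    )
    return code, result.value is not None


def describe(code, found):
    """Turn the (code, found) pair into a short log-friendly string."""
    if found:
        return "found"
    if code == 0:
        return "not found (no error reported)"
    return "lookup error: %s" % os.strerror(code)
```

Logging describe(*lookup_with_errno(Sssd.USERNAME)) from the check next to the existing pwd.getpwnam() call would show whether the failing lookups come back as a plain "not found" or with an error number.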

jar349 commented 6 years ago

Thank you very much for the link. We've gone through those steps and found that there are no sssd log entries for not-in-etc-password. This was part of the evidence that made us think that the problem was not with sssd: we saw log entries for other lookups... but not for the one in the script. If we restart the datadog agent, it works and we do see entries in sssd.

olivielpeau commented 6 years ago

Thanks for confirming this! It could point to an issue in the sss client when used from different threads of the same process (although it's just a guess at this point).
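To test the threading guess outside the Agent, the standalone loop from earlier in the thread can be changed so that every lookup happens on a fresh OS thread of one long-lived process. This is only a rough approximation of how Agent v6 schedules check runs, and the helper names lookup_in_new_thread and run_rounds are made up for this sketch:

```python
import pwd
import threading


def lookup_in_new_thread(username):
    """Run pwd.getpwnam(username) on a fresh thread.

    Returns (thread_id, found, error), where error is None on success.
    """
    outcome = {}

    def worker():
        outcome["thread_id"] = threading.current_thread().ident
        try:
            pwd.getpwnam(username)
            outcome["found"] = True
            outcome["error"] = None
        except KeyError as err:
            outcome["found"] = False
            outcome["error"] = str(err)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return outcome["thread_id"], outcome["found"], outcome["error"]


def run_rounds(username, rounds=3):
    """Perform several lookups, each on its own newly created thread."""
    return [lookup_in_new_thread(username) for _ in range(rounds)]
```

Calling run_rounds() in a loop with a time.sleep() between iterations, and restarting sssd while it runs, would show whether lookups made from new threads of an old process recover once sssd is back.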

If you have the chance, and if you can reproduce the issue reliably, it'd be great if you could strace an Agent that runs only your custom check (sudo strace -fp $(pgrep -f datadog-agent/bin/agent/agent)), ideally at the transition when it starts failing lookups on sssd.

And one more question: which version of sssd are you using?

jar349 commented 6 years ago

We're using sssd 1.16.1.

We're not able to reproduce the issue, and it hasn't happened in about a week.

We've run strace against the agent process, but it doesn't tell us much because it's not really the process that runs the check - which child process should we strace? Which is the python runner?

olivielpeau commented 6 years ago

The python runtime runs in the same process as the main agent (the agent initializes and invokes the embedded python runtime through its C API). The agent process is multi-threaded, so when strace-ing you have to make sure you also include the other threads of the process; hence the -f flag on the strace command above. You may also want to add the -b execve parameter to filter out uninteresting syscalls of the child processes created while strace-ing the main process.

If you run into this issue again, could you send a flare to our support team before you restart your agent? This may help us understand what's wrong.

As a workaround, you could make your check run getent passwd not-in-etc-password in a subprocess (using the get_subprocess_output function from datadog_checks.utils.subprocess_output) and parse its output. This would allow running the sss client in a new process at each check run (and would avoid the interaction between the long-lived thread/process state and the sss client which could be causing your issue).
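A rough sketch of that workaround follows. To keep it self-contained it uses the standard library's subprocess module rather than get_subprocess_output, so treat that as a substitution; inside a real check the Agent's helper would be the natural choice. The function names parse_passwd_line and lookup_via_getent and the getent_cmd parameter are hypothetical.

```python
import subprocess

PASSWD_FIELDS = ("name", "passwd", "uid", "gid", "gecos", "home", "shell")


def parse_passwd_line(line):
    """Split one passwd-format line into a dict, or return None if malformed."""
    parts = line.strip().split(":")
    if len(parts) != len(PASSWD_FIELDS):
        return None
    entry = dict(zip(PASSWD_FIELDS, parts))
    try:
        entry["uid"] = int(entry["uid"])
        entry["gid"] = int(entry["gid"])
    except ValueError:
        return None
    return entry


def lookup_via_getent(username, getent_cmd=("getent", "passwd")):
    """Look up `username` in a fresh process; return the entry dict or None.

    Each call starts a new getent process, so the NSS/sss client state is
    rebuilt from scratch on every check run.
    """
    cmd = list(getent_cmd) + [username]
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
    except OSError:
        return None  # getent itself could not be started
    out, _ = proc.communicate()
    if proc.returncode != 0:
        return None  # getent exits non-zero when the key is not found
    return parse_passwd_line(out.decode("utf-8", "replace"))
```

In check_sssd(), returning AgentCheck.OK when lookup_via_getent(Sssd.USERNAME) is not None and AgentCheck.CRITICAL otherwise would replace the in-process pwd.getpwnam() call.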

jar349 commented 6 years ago

Do you have a Case ID for me?

olivielpeau commented 6 years ago

@jar349 you can leave the case ID blank, it'll create a new case on our side

DataDog / datadog-agent

Unable to update python-based check without restart #2174