GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL filtered from environment

dbnicholson commented 3 months ago

The documented way to run the checker and get the desired git author information is to set the GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL environment variables. Unfortunately, these are filtered out because they contain the phrase auth.

I think this is also why the documentation says to set EMAIL. It shouldn't be needed except that GIT_AUTHOR_EMAIL doesn't get through. In endless-key-flatpak, our commits are being authored with root <os@endlessos.org>. In other words, we get the email address because we also set EMAIL in the workflow, but we lose our desired GIT_AUTHOR_NAME.

Not sure the best way to handle it. The sensitive variable filtering seems useful, and it does make sense that environment variables containing auth are likely to be one of those. On the other hand, why is the environment being filtered? Because it's dumped to the output? I would think some environment variable containing credentials might actually be required to run f-e-d-c depending on what the checkers do or what the git push does. @gasinvein?

dbnicholson commented 3 months ago

I went back and looked at #97 and I don't see a real requirement to filter environment variables. There is definitely logic in not leaking potentially sensitive data to other processes, but filtering environment variables is both not sufficient and potentially wrong as seen here.

I also don't totally get why it happens for the committing portion since it uses regular subprocess.check_call instead of the environment filtering helper. Anyways, I think clear_env can likely be dropped or only used when dumping environment variables to output.

dbnicholson commented 2 months ago

I threw in a bunch of debug logging and this is quite strange. I'm running the checker with GIT_AUTHOR_NAME set to some value in the environment. I added a helper that dumps the value out if it's set. First the helper looked like this:

def dump_author():
    if 'GIT_AUTHOR_NAME' in os.environ:
        log.info(f'GIT_AUTHOR_NAME={os.environ["GIT_AUTHOR_NAME"]}')
    else:
        log.info('GIT_AUTHOR_NAME not in environment')

Then I added a bunch of calls to the helper. I ran the test tests.test_main.TestEntrypoint since it runs the whole thing including making a git commit. According to the output, GIT_AUTHOR_NAME remained set in the environment the whole time even though the git commit output showed it wasn't used!

So, then I changed the helper to this:

def dump_author():
    env = subprocess.check_output(['env'], text=True).splitlines()
    for line in env:
        if line.startswith('GIT_AUTHOR_NAME='):
            log.info(line)
            return

    log.info('GIT_AUTHOR_NAME not in environment')

And now it showed the environment variable not being set right after this. Uh, what? So, somehow the environment used by subprocess was being corrupted but os.environ/os.getenv() were not?

dbnicholson commented 2 months ago

On a whim I decided to make this change since the environment variables weren't actually being altered:

diff --git a/src/lib/utils.py b/src/lib/utils.py
index 92ac90b..2aa3bec 100644
--- a/src/lib/utils.py
+++ b/src/lib/utils.py
@@ -273,6 +273,8 @@ def filter_versions(

 def clear_env(environ):
-    new_env = copy.deepcopy(environ)
+    new_env = environ.copy()
     for varname in new_env.keys():
         if any(i in varname.lower() for i in ["pass", "token", "secret", "auth"]):

This caused a crash with RuntimeError: dictionary changed size during iteration since the loop edits the dictionary in place. That seems obvious in retrospect. Why this succeeds with the dictionary created with copy.deepcopy, I have no idea. Making the required change:

diff --git a/src/lib/utils.py b/src/lib/utils.py
index 92ac90b..0beecfa 100644
--- a/src/lib/utils.py
+++ b/src/lib/utils.py
@@ -272,7 +272,9 @@ def filter_versions(

 def clear_env(environ):
-    new_env = copy.deepcopy(environ)
-    for varname in new_env.keys():
+    new_env = environ.copy()
+    for varname in list(new_env.keys()):
         if any(i in varname.lower() for i in ["pass", "token", "secret", "auth"]):
             log.debug("Removing env %s", varname)

And it works! Going back to copy.deepcopy but keeping the list(new_env.keys()) does not, though. I have no idea why, but presumably the environment filtering is not playing nice with all the async execution.

dbnicholson commented 2 months ago

However, while I think it would be possible to fix this, I think it should just be dropped entirely. This type of blind filtering is bound to break things (as it has here) by assuming that the processes that are being executed don't require those environment variables.

flathub-infra / flatpak-external-data-checker

GIT_AUTHOR_NAME and GIT_AUTHOR_EMAIL filtered from environment #413