ansible-collections / ansible.windows

Windows core collection for Ansible
https://galaxy.ansible.com/ansible/windows
GNU General Public License v3.0

intermittent "failed to run exec_wrapper action module_powershell_wrapper: Failed to compile C# code" errors #657

Open Yannik opened 2 weeks ago

Yannik commented 2 weeks ago
SUMMARY

With an ever-growing host count (currently 180 Ansible-managed Windows Server 2019/2022 hosts), I am seeing more and more of these errors, which break our deployment CI/CD pipeline:

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: at <ScriptBlock>, <No file>: line 11
fatal: [xxxx]: FAILED! => changed=false 
  msg: |-
    internal error: failed to run exec_wrapper action module_powershell_wrapper: Failed to compile C# code:
    error CS0016: Could not write to output file 'c:\Users\svc_ansible_admin\AppData\Local\Temp\9ad9c096-1bd3-4845-8e20-84ea3f018fd8\bsze5fqf.dll' -- 'The process cannot access the file because it is being used by another process. '

I had already reported this here, in an issue with a similar problem that was successfully resolved thanks to @jborean93.

Would be great if it was possible to solve this one too.

ISSUE TYPE
COMPONENT NAME

Unsure

ANSIBLE VERSION
ansible [core 2.16.11]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /app/lib/python3.12/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /app/bin/ansible
  python version = 3.12.6 (main, Sep  9 2024, 18:09:49) [GCC 13.2.1 20240309] (/usr/local/bin/python)
  jinja version = 3.1.4
  libyaml = True

STEPS TO REPRODUCE

Execute any windows task on enough hosts and you will run into this.

jborean93 commented 2 weeks ago

Unfortunately there is not much we can do here. The code is compiled by csc.exe (called by the C# compiler methods), and the error you see comes from csc.exe itself, not from any code we control. The typical reason for this error is that an AV or other scanning tool is either deleting the file or, in your case, holding an exclusive lock on it. As we don't control how csc.exe works, we have little sway over the outcome here.

We do provide a way to change the temporary directory used here through the remote_tmp option on the shell plugin. This could potentially be changed to a location that is either trusted by the AV or maybe less likely for it to be scanned and locked during the run.
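
As a minimal sketch of that option: the powershell shell plugin reads remote_tmp from the ansible_remote_tmp variable, so it can be set per group in inventory. The file name, group name, and path below are assumptions for illustration only; the path in particular should be one that your AV treats as trusted and that ordinary users cannot write to.

```yaml
# group_vars/windows.yml  (hypothetical file and group name)

# Sets the shell plugin's remote_tmp option for every host in this group.
# 'C:\ansible_tmp' is an assumed example path, not a recommended default.
ansible_remote_tmp: 'C:\ansible_tmp'
```

The same option can also be set in ansible.cfg under the [powershell] section with the remote_tmp key.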

Yannik commented 2 weeks ago

> We do provide a way to change the temporary directory used here through the remote_tmp option on the shell plugin. This could potentially be changed to a location that is either trusted by the AV or maybe less likely for it to be scanned and locked during the run.

As far as I can see, this directory could simply be used by an attacker as well, creating an attack vector? (Unless the code is signed - which I'm sure it isn't. That said - signing of the temporary code done by the ansible controller DOES sound like an interesting idea!)

Anyway - wouldn't a retry/backoff mechanism pretty much solve this problem? Since this is only occurring every couple thousand task executions, it seems very much like unlucky timing.
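
For illustration, the closest thing available today at the playbook level is the standard until/retries/delay task keywords. This is only a rough sketch: the task and the registered variable name are hypothetical, the delay is fixed rather than a true backoff, and whether it reliably catches this particular compile error is an untested assumption.

```yaml
# Hypothetical task; win_ping stands in for any Windows module.
- name: Ping Windows host, retrying on transient failures
  ansible.windows.win_ping:
  register: ping_result            # hypothetical variable name
  until: ping_result is succeeded  # retry while the task result is a failure
  retries: 3                       # maximum number of retries
  delay: 10                        # fixed wait in seconds between attempts
```

This would have to be repeated on every task, which is why a retry inside the exec wrapper itself would be more convenient.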

jborean93 commented 1 week ago

> As far as I can see, this directory could simply be used by an attacker as well, creating an attack vector?

It's certainly not ideal, but potentially just changing it to another var and not the default $env:TEMP might be enough to stop the AV from picking it up.

> That said - signing of the temporary code done by the ansible controller DOES sound like an interesting idea!)

It's certainly something we are potentially looking into, but it raises a lot of questions that make it hard to achieve.

> Anyway - wouldn't a retry/backoff mechanism pretty much solve this problem? Since this is only occurring every couple thousand task executions, it seems very much like unlucky timing.

Not necessarily. In some cases it might, but in others it could just fail every time. In other cases there could be code outside our control that uses Add-Type and not our custom Add-CSharpType. I'd prefer not to add a retry mechanism for such a scenario, but I could be convinced otherwise.

One area I want to also look into for the next Ansible version if I have time is to officially support PowerShell 7.x. This version uses a different compiler mechanism that doesn't require temporary files as the compilation happens in process. This could be the solution to this particular problem. I cannot guarantee that it'll be done in the next release though, just something that's on my mind.

Yannik commented 1 week ago

I am experimenting with remote_tmp now, but I suspect that the AV simply has a look at all new files, no matter which directory they are in.

Seeing that async_dir is set to %USERPROFILE%\.ansible_async, I configured remote_tmp to %USERPROFILE%\.ansible_tmp, kind of expecting the directory to be hidden. That is not the case, since Windows does not treat dot-prefixed items as hidden; it requires the hidden attribute. Any reason for still using the dot-prefix on async_dir? Or are you additionally setting the hidden attr on that one?

The remote_tmp dir is actually not even getting deleted after task/playbook execution, is that on purpose?

I have not rolled this out to prod just yet, so I cannot report any results on the effectiveness of fixing the errors.

> One area I want to also look into for the next Ansible version if I have time is to officially support PowerShell 7.x. This version uses a different compiler mechanism that doesn't require temporary files as the compilation happens in process. This could be the solution to this particular problem. I cannot guarantee that it'll be done in the next release though, just something that's on my mind.

Sounds interesting to have that option! (Even though I don't see us rolling out PowerShell 7.x to all servers in the near future.)

jborean93 commented 1 week ago

> Any reason for still using the dot-prefix on async_dir?

It's to replicate the same behaviour on the Linux side where the dir is ~/.ansible_async and . means hidden there. We are not explicitly setting the hidden attribute.

> The remote_tmp dir is actually not even getting deleted after task/playbook execution, is that on purpose?

The directory itself isn't deleted. The value is meant to be a location inside which each module creates its own temp directory. The default is %TEMP%, which means that when a temp directory is needed, it is created inside that dir, and that per-module directory is the one that should be cleaned up.