Closed pseudomorph closed 1 year ago
It seems like the main process CWD at some point gets set to the CWD of one of the temp dirs, which is then deleted, causing all subsequent threads or operations to fail when performing os.getcwd operations.
Found this for the main process after the event occurred:
/proc/2345/cwd -> /tmp/tmpnrly7tay (deleted)
My guess would be that passing the cwd= param in subprocess is changing the cwd for the main process, which leaves it vulnerable if some subprocess encounters an error.
I imagine that using fully-qualified paths in the shell-out commands, and terraform -chdir=path <command> (as this doesn't change the cwd for the main Python process), would alleviate this.
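As a sketch of that suggestion (the module path and sub-command below are placeholder examples, not Terrareg's actual call sites): building the terraform invocation with -chdir keeps the directory change inside the child process, so no cwd= argument is needed:

```python
import subprocess


def terraform_cmd(module_dir, *args):
    # -chdir makes terraform itself switch into module_dir, so the
    # parent Python process's working directory is never touched.
    return ["terraform", f"-chdir={module_dir}", *args]


# Example invocation (requires terraform on PATH):
# subprocess.run(terraform_cmd("/app/modules/example", "init"), check=True)
```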
--
Or, replacing the os.getcwd calls with absolute references.
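A minimal sketch of the absolute-reference idea (the names here are hypothetical, not from the Terrareg codebase): capture a base directory once at startup and resolve paths against it, so nothing later depends on the live cwd:

```python
import os

# Captured once at import time, while the cwd is still valid.
BASE_DIR = os.path.abspath(os.getcwd())


def data_path(*parts):
    # Resolve against the captured base dir instead of calling
    # os.getcwd() later, when the cwd may have been deleted.
    return os.path.join(BASE_DIR, *parts)
```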
Hey @pseudomorph,
Thank you for reporting this. It's interesting indeed - I'm surprised that spawning a subprocess would change the cwd of the parent process. I'll have a look through for any dubious subprocess creations (at least, non-standard ones) and will read around to see if I can find any information on this. Is this something you've seen in Python before?
Doing a brief mock-up, even within the same thread, I don't see this behaviour immediately:
>>> import subprocess
>>> import os
>>> p = subprocess.Popen(["bash", "-c", "sleep 30"], cwd="/tmp")
>>> os.getcwd()
'/home/user'
>>> p.wait()
...
0
You mention that you're using EFS - what is the CWD of the container in normal conditions? (Has it been changed to a mount-point of EFS mounted in a temp directory?)
Many thanks Matt
Yeah, I also saw the same thing. So that's out.
The tmp dir is local within the container, so I doubt there's any weirdness with EFS.
It's really strange to me. I set up an strace on the main PID which followed forks; chdir or execve was never called on the main process, but eventually the cwd for the main PID gets set to one of the created temp dirs. I can find nothing in the code or syscalls which points me to a problem, but I can reliably reproduce it.
I'm going to keep digging and will hopefully report back with findings soon.
Here's a brief overview of the process:
There are some spurious missing-dir errors along the way, but many versions get loaded and published. I had a watch set up on /proc/MAIN_PID/cwd to check what it was set to; occasionally it would cut over to a /tmp/tmp1234-type dir, but usually went back to /app. Eventually the cwd changed to a temp dir again, but that temp dir gets removed, and all subsequent calls to os.getcwd() fail, causing the server to stop functioning for these API calls until it's redeployed.
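The watch described above can be approximated in Python (Linux-only, since it relies on the /proc filesystem; the PID argument is whatever main process you want to monitor):

```python
import os
import time


def watch_cwd(pid, interval=1.0):
    """Print a line whenever the target process's cwd changes.

    Reads the /proc/<pid>/cwd symlink (Linux); once the directory has
    been deleted, the link target is reported with a ' (deleted)' suffix,
    matching the output shown earlier in this thread.
    """
    last = None
    while True:
        cwd = os.readlink(f"/proc/{pid}/cwd")
        if cwd != last:
            print(time.strftime("%H:%M:%S"), cwd)
            last = cwd
        time.sleep(interval)
```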
Aha, I may have made some progress :)
After looking for possible things that could change the directory, another came to mind - it's actually shutil.make_archive!
mkdir /tmp/tempdirhere
dd if=/dev/urandom of=/tmp/tempdirhere/large.blob bs=1M count=1000
cat > test_make_archive.py <<'EOF'
#!/usr/bin/env python3
import shutil
from threading import Thread
import time
import os

def zip_file():
    shutil.make_archive('blah', 'zip', '/tmp/tempdirhere')

thread = Thread(target=zip_file)
thread.start()
time.sleep(1)
print(os.getcwd())
thread.join()
EOF
python ./test_make_archive.py
/tmp/tempdirhere
This should be a fairly easy workaround, so I can get on with a fix this evening :) I'll also take a close look at the documentation for this function - I must have missed a note about this, so apologies
Many thanks Matt
Edit: And, yes, they mention that it's not thread-safe because it changes directory: https://docs.python.org/3/library/shutil.html#shutil.make_archive
Though interestingly (I've used this method for quite some time in projects), in Python 2.7 it wasn't marked as being unsafe for multithreading (https://docs.python.org/2.7/library/shutil.html#shutil.make_archive) - not sure if it's a lack of documentation or whether the functionality changed.
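For reference, one chdir-free workaround (a sketch only - not necessarily what the linked PR does) is to drive the zipfile module directly and walk the tree yourself, so no thread ever changes the process cwd:

```python
import os
import zipfile


def make_zip(archive_path, root_dir):
    # Store members relative to root_dir, as shutil.make_archive would,
    # but without the os.chdir(root_dir) that makes it thread-unsafe.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, root_dir))
```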
Hey @pseudomorph, I've created an upstream issue (https://gitlab.dockstudios.co.uk/pub/terrareg/-/issues/405) and I've created a PR with a potential fix (https://gitlab.dockstudios.co.uk/pub/terrareg/-/merge_requests/339).
I've realised that the testing around the generated archive is quite lax, so I need to work on adding some tests for this, which should probably be done before merging. Though, if you want/need the fix early, feel free to give it a spin :)
Alternatively, since you're using a git provider, the generated archive will not be used, as the git URL is served to Terraform when obtaining the module. The zip functionality can be disabled by setting the DELETE_EXTERNALLY_HOSTED_ARTIFACTS configuration :)
Many thanks Matt
Edit: I've now implemented a test to ensure the zip/tar files are generated correctly :)
Great find! I'll run the patch and see how it works. Thanks so much for your help!
Looks like the output from zip gets put into the logs (also terraform output gets put into the logs by default too). Should this be conditional on log level? Perhaps only under debug or trace?
So far the patch works swimmingly!
> Looks like the output from zip gets put into the logs (also terraform output gets put into the logs by default too). Should this be conditional on log level? Perhaps only under debug or trace?
I'm afraid the log handling of the entire project is in a dire state - it needs to be tackled, but there's no foundation at the moment for logging at a particular level. I'll ensure there's a ticket so that it gets sorted.
> So far the patch works swimmingly!
Excellent :)
I've found this in the backlog: https://gitlab.dockstudios.co.uk/pub/terrareg/-/issues/382 which should take care of the logging/stdout from subprocesses :)
I've now merged this PR and the fix is available in https://github.com/MatthewJohn/terrareg/releases/tag/v2.71.2 :)
Thank you for the quick attention and resolution on this!
No problem :) Are you happy for me to close this? If not, feel free to re-open :)
I've set up an instance of Terrareg with ~15 different modules backed by Git repositories. When attempting to load versions, one at a time per-module but in parallel across all modules, the server eventually gets into a state where it cannot do any version loads, throwing the following error for each attempt.
Using a MySQL DB backend, and EFS for the module data directory.
Looking at an strace, it seems like os.getcwd is getting called on a directory which no longer exists. The issue seems to go away with a server re-deploy. I'm still digging into the issue, but wanted to file this. The loads do work for some time, until this error happens, then they all fail.
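For completeness, the failure mode itself is easy to reproduce in isolation: on Linux, os.getcwd() raises FileNotFoundError once the directory the process is sitting in has been unlinked, which matches the behaviour seen on the server:

```python
import os
import tempfile

original = os.getcwd()
doomed = tempfile.mkdtemp()
os.chdir(doomed)
os.rmdir(doomed)  # the process cwd no longer exists
try:
    os.getcwd()
except FileNotFoundError as exc:
    print("os.getcwd() failed:", exc)
finally:
    os.chdir(original)  # restore a valid cwd
```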