MatthewJohn / terrareg

Open source Terraform module registry with UI, optional Git integration and deep analysis
https://gitlab.dockstudios.co.uk/pub/terrareg
GNU General Public License v3.0
268 stars · 20 forks

Bug: server gets into weird state when doing massive parallel version loads #9

Closed pseudomorph closed 1 year ago

pseudomorph commented 1 year ago

I've set up an instance of Terrareg with ~15 different modules backed by Git repositories. When attempting to load versions, one at a time per-module but in parallel across all modules, the server eventually gets into a state where it cannot do any version loads, throwing the following error for each attempt.

Using a MySQL DB backend, and EFS for the module data directory.

  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2548, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/flask/views.py", line 107, in view
    return current_app.ensure_sync(self.dispatch_request)(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/app/terrareg/auth_wrapper.py", line 32, in wrapper
    return func(*args, **kwargs)
  File "/app/terrareg/server/error_catching_resource.py", line 41, in post
    return self._post(*args, **kwargs)
  File "/app/terrareg/server/api/module_version_create.py", line 27, in _post
    previous_version_published = module_version.prepare_module()
  File "/app/terrareg/models.py", line 2990, in prepare_module
    self.create_data_directory()
  File "/app/terrareg/models.py", line 2874, in create_data_directory
    if not os.path.isdir(self._module_provider.base_directory):
  File "/app/terrareg/models.py", line 1582, in base_directory
    return safe_join_paths(self._module.base_directory, self._name)
  File "/app/terrareg/models.py", line 891, in base_directory
    return safe_join_paths(self._namespace.base_directory, self._name)
  File "/app/terrareg/models.py", line 756, in base_directory
    return safe_join_paths(terrareg.config.Config().DATA_DIRECTORY, 'modules', self._name)
  File "/app/terrareg/config.py", line 82, in DATA_DIRECTORY
    return os.path.join(os.environ.get('DATA_DIRECTORY', os.getcwd()), 'data')
FileNotFoundError: [Errno 2] No such file or directory

Looking at an strace, it seems like os.getcwd is getting called on a directory which no longer exists. The issue seems to go away with a server re-deploy. I'm still digging into the issue, but wanted to file this.

The loads do work for some time, until this error happens; then they all fail.
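
For reference, a minimal illustration (not terrareg code) of how os.getcwd() behaves once the process's working directory has been removed - it raises exactly this error:

import os
import tempfile

# Create a temp dir, make it the CWD, then remove it out from under the process
d = tempfile.mkdtemp()
os.chdir(d)
os.rmdir(d)
os.getcwd()  # FileNotFoundError: [Errno 2] No such file or directory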

pseudomorph commented 1 year ago

It seems like the main process's CWD at some point gets set to one of the temp dirs, which is then deleted, causing all subsequent threads or operations to fail when calling os.getcwd.

Found this for the main process after the event occurred:

/proc/2345/cwd -> /tmp/tmpnrly7tay (deleted)
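
For anyone checking the same thing from Python rather than the shell (PID 2345 is the main server process here), the /proc symlink can be read directly; Linux keeps the "(deleted)" suffix once the target is gone:

import os

print(os.readlink("/proc/2345/cwd"))  # e.g. '/tmp/tmpnrly7tay (deleted)'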

pseudomorph commented 1 year ago

My guess would be that passing the cwd= param in subprocess is changing the cwd for the main process, which leaves it vulnerable if some subprocess encounters an error.

I imagine that using fully-qualified paths in the shell-out commands, and terraform -chdir=<path> <command> (as this doesn't change the CWD for the main Python process), would alleviate this.

--

Or, replacing the os.getcwd calls with absolute references.
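
A quick sketch of the -chdir approach (run_terraform is a hypothetical helper here, not terrareg's actual code) - the subprocess operates inside the module directory without the parent process's CWD being involved at all:

import subprocess

def run_terraform(module_path, *args):
    # -chdir is a global terraform option; the parent process's working
    # directory is never touched, and no cwd= is passed to subprocess.
    return subprocess.run(
        ["terraform", f"-chdir={module_path}", *args],
        check=True, capture_output=True, text=True,
    )

# e.g. run_terraform("/tmp/some-extracted-module", "init")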

MatthewJohn commented 1 year ago

Hey @pseudomorph,

Thank you for reporting this. It's interesting indeed - I'm surprised that spawning a subprocess would change the CWD of the parent process. I'll have a look through for any instances of dubious subprocess creations (at least, non-standard ones) and will take a read to see if I can find any information on this. Is this something you've seen in Python before?

Doing a brief mock-up, even within the same thread, I don't see this behaviour immediately:

>>> import subprocess
>>> import os
>>> p = subprocess.Popen(["bash", "-c", "sleep 30"], cwd="/tmp")
>>> os.getcwd()
'/home/user'
>>> p.wait()
...
0

You mention that you're using EFS - what is the CWD of the container in normal conditions? (Has it been changed to a mount-point of EFS mounted in a temp directory?)

Many thanks Matt

pseudomorph commented 1 year ago

Yeah, I also saw the same thing. So that's out.

The tmp dir is local within the container, so I doubt there's any weirdness with EFS.

It's really strange to me. I set up strace on the main PID, following forks; chdir and execve were never called on the main process, but eventually the CWD of the main PID gets set to one of the created temp dirs. I can find nothing in the code or syscalls which points me to a problem, but I can reliably reproduce it.

pseudomorph commented 1 year ago

I'm going to keep digging and will hopefully report back with findings soon.

Here's a brief overview of the process:

There are some spurious missing-dir errors along the way, but many versions get loaded and published. I had a watch set up on /proc/MAIN_PID/cwd to check what it was set to; occasionally it would cut over to a /tmp/tmp1234-type dir, but usually it went back to /app. Eventually the CWD changed to a temp dir again, that temp dir was removed, and all subsequent calls to os.getcwd() failed, causing the server to stop functioning for these API calls until it was redeployed.

MatthewJohn commented 1 year ago

Aha, I may have made some progress :)

After looking for possible things that could change the directory, another came to mind - it's actually shutil.make_archive!

mkdir /tmp/tempdirhere
dd if=/dev/urandom of=/tmp/tempdirhere/large.blob bs=1M count=1000

cat > test_make_archive.py <<'EOF'
#!/usr/bin/env python3

import shutil
from threading import Thread
import time
import os

def zip_file():
    shutil.make_archive('blah', 'zip', '/tmp/tempdirhere')

thread = Thread(target=zip_file)
thread.start()
time.sleep(1)
print(os.getcwd())
thread.join()

EOF

python ./test_make_archive.py 
/tmp/tempdirhere

This should be a fairly easy workaround, so I can get on with a fix this evening :) I'll also take a closer look at the documentation for this function - I must have missed a note about this, so apologies.

Many thanks Matt

Edit: And, yes, they mention that it's not thread-safe because it changes directory: https://docs.python.org/3/library/shutil.html#shutil.make_archive

Though interestingly (I've used this method for quite some time in projects), in Python 2.7 it wasn't marked as being unsafe for multithreading (https://docs.python.org/2.7/library/shutil.html#shutil.make_archive) - not sure if it's a lack of documentation or whether the functionality changed.
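
For illustration, a minimal thread-safe alternative (a sketch only, not the fix in the MR above): building the archive with zipfile directly means no os.chdir() ever happens in the parent process:

import os
import zipfile

def make_zip(source_dir, output_path):
    # Zip source_dir without changing the process's working directory
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store entries relative to source_dir, like make_archive's root_dir
                zf.write(full_path, os.path.relpath(full_path, source_dir))

# e.g. make_zip("/tmp/tempdirhere", "/tmp/blah.zip")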

MatthewJohn commented 1 year ago

Hey @pseudomorph, I've created an upstream issue (https://gitlab.dockstudios.co.uk/pub/terrareg/-/issues/405) and I've created a PR with a potential fix (https://gitlab.dockstudios.co.uk/pub/terrareg/-/merge_requests/339).

I've realised that the testing around the generated archive is quite lax, so I need to work on adding some tests for this, which should probably be done before merging. Though, if you want/need the fix early, feel free to give it a spin :)

Alternatively, since you're using a git provider, the generated archive will not be used, as the git URL is served to Terraform when obtaining the module. The zip functionality can be disabled by setting the DELETE_EXTERNALLY_HOSTED_ARTIFACTS configuration :)

Many thanks Matt

Edit: I've now implemented a test to ensure the zip/tar files are generated correctly :)

pseudomorph commented 1 year ago

Great find! I'll run the patch and see how it works. Thanks so much for your help!

pseudomorph commented 1 year ago

Looks like the output from zip gets put into the logs (terraform output also gets put into the logs by default). Should this be conditional on log level? Perhaps only under debug or trace?

pseudomorph commented 1 year ago

So far the patch works swimmingly!

MatthewJohn commented 1 year ago

Looks like the output from zip gets put into the logs (terraform output also gets put into the logs by default). Should this be conditional on log level? Perhaps only under debug or trace?

I'm afraid the log handling of the entire project is in a dire state - it needs to be tackled, but there isn't a foundation at the moment for logging at a particular level. I'll ensure there's a ticket so that it gets sorted.

So far the patch works swimmingly!

Excellent :)

MatthewJohn commented 1 year ago

I've found this in the backlog: https://gitlab.dockstudios.co.uk/pub/terrareg/-/issues/382 which should take care of the logging/stdout from subprocesses :)

MatthewJohn commented 1 year ago

I've now merged this PR and the fix is available in https://github.com/MatthewJohn/terrareg/releases/tag/v2.71.2 :)

pseudomorph commented 1 year ago

Thank you for the quick attention and resolution on this!

MatthewJohn commented 1 year ago

No problem :) Are you happy for me to close this? If not, feel free to re-open :)