Safe2COVIDApp / bct-server

Bluetooth Contact Tracing for Covid19 - server

Scaling crash: Too many open files #168

Closed mitra42 closed 4 years ago

mitra42 commented 4 years ago

Looks like we have a problem when we exceed some number of threads. See the trace in the next comment.

mitra42 commented 4 years ago

Logging an uncaught exception

    Traceback (most recent call last):
      File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
        self.run()
      File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/lib/python3.7/site-packages/twisted/_threads/_threadworker.py", line 46, in work
        task()
      File "/usr/local/lib/python3.7/site-packages/twisted/_threads/_team.py", line 190, in doWork
        task()
    --- <exception caught here> ---
      File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 250, in inContext
        result = inContext.theWork()
      File "/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
        inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
      File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 122, in callWithContext
        return self.currentContext().callWithContext(ctx, func, *args, **kw)
      File "/usr/local/lib/python3.7/site-packages/twisted/python/context.py", line 85, in callWithContext
        return func(*args,**kw)
      File "server.py", line 118, in _deferred_function
        result = function()
      File "/Users/mitra/git/bct-server2/contacts.py", line 852, in get_location_id_data
        return list(self.spatial_dict.retrieve_json_from_file_paths(locations_file_path))
      File "/Users/mitra/git/bct-server2/contacts.py", line 233, in retrieve_json_from_file_paths
        yield self.retrieve_json_from_file_path(file_path)
      File "/Users/mitra/git/bct-server2/contacts.py", line 226, in retrieve_json_from_file_path
        raise e # Put a breakpoint here if seeing this fail
      File "/Users/mitra/git/bct-server2/contacts.py", line 224, in retrieve_json_from_file_path
        return json.load(open(self.directory + '/' + file_path))
    builtins.OSError: [Errno 24] Too many open files: '/tmp/spatial_dict/02/32/80/0232804650:1590901248.964945:0.data'

mitra42 commented 4 years ago

@danaronson - any thoughts? On a similar JavaScript problem there was a nice queue class that runs only a certain number of tasks at once and calls back when each completes. Is there anything similar for Python that you have used? Alternatively we could back off and retry. Luckily we only have file reads in a couple of places, so it might not be hard to catch them all.
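One possible Python counterpart to the queue described above is a semaphore that caps how many file reads are in flight at once. The following is a minimal sketch, assuming the reads run on worker threads (as the trace suggests); `MAX_OPEN_READS` and `load_json_limited` are hypothetical names, not part of bct-server, and the cap of 64 is an arbitrary placeholder.

```python
import json
import threading

# Hypothetical cap on simultaneous file reads; tune it against the
# process's descriptor limit and the number of open sockets.
MAX_OPEN_READS = 64
_read_slots = threading.BoundedSemaphore(MAX_OPEN_READS)


def load_json_limited(path):
    """Read and parse a JSON file while holding one of the read slots."""
    with _read_slots:          # Blocks while MAX_OPEN_READS reads are already in progress
        with open(path) as f:  # The file is closed before the slot is released
            return json.load(f)
```

Because callers block instead of failing, no retry logic is needed for reads that go through this function, although sockets and other descriptors still count against the same process limit.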

danaronson commented 4 years ago

What was the trigger for this bug? How many simultaneous connections? Trying to figure out if it was a combination of open sockets + open files.
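To judge whether sockets plus files could together exhaust the limit, it may help to compare the process's current descriptor count with its soft limit. This is a rough sketch for Unix-like systems only, assuming `/dev/fd` is available (it normally is on macOS and Linux); `descriptor_usage` is a hypothetical helper name.

```python
import os
import resource  # Unix-only standard library module


def descriptor_usage():
    """Return (open_count, soft_limit) for the current process."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /dev/fd lists this process's open descriptors; the count includes
    # the descriptor temporarily opened to perform the listing itself.
    open_count = len(os.listdir("/dev/fd"))
    return open_count, soft_limit
```

Logging this pair when the error fires would show how close the server runs to the limit, since every client socket and every open data file counts against the same number.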

mitra42 commented 4 years ago

I was running a larger multi-client test than before (200 clients, 200 steps). I'm guessing that at times it just has too many queries in progress simultaneously. I put a backoff in there and it's behaving better: I'm part way through the test, it hits this about once a minute, and it then works at the second attempt. But I'm wondering if there is a more pythonic way that doesn't complicate stuff too much...

    # Assumes json, random and time are imported at module level
    def retrieve_json_from_file_path(self, file_path):
        full_path = self.directory + '/' + file_path
        max_tries = 100
        while True:  # Exits via return or raise
            max_tries -= 1
            try:
                with open(full_path) as f:  # Close the file promptly, even if parsing fails
                    return json.load(f)
            except Exception as e:
                logger.error("Error in retrieve_json_from_file_path {file_path} {e}", file_path=full_path, e=str(e))
                if max_tries == 0:
                    raise  # Put a breakpoint here if seeing this fail
                time.sleep(random.uniform(0, 0.500))  # Wait a little while and try again

danaronson commented 4 years ago

well, you could use recursion to clean it up a bit, something like:

    def retrieve_json_from_file_path(self, file_path, attempts=100):
        full_path = self.directory + '/' + file_path
        try:
            with open(full_path) as f:  # Close the file promptly, even if parsing fails
                return json.load(f)
        except Exception as e:
            logger.error("Error in retrieve_json_from_file_path {file_path} {e}", file_path=full_path, e=str(e))
            if attempts == 0:
                raise  # Out of attempts: re-raise the original exception
        # Only reached after a failed attempt with retries remaining
        time.sleep(random.uniform(0, 0.500))  # Wait a little while and try again
        return self.retrieve_json_from_file_path(file_path, attempts - 1)

mitra42 commented 4 years ago

Sure - but recursion in something that is effectively a long loop can be a memory hog, can't it?

danaronson commented 4 years ago

Yes, recursion is always going to take more memory, since each retry adds a frame to the call stack, though with at most 100 attempts the cost is small here. Either is fine; I often lean towards recursion for readability. Your call. Some languages optimize tail recursion and turn it into a loop. I don't think CPython does.
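For comparison with the loop and recursive versions discussed above, the backoff can also be factored into a small loop-based helper that retries only on "Too many open files" and lets every other error through immediately. This is a sketch under assumptions: `retry_on_emfile` and its parameters are hypothetical names, and the defaults simply mirror the 100 tries and 0.5 second maximum delay used earlier in the thread.

```python
import errno
import random
import time


def retry_on_emfile(operation, attempts=100, max_delay=0.5):
    """Call operation() and retry only when it fails with EMFILE (errno 24)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for remaining in range(attempts - 1, -1, -1):
        try:
            return operation()
        except OSError as e:
            # Re-raise anything other than "Too many open files",
            # and re-raise EMFILE too once the attempts are used up.
            if e.errno != errno.EMFILE or remaining == 0:
                raise
            time.sleep(random.uniform(0, max_delay))  # Random backoff before the next attempt
```

A caller would wrap the read in a callable, for example `retry_on_emfile(lambda: load(path))` where `load` is whatever function opens and parses the file. A missing or malformed file then fails on the first attempt instead of being retried 100 times.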

mitra42 commented 4 years ago

The backoff worked, and the error didn't happen often enough to justify a more complex solution.