celery / billiard

Multiprocessing Pool Extensions

SemLock leak: fill /dev/shm when creating and deleting multiple Pools #293

Open thomas-riccardi opened 4 years ago

thomas-riccardi commented 4 years ago

Initial issue:

OSError: [Errno 28] No space left on device
  File "xxx.py", line 1029, in main
    pool = Pool(nb_processes)
  File "billiard/pool.py", line 995, in __init__
    self._setup_queues()
  File "billiard/pool.py", line 1364, in _setup_queues
    self._inqueue = self._ctx.SimpleQueue()
  File "billiard/context.py", line 150, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "billiard/queues.py", line 377, in __init__
    self._rlock = ctx.Lock()
  File "billiard/context.py", line 105, in Lock
    return Lock(ctx=self.get_context())
  File "billiard/synchronize.py", line 182, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "billiard/synchronize.py", line 72, in __init__
    kind, value, maxvalue, self._make_name(), unlink_now,

(with billiard==3.6.1.0)

Reproduction

import billiard
import multiprocessing
import os
import subprocess

DEBUG = False

def check_shm():
    # Count the /dev/shm files still mapped into this process (the leaked
    # semaphore files are unlinked, so they only show up via map_files),
    # and report the tmpfs usage of /dev/shm (df "Used", in 1K blocks).
    out = subprocess.check_output('ls -la /proc/{}/map_files/ | grep /dev/shm/ || :'.format(os.getpid()), shell=True)
    lines = out.splitlines()
    if DEBUG:
        print('Found {} files in /dev/shm from current process:'.format(len(lines)))
        for line in lines:
            print(line.decode())
    df_out = subprocess.check_output("df /dev/shm | tail -1 | awk '{print $3}'", shell=True)
    df_lines = df_out.splitlines()
    return (len(lines), int(df_lines[0].decode()))

def repro(Pool):
    # Repeatedly create, close, join and delete a 1-worker Pool, printing
    # the /dev/shm state returned by check_shm() after each step.
    for i in range(1, 10):
        name = f'{Pool.__module__}.{Pool.__name__} {i}'
        print(f'{name} start {check_shm()}')
        p = Pool(1)
        print(f'{name} created {check_shm()}')
        p.close()
        p.join()
        print(f'{name} joined {check_shm()}')
        del p
        print(f'{name} deleted {check_shm()}')

repro(multiprocessing.Pool)
repro(billiard.Pool)

Result:

multiprocessing.context.Pool 1 start (0, 0)
multiprocessing.context.Pool 1 created (4, 16)
multiprocessing.context.Pool 1 joined (4, 16)
multiprocessing.context.Pool 1 deleted (0, 0)
multiprocessing.context.Pool 2 start (0, 0)
multiprocessing.context.Pool 2 created (4, 16)
multiprocessing.context.Pool 2 joined (4, 16)
multiprocessing.context.Pool 2 deleted (0, 0)
multiprocessing.context.Pool 3 start (0, 0)
multiprocessing.context.Pool 3 created (4, 16)
multiprocessing.context.Pool 3 joined (4, 16)
multiprocessing.context.Pool 3 deleted (0, 0)
multiprocessing.context.Pool 4 start (0, 0)
multiprocessing.context.Pool 4 created (4, 16)
multiprocessing.context.Pool 4 joined (4, 16)
multiprocessing.context.Pool 4 deleted (0, 0)
multiprocessing.context.Pool 5 start (0, 0)
multiprocessing.context.Pool 5 created (4, 16)
multiprocessing.context.Pool 5 joined (4, 16)
multiprocessing.context.Pool 5 deleted (0, 0)
multiprocessing.context.Pool 6 start (0, 0)
multiprocessing.context.Pool 6 created (4, 16)
multiprocessing.context.Pool 6 joined (4, 16)
multiprocessing.context.Pool 6 deleted (0, 0)
multiprocessing.context.Pool 7 start (0, 0)
multiprocessing.context.Pool 7 created (4, 16)
multiprocessing.context.Pool 7 joined (4, 16)
multiprocessing.context.Pool 7 deleted (0, 0)
multiprocessing.context.Pool 8 start (0, 0)
multiprocessing.context.Pool 8 created (4, 16)
multiprocessing.context.Pool 8 joined (4, 16)
multiprocessing.context.Pool 8 deleted (0, 0)
multiprocessing.context.Pool 9 start (0, 0)
multiprocessing.context.Pool 9 created (4, 16)
multiprocessing.context.Pool 9 joined (4, 16)
multiprocessing.context.Pool 9 deleted (0, 0)
billiard.context.Pool 1 start (0, 0)
billiard.context.Pool 1 created (5, 20)
billiard.context.Pool 1 joined (5, 20)
billiard.context.Pool 1 deleted (5, 20)
billiard.context.Pool 2 start (5, 20)
billiard.context.Pool 2 created (10, 40)
billiard.context.Pool 2 joined (10, 40)
billiard.context.Pool 2 deleted (10, 40)
billiard.context.Pool 3 start (10, 40)
billiard.context.Pool 3 created (15, 60)
billiard.context.Pool 3 joined (15, 60)
billiard.context.Pool 3 deleted (15, 60)
billiard.context.Pool 4 start (15, 60)
billiard.context.Pool 4 created (20, 80)
billiard.context.Pool 4 joined (20, 80)
billiard.context.Pool 4 deleted (20, 80)
billiard.context.Pool 5 start (20, 80)
billiard.context.Pool 5 created (25, 100)
billiard.context.Pool 5 joined (25, 100)
billiard.context.Pool 5 deleted (25, 100)
billiard.context.Pool 6 start (25, 100)
billiard.context.Pool 6 created (30, 120)
billiard.context.Pool 6 joined (30, 120)
billiard.context.Pool 6 deleted (30, 120)
billiard.context.Pool 7 start (30, 120)
billiard.context.Pool 7 created (35, 140)
billiard.context.Pool 7 joined (35, 140)
billiard.context.Pool 7 deleted (35, 140)
billiard.context.Pool 8 start (35, 140)
billiard.context.Pool 8 created (40, 160)
billiard.context.Pool 8 joined (40, 160)
billiard.context.Pool 8 deleted (40, 160)
billiard.context.Pool 9 start (40, 160)
billiard.context.Pool 9 created (45, 180)
billiard.context.Pool 9 joined (45, 180)
billiard.context.Pool 9 deleted (45, 180)

Analysis

Files in /dev/shm leak: they are in fact marked as deleted (they don't appear in ls /dev/shm) and remain only memory-mapped into the process. df clearly shows that the disk space is indeed used and filling up.
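The same state can be inspected without shelling out; a minimal sketch (assuming Linux, where the unlinked semaphore files still appear in /proc/self/maps marked "(deleted)"):

import os

# /dev/shm-backed mappings of the current process; the leaked semaphore
# files show up here marked "(deleted)" even though ls /dev/shm is empty.
with open('/proc/self/maps') as f:
    for line in f:
        if '/dev/shm/' in line:
            print(line.rstrip())

# tmpfs usage of /dev/shm in KiB, equivalent to the df column used above.
st = os.statvfs('/dev/shm')
print('used KiB:', (st.f_blocks - st.f_bfree) * st.f_frsize // 1024)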

I tried with various versions of Python and billiard:

This is an issue in Docker/Kubernetes, where the default /dev/shm is only 64 MB per container and can therefore fill up.
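A rough back-of-the-envelope estimate, using the numbers from the repro output above (the df column is in 1K blocks, so each leaked billiard Pool keeps about 20 KiB of /dev/shm mapped):

# Rough estimate only; the 20 KiB/Pool figure is taken from the repro output.
leak_per_pool_kib = 20          # df "Used" delta per billiard Pool
shm_size_kib = 64 * 1024        # default /dev/shm in a Docker container
print(shm_size_kib // leak_per_pool_kib)  # ~3276 create/delete cycles until full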

Source analysis

I tried to compare the billiard and multiprocessing sources and could not find any relevant difference in semaphore.c or in the Python call stack where the SemLock is created...

I then assumed some object was still referencing the SemLock in the billiard case and checked with gc.get_referrers(): I could not find any difference with multiprocessing...
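A simplified sketch of that kind of check (the SemLock class comes from billiard/synchronize.py, as shown in the traceback above; this is illustrative, not the exact script used):

import gc
import billiard
from billiard import synchronize

p = billiard.Pool(1)
p.close()
p.join()
del p
gc.collect()

# Look for SemLock instances (the billiard Lock/Semaphore wrappers) that are
# still alive after the Pool is gone, and print whatever still references them.
for obj in gc.get_objects():
    if isinstance(obj, synchronize.SemLock):
        print(obj)
        for ref in gc.get_referrers(obj):
            print('  referrer:', type(ref))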

auvipy commented 4 years ago

Why are you using multiprocessing with billiard?

thomas-riccardi commented 4 years ago

@auvipy

Why are you using multiprocessing with billiard?

I am not; I am just using billiard (in a Celery worker), and it leaks in /dev/shm.

I then wrote a minimal repro script to show that the issue occurs only with billiard, not with multiprocessing. (And I cannot use multiprocessing because of https://github.com/celery/celery/issues/4551#issuecomment-367607234.)

Sorry I was not clear in my initial bug report.

hbradlow commented 1 year ago

Is there any update on this? I am having the same issue.