celery / billiard

Multiprocessing Pool Extensions
Other
410 stars 249 forks source link

Regression in 3.6.1 - Permission Denied #288

Open AvnerCohen opened 5 years ago

AvnerCohen commented 5 years ago

On CentOs, we clearly see a regression as a result of the recent billiard update:

...
...
  File "/base_folder/venv3.7.2/lib/python3.7/site-packages/billiard/sharedctypes.py", line 37, in _new_value
    wrapper = heap.BufferWrapper(size)
  File "/base_folder/venv3.7.2/lib/python3.7/site-packages/billiard/heap.py", line 274, in __init__
    block = BufferWrapper._heap.malloc(size)
  File "/base_folder/venv3.7.2/lib/python3.7/site-packages/billiard/heap.py", line 255, in malloc
    (arena, start, stop) = self._malloc(size)
  File "/base_folder/venv3.7.2/lib/python3.7/site-packages/billiard/heap.py", line 152, in _malloc
    arena = Arena(length)
  File "/base_folder/venv3.7.2/lib/python3.7/site-packages/billiard/heap.py", line 79, in __init__
    dir=util.get_temp_dir(),
  File "/base_folder/.asdf/installs/python/3.7.2/lib/python3.7/tempfile.py", line 340, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/base_folder/.asdf/installs/python/3.7.2/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
PermissionError: [Errno 13] Permission denied: '/tmp/pymp-j0t8dkfk/pym-4118-x1vo2w4v'

Rolling back and forcing pip install billiard==3.6.0.0 solves the issue.

I went over the changes, between the versions, but couldn't really find the exact culprit.

auvipy commented 5 years ago

https://github.com/celery/billiard/compare/v3.6.0...master this are the changes.

thedrow commented 5 years ago

You'll have to provide us with more details. Can you reproduce this locally? Does the problem reproduce on all servers? Did you check user permissions? Could it be that /tmp is not accessible to your user?

AvnerCohen commented 5 years ago

Sure. will provide everything needed, was hoping this would point to a specific change and ring a bell. I'll try to create a proper repro example.

Starou commented 4 years ago

Hi,

I got a similar issue when upgrading from Celery 4.2.2 / Billiard 3.5.0 to Celery 4.4.2 / Billiard 3.6.3.0 on a project and try to execute a debugging task from django-admin shell:

from paperboy.celery import send_debug_mail                                                                                                                                                
send_debug_mail.delay()                                                                                                                                                                    <AsyncResult: 19a2eb4d-7044-41bf-b1d4-7236c905d770>
tail -n 100 var/log/celery/w1.log
[2020-03-25 10:55:37,753: CRITICAL/MainProcess] Unrecoverable error: PermissionError(13, 'Permission denied')
Traceback (most recent call last):
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/celery/concurrency/base.py", line 131, in start
    self.on_start()
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/celery/concurrency/prefork.py", line 112, in on_start
    **self.options)
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 428, in __init__
    super(AsynPool, self).__init__(processes, *args, **kwargs)
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/billiard/pool.py", line 1046, in __init__
    self._create_worker_process(i)
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 445, in _create_worker_process
    return super(AsynPool, self)._create_worker_process(i)
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/billiard/pool.py", line 1142, in _create_worker_process
    on_ready_counter = self._ctx.Value('i')
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/billiard/context.py", line 182, in Value
    ctx=self.get_context())
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/billiard/sharedctypes.py", line 83, in Value
    lock = ctx.RLock()
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/billiard/context.py", line 110, in RLock
    return RLock(ctx=self.get_context())
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/billiard/synchronize.py", line 207, in __init__
    SemLock.__init__(self, RECURSIVE_MUTEX, 1, 1, ctx=ctx)
  File "/home/foo/venv_foo_py3/lib/python3.6/site-packages/billiard/synchronize.py", line 72, in __init__
    kind, value, maxvalue, self._make_name(), unlink_now,
PermissionError: [Errno 13] Permission denied

To get It works I must downgrade billiard to 3.6.0 and it fails starting with 3.6.1.

Looks like the cause is that new line in pool._create_worker_process().

I have upgraded a similar project yesterday (same arch, confs, deps, versions etc) without any trouble so maybe this is a problem of resources on the OS.

Any hint on where to look for the file/directory causing the Permission denied exception?

Thanks

thedrow commented 4 years ago

I suggest using strace to see which files are being opened.

Starou commented 4 years ago

Thanks @thedrow.

Here the interesting bits in the trace when I run strace over the celery command (instead of the daemon):

stat("/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/synchronize.py", {st_mode=S_IFREG|0644, st_size=13041, ...}) = 0
open("/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/__pycache__/synchronize.cpython-36.pyc", O_RDONLY|O_CLOEXEC) = 19
fstat(19, {st_mode=S_IFREG|0644, st_size=12086, ...}) = 0
lseek(19, 0, SEEK_CUR)                  = 0
fstat(19, {st_mode=S_IFREG|0644, st_size=12086, ...}) = 0
read(19, "3\r\r\n\275d|^\3612\0\0\343\0\0\0\0\0\0\0\0\0\0\0\0\22\0\0\0@\0\0"..., 12087) = 12086
read(19, "", 1)                         = 0
close(19)                               = 0
stat("/home/paperboy", {st_mode=S_IFDIR|0750, st_size=4096, ...}) = 0
stat("/home/paperboy/venv_paperboy_py3/bin", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/paperboy/projects", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/usr/local/lib/python3.6", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
stat("/usr/local/lib/python3.6/lib-dynload", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
read(3, "<\235)[\310\3655\264z\r\370\322\307\37O\342\353\215\251\224\327\203d\314\245\222\250\213Z\275h\206"..., 2496) = 2496
fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
read(3, "\311.l\301\265'\300\314\270\316\227\10\352d\210\206\354\274G\357?\342S\21u\242\330\244\273\304\255\247"..., 2496) = 2496
statfs("/dev/shm", {f_type=0x1021994, f_bsize=4096, f_blocks=51530, f_bfree=51400, f_bavail=51400, f_files=257646, f_ffree=257355, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0
futex(0x7f40f4f182dc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
gettimeofday({1585210644, 332329}, NULL) = 0
lstat("/dev/shm/sem.p8QdiC", 0x7fff0c60b600) = -1 ENOENT (No such file or directory)
open("/dev/shm/sem.p8QdiC", O_RDWR|O_CREAT|O_EXCL, 0600) = -1 EACCES (Permission denied)

And the Celery output:

~/venv_paperboy/bin/celery  worker -A paperboy -l INFO
/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/backends/amqp.py:67: CPendingDeprecationWarning: 
    The AMQP result backend is scheduled for deprecation in     version 4.0 and removal in version v5.0.     Please use RPC backend or a persistent backend.

  alternative='Please use RPC backend or a persistent backend.')

 -------------- celery@rpaperboy02 v4.4.2 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-3.2.0-4-amd64-x86_64-with-debian-7.11 2020-03-26 09:21:56
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         paperboy:0x7f80542724e0
- ** ---------- .> transport:   amqp://guest:**@localhost:5672//
- ** ---------- .> results:     amqp://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery

[tasks]
  . paperboy.celery.debug_task
  . paperboy.celery.send_debug_mail
  . paperboy.composition.tasks.compose_dossiers
  . paperboy.composition.tasks.compose_pages_chemin_de_fer
  . paperboy.composition.tasks.compose_pre_bat_dossier
  . paperboy.composition.tasks.compose_pre_bat_page

[2020-03-26 09:21:56,335: CRITICAL/MainProcess] Unrecoverable error: PermissionError(13, 'Permission denied')
Traceback (most recent call last):
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/concurrency/base.py", line 132, in start
    self.on_start()
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/concurrency/prefork.py", line 117, in on_start
    **self.options)
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 486, in __init__
    super(AsynPool, self).__init__(processes, *args, **kwargs)
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/pool.py", line 1046, in __init__
    self._create_worker_process(i)
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 503, in _create_worker_process
    return super(AsynPool, self)._create_worker_process(i)
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/pool.py", line 1142, in _create_worker_process
    on_ready_counter = self._ctx.Value('i')
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/context.py", line 182, in Value
    ctx=self.get_context())
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/sharedctypes.py", line 83, in Value
    lock = ctx.RLock()
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/context.py", line 110, in RLock
    return RLock(ctx=self.get_context())
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/synchronize.py", line 207, in __init__
    SemLock.__init__(self, RECURSIVE_MUTEX, 1, 1, ctx=ctx)
  File "/home/paperboy/venv_paperboy_py3/lib/python3.6/site-packages/billiard/synchronize.py", line 72, in __init__
    kind, value, maxvalue, self._make_name(), unlink_now,
PermissionError: [Errno 13] Permission denied
thedrow commented 4 years ago

That's strange. 600 means the owner can read and write to that file. Try to chmod /dev/shm/ to 777 and rerun the test case.

There's an old redhat bug that sets it to 755.

In any case, we should catch that error and provide a fallback.

Starou commented 4 years ago

Indeed, on that Debian Wheezy instance, for some reason, /dev/shm/ is mounted with the 755 flags instead of 777.

thedrow commented 4 years ago

What kind of fallback can we provide? Maybe we should catch that error and notify the user that it is possible that their /dev/shm permissions is incorrect?

AvnerCohen commented 4 years ago

@thedrow Any idea maybe what is the source of this regression? Can this be avoided all together?

Starou commented 4 years ago

What kind of fallback can we provide? Maybe we should catch that error and notify the user that it is possible that their /dev/shm permissions is incorrect?

Maybe in giving the actual path that raise the exception because it may vary from an OS to another ?

thedrow commented 4 years ago

@Starou Any idea how can we do that?

Starou commented 4 years ago

@thedrow: no, but I can look into It.

aliceminotto commented 4 years ago

I'm on xenial and see the same issue

thedrow commented 4 years ago

I'm on xenial and see the same issue

Did you check the permissions for /dev/shm?

aliceminotto commented 4 years ago

I'm on xenial and see the same issue

Did you check the permissions for /dev/shm?

They are drwxrwxrwt 2 root root 280 Apr 21 15:00 . which seems correct to me

thedrow commented 4 years ago

@AvnerCohen Did you check on your machines?

AvnerCohen commented 4 years ago

@thedrow First thing firsts, the issue around /dev/shm is not related here, you can see that in my original issue gthe permission denied is on - /tmp/pymp

I created a full repro in a docker, so should be pretty easy to move on now: https://github.com/AvnerCohen/repro-billiard-288

Once built, you can uncomment the downgrade of billiard and you'll see that this is fixed: https://github.com/AvnerCohen/repro-billiard-288/blob/master/requirements.txt#L3

I still have few tests to do and I think I can further narrow down the issue now, but wanted to share that maybe others are also impacted by it.

cysio528 commented 4 years ago

Have same issue on centos 7, As mentioned in first post, downgrading to billiard==3.6.0.0 solved the problem

thedrow commented 4 years ago

I don't have an idea why you get a permission error. This happens outside of Docker too right?

AvnerCohen commented 4 years ago

Correct, original issue is outside of docker. It is related to the specific umask assigned to the forked process: https://github.com/AvnerCohen/repro-billiard-288/blob/master/runner.py#L4

But I can't seem to figure out why this happens only on 3.6.1 and not on earlier versions.

thedrow commented 4 years ago

So maybe we can fix this by setting a different umask after forking?

pva commented 3 years ago

We have a similar problem with dev-python/billiard-3.6.3.0. The only difference is that we use gentoo linux and we need /run to be world writable:

[pid 33685] openat(AT_FDCWD, "/run/2IU0th", O_RDWR|O_CREAT|O_EXCL, 0600) = -1 EACCES (Permission denied)

Downgrade to 3.6.0 fixed this problem as well.

pva commented 3 years ago

I've made git bisect and found: 241c0d4782ba6c24bd5f2777ed5d1e084d39501c is the first bad commit

I've tried to revert this commit and yes, celery works with the current master without this commit.

thedrow commented 3 years ago

Why would that commit cause a permission issue? Also, are you sure you can't reproduce this bug? Have you tried multiple times?

pva commented 3 years ago

Yes, bug is perfectly reproducible and I have no idea how that commit breaks anything. My only guess is that it is somehow related to locking and that is why it causes regression. Anyway, as I said, revert back worked for me.

One important thing is that once I've mounted /dev/shm as tmpfs everything started to work again. So it's not only a permissions problem. /dev/shm should be tmpfs mounted.

AvnerCohen commented 3 years ago

I am not sure how relevant is it to other people and cases, in the repro I created (https://github.com/AvnerCohen/repro-billiard-288) the issue and the fix is:

import os
pid = os.fork()os.umask(0o177) ## Fails
#os.umask(0o477) ## works
os.system('celery -A worker worker --loglevel=info')

E.g. mask change to allow write/execute makes this work for us.

AvnerCohen commented 3 years ago

BTW - looking at the code changes in the faulty commit (https://github.com/celery/billiard/commit/241c0d4782ba6c24bd5f2777ed5d1e084d39501c) I am guessing the change is in Line 828

with on_ready_counter.get_lock():

Haven't drilled down, but I am guessing this is where a write/execute permission is required.

auvipy commented 2 years ago

what alternative should we use to get rid of that