Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

SIG_PENDING_DIE_COUNT kills program in new situation #14233

Open p5pRT opened 10 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#123188 (status was 'open')

Searchable as RT123188

p5pRT commented 10 years ago

From d.white@imperial.ac.uk

Created by d.white@imperial.ac.uk

This is a bug report for perl from d.white@imperial.ac.uk, generated with the help of perlbug 1.39 running under perl 5.18.2.

-----------------------------------------------------------------

As in rejected bug #112404, we have just encountered the SIG_PENDING_DIE_COUNT limit. On Perl 5.18.2 on Ubuntu 14.04 Linux, we have some locally developed, highly threaded Perl code that controls a set of 40-80 firewalled desktop PCs in "exam mode". The controller manages a set of ssh connections to the client desktop PCs, telling them when to start into exam mode, then every N minutes telling them to take a copy of the local exam directory where the student is working, then at the end tells the PCs to exit exam mode and reboot. It uses a pool of worker threads, various ": shared" variables, and a Thread::Queue of work-jobs (names of hosts to send a command to). It does no explicit signal handling.

This code has worked reliably across multiple Perl versions on multiple Ubuntu distros for approx 6 years, but on Ubuntu 14.04, almost immediately we hit the dreaded

"Maximal count of pending signals (120) exceeded"

inside Thread::Queue line 70. This is in the dequeue() function, where it appears to be using signals to build condition variables, waiting and signalling on one.

We have rebuilt our own Perl 5.18.2 experimental version with

# define SIG_PENDING_DIE_COUNT 520

and invested approximately 8 hours of time in experiments on 80 desktop PCs, and found that this completely resolves our issue, but this is a horrid kludge, given that we GUESSED the value 520 :-)

May I request that this code, the rationale for which was described in:

www.nntp.perl.org/group/perl.perl5.porters/2006/12/msg119236.html

[in summary: added in Dec 2006 to Perl 5.8.8 to fix a specific bug on OS/2, with an arbitrary limit of 120 pending signals]

makes no sense now and should be altered, removed, or reimplemented, because it is actively breaking real code.

One possible minimally intrusive solution would be to turn the limit from a constant to a variable and add a command line option to perl to allow us to set the limit to an arbitrary value. We would be willing to write and submit such a patch if this would help.

Best Wishes, Duncan White, Systems Manager, Computing Support Group, Dept of Computing, Imperial College London.

Perl Info ``` Flags: category=core severity=medium Site configuration information for perl 5.18.2: Configured by Debian Project at Thu Mar 27 18:28:21 UTC 2014. Summary of my perl5 (revision 5 version 18 subversion 2) configuration: Platform: osname=linux, osvers=3.2.0-58-generic, archname=x86_64-linux-gnu-thread-multi uname='linux brownie 3.2.0-58-generic #88-ubuntu smp tue dec 3 17:37:58 utc 2013 x86_64 x86_64 x86_64 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Dldflags= -Wl,-Bsymbolic-functions -Wl,-z,relro -Dlddlflags=-shared -Wl,-Bsymbolic-functions -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.18 -Darchlib=/usr/lib/perl/5.18 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.18.2 -Dsitearch=/usr/local/lib/perl/5.18.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.18.2 -des' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=define, use64bitall=define, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include' ccversion='', gccversion='4.8.2', gccosandvers='' intsize=4, longsize=8, 
ptrsize=8, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=, so=so, useshrplib=true, libperl=libperl.so.5.18.2 gnulibc_version='2.19' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector' Locally applied patches: DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN. DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check. DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information. DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories. DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes. DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking DEBPKG:fixes/respect_umask - Respect umask during installation DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories DEBPKG:debian/extutils_set_libperl_path - EU:MM: Set location of libperl.a to /usr/lib DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets. 
DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor. DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy. DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable. DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need. DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local DEBPKG:debian/cpanplus_definstalldirs - http://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default. DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl. 
DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/702096 Point users to Debian packages of deprecated core modules DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.18.2-2ubuntu1 in patchlevel.h DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags} DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text DEBPKG:debian/hurd_test_skip_stack - http://bugs.debian.org/650175 Disable failing GNU/Hurd tests dist/threads/t/stack.t DEBPKG:fixes/manpage_name_Test-Harness - http://bugs.debian.org/650451 [rt.cpan.org #73399] cpan/Test-Harness: add NAME headings in modules with POD DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/660195 [rt.cpan.org #28632] Make EU::MM pass LD through to recursive Makefile.PL invocations DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected DEBPKG:fixes/net_ftp_failed_command - [rt.cpan.org #37700] http://bugs.debian.org/491062 Net::FTP: cope gracefully with a failed command DEBPKG:fixes/perlbug-patchlist - [3541c11] http://bugs.debian.org/710842 [perl #118433] Make perlbug look up the list of local patches at 
run time DEBPKG:fixes/module_metadata_security_doc - [68cdd4b] CVE-2013-1437 documentation fix DEBPKG:fixes/module_metadata_taint_fix - [bff978f] http://bugs.debian.org/722210 [rt.cpan.org #88576] untaint version, if needed, in Module::Metadata DEBPKG:fixes/IPC-SysV-spelling - http://bugs.debian.org/730558 [rt.cpan.org #86736] Fix spelling of IPC_CREAT in IPC-SysV documentation DEBPKG:fixes/fix-undef-source - @INC for perl 5.18.2: /homes/dcw/perl5/lib/perl5/x86_64-linux-gnu-thread-multi /homes/dcw/perl5/lib/perl5/x86_64-linux-gnu-thread-multi /homes/dcw/perl5/lib/perl5 /etc/perl /usr/local/lib/perl/5.18.2 /usr/local/share/perl/5.18.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.18 /usr/share/perl/5.18 /usr/local/lib/site_perl . Environment for perl 5.18.2: HOME=/homes/dcw LANG=en_GB.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH=/usr/lib:/usr/openwin/lib:/usr/dt/lib:/opt/lib:/usr/lib/x86_64-linux-gnu:/usr/local/pgsql/lib:/opt/gurobi562/linux64/lib:/usr/local/pgsql/lib:/opt/gurobi562/linux64/lib LOGDIR (unset) PATH=/homes/dcw/perl5/bin:/homes/dcw/c-tools/bin:/homes/dcw/c-tools/bin/x86_64:/usr/lib/ccache:/usr/local/condor/bin:/homes/dcw/bin:/homes/dcw/bin/x86_64:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/sbin:/usr/sbin:/usr/games:.:/opt/gurobi562/linux64/bin:/homes/dcw/.rvm/bin PERL5LIB=/homes/dcw/perl5/lib/perl5/x86_64-linux-gnu-thread-multi:/homes/dcw/perl5/lib/perl5 PERL_BADLANG (unset) PERL_LOCAL_LIB_ROOT=/homes/dcw/perl5 PERL_MB_OPT=--install_base /homes/dcw/perl5 PERL_MM_OPT=INSTALL_BASE=/homes/dcw/perl5 SHELL=/bin/tcsh ```
p5pRT commented 10 years ago

From @jkeenan

1. Can you suggest any way to reproduce this problem outside of your customized environment?

2. Have you ruled out the possibility that the problem lies in the version of Ubuntu you are using (which I also use) as distinct from Perl? If so, how have you made that rule-out?

3. Is the 'perl -V' output attached to your bug report from the machine where you observed the problems? If not, can you provide that output?

Thank you very much. Jim Keenan

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @Leont

On Wed, Nov 12, 2014 at 4:09 PM, via RT <perlbug-followup@perl.org> wrote:

This code has worked reliably across multiple Perl versions on multiple Ubuntu distros for approx 6 years, but on Ubuntu 14.04, almost immediately we hit the dreaded

"Maximal count of pending signals (120) exceeded"

inside Thread::Queue line 70. This is in the dequeue() function, where it appears to be using signals to build condition variables, waiting and signalling on one.

Thread::Queue doesn't use signals. It would be helpful to strace your program to figure out what signal you're getting and where it comes from. (I should note that this can only happen for signals that you have set a signal handler for but have not blocked.)

We have rebuilt our own Perl 5.18.2 experimental version with

# define SIG_PENDING_DIE_COUNT 520

and invested approximately 8 hours of time in experiments on 80 desktop PCs, and found that this completely resolves our issue, but this is a horrid kludge, given that we GUESSED the value 520 :-)

May I request that this code, the rationale for which was described in:

It's just as arbitrary as the previous value. Odds are it would trigger again under a higher load. I'd rather understand the issue before paving over it.

makes no sense now and should be altered, removed, or reimplemented, because it is actively breaking real code.

Probably.

One possible minimally intrusive solution would be to turn the limit from a constant to a variable and add a command line option to perl to allow us to set the limit to an arbitrary value. We would be willing to write and submit such a patch if this would help.

There is some precedent for such things (such as the PERL_SIGNALS environment variable).

Leon
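As an illustration of the "constant to variable" idea discussed above, here is a minimal C sketch of reading such a limit once at startup. It is not a patch against perl: the environment variable name `PERL_SIG_PENDING_LIMIT` and the helper `pending_limit_from_env` are hypothetical, and only the default of 120 comes from this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>

/* Default taken from the thread; the real constant is SIG_PENDING_DIE_COUNT. */
#define DEFAULT_PENDING_LIMIT 120

/*
 * Hypothetical helper: return the pending-signal limit named by the
 * environment variable `name`, or the default when the variable is
 * unset, empty, not a number, or not positive.
 */
long pending_limit_from_env(const char *name)
{
    const char *text = getenv(name);
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return DEFAULT_PENDING_LIMIT;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0)
        return DEFAULT_PENDING_LIMIT;   /* reject garbage instead of guessing */

    return value;
}
```

A caller would invoke `pending_limit_from_env("PERL_SIG_PENDING_LIMIT")` once during interpreter start-up and compare the pending count against the returned value instead of a compile-time constant.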

p5pRT commented 10 years ago

From d.white@imperial.ac.uk

Dear James,

On 13/11/14 01:41, James E Keenan via RT wrote:

1. Can you suggest any way to reproduce this problem outside of your customized environment?

I was thinking about that. I might be able to whip up a test version that only does the thread and thread queue work, by extracting a minimal part of the existing system (and none of the client code), and then see whether that suffers the problem too. Whether ssh can be eliminated is unclear; I'd love to, of course! Leave this with me over the weekend to think further on.

2. Have you ruled out the possibility that the problem lies in the version of Ubuntu you are using (which I also use) as distinct from Perl? If so, how have you made that rule-out?

That is, of course, totally possible at several levels.

- Perl 5.18.2 as packaged by Ubuntu has many Ubuntu-specific patches. We could build a vanilla perl.org Perl 5.18.2 and try that (with the unchanged limit of 120 pending signals) if you like.

- Underneath Perl itself, gcc could compile Perl slightly differently and change something subtle wrt speed, or the Linux kernel could handle signals slightly differently, or deliver them slightly faster. There's really nothing we can do to eliminate such possibilities: we are an academic department in the middle of term and of the academic year, and we can't change C compiler, Linux kernel and/or OS as a test.

I should mention that all the PCs in the tests (the controller and its 79 clients) are unchanged hardware from last year (when we did not experience the problem with Perl 5.14 on Ubuntu 13.04). So at least we can eliminate new faster hardware delivering signals faster and tripping the limit :-)

I would still argue that the presence of an arbitrary 8 year old limit of 120 pending signals is unprincipled (as the original patch submitter said back in 2006!) and is therefore a bad thing.

3. Is the 'perl -V' output attached to your bug report from the machine where you observed the problems? If not, can you provide that output?

Yes, it is from the machine where we observed the problem. I checked that carefully :-) We have also experienced the problem running the controller software on several identically configured PCs in our lab.

Thank you very much. Jim Keenan

Thanks for getting in touch so quickly.

cheers Duncan White

p5pRT commented 10 years ago

From @craigberry

On Thu, Nov 13, 2014 at 12:18 PM, Duncan White <d.white@imperial.ac.uk> wrote:

I would still argue that the presence of an arbitrary 8 year old limit of 120 pending signals is unprincipled (as the original patch submitter said back in 2006!) and is therefore a bad thing.

Yep:

http://perl5.git.perl.org/perl.git/commit/2563cec55ae473562ff3ccda41cd10289db419be?f=mg.c

I agree it's not nice to have these guessed-at, hard-coded numbers, but presumably the same limit was in previous Perl versions that worked for you, so there must be more to the story. As tempting as it is to simply remove the limit, there is no reason to believe the original reason for its existence has gone away. A resource exhaustion that leads to a deadlock is a lot nastier than a program dying, even if the limit chosen to prevent that on OS/2 is quite inappropriate for other systems. It's also probably quite difficult to guess at a better value that is portable.

FWIW, in general, newer Perls should have fewer pending signals because PERL_ASYNC_CHECK() is called in places it wasn't earlier (sorry, haven't researched the exact timeline).

A quick look at threads::shared (used by Thread::Queue) indicates that cond_signal will be using either condition_signal or pthread_cond_signal under the hood, which makes me skeptical that what it's doing would even be incrementing PL_sig_pending and contributing to the limit being hit.

So I think the real question here is what causes a lot more signals to be pending than was the case when this worked. If there is a particular Perl op taking a lot longer than it used to, that would be one way, because the PL_sig_pending count increments until the op completes and the pending signals are delivered.

If you are stuck building your own Perl from source for now, you should be able to get what you need by configuring like so:

./Configure -Accflags=-DSIG_PENDING_DIE_COUNT=N

where N is 520 or whatever number works for you.

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

Dear all,

On 14/11/14 00:51, Craig Berry via RT wrote:

I agree it's not nice to have these guessed-at, hard-coded numbers, but presumably the same limit was in previous Perl versions that worked for you, so there must be more to the story. As tempting as it is to simply remove the limit, there is no reason to believe the original reason for its existence has gone away. A resource exhaustion that leads to a deadlock is a lot nastier than a program dying, even if the limit chosen to prevent that on OS/2 is quite inappropriate for other systems. It's also probably quite difficult to guess at a better value that is portable.

Understood.

So I think the real question here is what causes a lot more signals to be pending than was the case when this worked. If there is a particular Perl op taking a lot longer than it used to, that would be one way, because the PL_sig_pending count increments until the op completes and the pending signals are delivered.

I agree we need to investigate this further. We are working on a cutdown set of programs that we can share with you guys so that you can try to reproduce the problem, and we are also investigating vanilla perl vs Ubuntu perl (to try to eliminate at least 1 degree of freedom).

If you are stuck building your own Perl from source for now\, you should be able to get what you need by configuring like so​:

./Configure -Accflags=-DSIG_PENDING_DIE_COUNT=N

where N is 520 or whatever number works for you.

That's useful info. We built a vanilla 5.18.2 perl with the #define changed to 520, but a configure argument is much neater. My colleagues used that version on Friday for the next electronic exam, and even with that limit they hit it once, which impressed me! This is further evidence that just changing the limit is a bad idea; understanding is better.

I'll send more info when I have it.

cheers Duncan

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

Update: We have done a number of additional tests and investigations, and we now heavily suspect a change introduced in Perl 5.16

http://search.cpan.org/dist/perl-5.16.0/pod/perldelta.pod

"system now temporarily blocks the SIGCHLD signal handler, to prevent the signal handler from stealing the exit status [perl #105700]."

https://rt.perl.org/Public/Bug/Display.html?id=105700

talks of system() returning -1 [which we see, btw] and blocking signals. In our fertile imaginations, blocking signals while system() runs (our code, remember, does LOTS of system() calls, mainly scp commands) could easily lead to large numbers of pending signals!

In more detail:

- we have built and used several Perl 5.18.2s (Ubuntu pkg build; vanilla build from perl.org, 520-pending-signals version of vanilla) and they ALL show the bug.

- we have built a vanilla Perl 5.14 and CANNOT get that to show the bug, despite intensive testing. We hypothesize that Perl 5.14 DOES NOT HAVE the problem, which matches the fact that last year we were using Perl 5.14 on Ubuntu 13.04 and never experienced this problem.

- we are just building a vanilla Perl 5.16 and will test it shortly; if we are correct about [perl #105700] we expect to see the bug. Of course I will confirm/deny that as soon as I can, given its importance!

- we have built several cutdown versions of our code, all of which strip out the "site-specific exam system", that we can share with you, and we are bisecting between one which shows the bug and one which does not. These versions cut down the original exam system from 2500 lines of perl to between 1400 and 1800 lines of perl. We are still trying to cut it down further, but please say if you want us to share our code despite its size.

- finally, I have built yet another Perl 5.18.2 with a hack patch I wrote to write a time series of the number of pending signals vs epoch time. This shows the number of pending signals varying randomly between 1 and 77, largely independent of the number of worker threads used. This may add evidence to our idea that blocking pending signals while system() runs may well lead to much larger numbers of pending signals than the "120 is ludicrous" original context envisaged.

cheers Duncan

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

Dear all,

On 19/11/14 13:34, I (Duncan White) wrote:

In more detail:

- we have built and used several Perl 5.18.2s (Ubuntu pkg build; vanilla build from perl.org, 520-pending-signals version of vanilla) and they ALL show the bug.

I should stress here that this means that our earlier belief that the 520 pending signals version "reliably fixes our problem" was wrong; under greater load ALL 5.18.2 builds experience the problem, whether the limit is set to 120 or 520.

- we have built a vanilla Perl 5.14 and CANNOT get that to show the bug, despite intensive testing. We hypothesize that Perl 5.14 DOES NOT HAVE the problem, which matches the fact that last year we were using Perl 5.14 on Ubuntu 13.04 and never experienced this problem.

- we are just building a vanilla Perl 5.16 and will test it shortly; if we are correct about [perl #105700] we expect to see the bug. Of course I will confirm/deny that as soon as I can, given its importance!

I spent much of this afternoon running tests using our vanilla Perl 5.16 under conditions of high load, and I can confirm that Perl 5.16 DOES suffer the bug.

Thus, our current position is that Perl 5.14 does NOT suffer the bug, while vanilla Perl 5.16 and several Perl 5.18 builds *DO* suffer the bug. We further speculate that the change made for

https://rt.perl.org/Public/Bug/Display.html?id=105700

(introduced in Perl 5.16) may be the cause of the changed behaviour.

We will continue to experiment with our cutdown versions\, attempting to derive the smallest possible test program exhibiting the "too many pending signals" behaviour.

cheers Duncan

p5pRT commented 9 years ago

From @Leont

On Wed, Nov 19, 2014 at 2:34 PM, Duncan White <d.white@imperial.ac.uk> wrote:

talks of system() returning -1 [which we see, btw] and blocking signals. In our fertile imaginations, blocking signals while system() runs (our code, remember, does LOTS of system() calls, mainly scp commands) could easily lead to large numbers of pending signals!

If a signal is blocked, it will not be delivered to perl in the first place until it is unblocked, so it won't lead to signals pending in userland (it's instead pending in kernel-land). I don't see how that can explain what you're observing by itself.

Also, shortly after the system() call the delayed signal handler should run (probably in pp_nextstate or pp_and), so this does not sound like a logical explanation. You could try to compile with that commit reverted and see what happens, though; there may be something non-obvious going on.

Can you «strace -e signal» your program? That should tell you more about what's going on.

Leon

p5pRT commented 9 years ago

From @craigberry

On Wed, Nov 19, 2014 at 11:17 AM, Leon Timmermans <fawaka@gmail.com> wrote:

If a signal is blocked, it will not be delivered to perl in the first place until it is unblocked, so it won't lead to signals pending in userland (it's instead pending in kernel-land). I don't see how that can explain what you're observing by itself.

I wonder if the mix of signals and threads is involved. We're blocking and then unblocking SIGCHLD using sigprocmask rather than pthread_sigmask. If you unblock SIGCHLD process-wide and have 100 threads in flight, do they all get the signal? More specifically, does *each* thread increment its own copy of PL_sig_pending whenever *any* thread gets a signal? Not sure how many threads the OP has, but I got the impression it was more than a handful (one per exam taker or something).

Just thinking out loud and haven't really analyzed anything.

p5pRT commented 9 years ago

From @Leont

On Wed\, Nov 19\, 2014 at 7​:28 PM\, Craig A. Berry \craig\.a\.berry@​gmail\.com wrote​:

I wonder if the mix of signals and threads is involved. We're blocking and then unblocking SIGCHLD using sigprocmask rather than pthread_sigprocmask. If you unblock SIGCHLD process-wide and have 100 threads in flight\, do they all get the signal?

sigprocmask is unspecified in a multithreaded program. Signal masks are per thread though\, not per process. AFAIK on 1:1 threading implementations (like Linux) pthread_sigmask is sigprocmask; on M:N implementations (OpenBSD and old Solaris) it may mask an OS thread instead of a userland thread (which is not what you want).

More specifically\, does *each* thread increment its own copy of PL_sig_pending whenever *any* thread gets a signal?

A process-targeted signal (which is almost any signal except faults and pthread_kill) may be delivered to any thread\, but it will only be delivered once. Most operating systems have a preference for delivering to the main thread (it makes background threads in naive programs easier).

Actually\, that may well be the issue here. The child threads generate signals\, but the main thread gets them delivered. If it's blocking/restarting for whatever reason\, they will keep pending there even though the child threads would have been able to handle them.

Not sure how many threads the OP has\, but I got the impression it was more than a handful (one per exam taker or something).

Threads and signals are a problematic combination. You really don't want to combine them when you can avoid it.

Leon
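
The "pending signals" count at the centre of this report can be pictured with a small model. The C sketch below is a simplification written for this thread, not perl's actual code: the struct and function names are invented, the limit of 120 is taken from the error message quoted in the original report, and the exact comparison used at the limit is an assumption. It shows why a thread that keeps receiving signals without reaching a safe point to run their handlers eventually trips the limit, while one that dispatches regularly does not.

```c
#include <stddef.h>

/* Assumption: mirrors the "(120)" in the reported error message. */
#define PENDING_LIMIT 120

/* Hypothetical per-interpreter bookkeeping for deferred signals. */
struct deferred_signals {
    unsigned pending;     /* caught by the C-level handler, not yet run */
    unsigned dispatched;  /* handlers run at a safe point so far */
};

/*
 * Models the C-level handler: it only records that a signal arrived.
 * Returns 0 normally, -1 when the pending count reaches the limit
 * (the point at which the real interpreter gives up and dies).
 */
int deferred_catch(struct deferred_signals *d) {
    if (++d->pending >= PENDING_LIMIT)
        return -1;
    return 0;
}

/*
 * Models a safe point between operations: everything recorded so far
 * is handled and the pending count drops back to zero.
 * Returns how many signals were dispatched by this call.
 */
unsigned deferred_dispatch(struct deferred_signals *d) {
    unsigned n = d->pending;
    d->dispatched += n;
    d->pending = 0;
    return n;
}
```

In this model, 300 signals with a dispatch every 50 never reach the limit, whereas 120 signals with no dispatch in between do. That matches the hypothesis that signals pile up in a thread that is not getting a chance to run its handlers.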

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

On 19/11/14 13​:34\, I (Duncan White) wrote​: I spent much of this afternoon running tests using our vanilla Perl 5.16 under conditions of high load\, and I can confirm that Perl 5.16 DOES suffer the bug.

Thus\, our current position is that Perl 5.14 does NOT suffer the bug\, while vanilla perl 5.16 and several Perl 5.18 builds *DO* suffer the bug. We further speculate that

https://rt.perl.org/Public/Bug/Display.html?id=105700

(introduced in Perl 5.16) may be the cause of the changed behaviour.

We will continue to experiment with our cutdown versions\, attempting to derive the smallest possible test program exhibiting the "too many pending signals" behaviour.

Updates of our investigations\, plus answering several points Leon and Craig have made​:

- Leon suggested we run strace -e signal on our code\, done\, attached.   Shows lots of signal blocking and lots of sigchlds. Attached\,   although I'm not sure what to deduce from it in this case.

- Leon suggested that signals and threads may interact badly. While   I appreciate that Perl threads are "deprecated"\, I should stress   that my threaded Perl app has worked reliably for >6 years and   is *still* working reliably when the arbitrary limit of "120   pending signals" is not breached.

- Craig asked about how many threads I have\, yes\, typically 40-80\,   one per controlled "slave" PC. All of my recent tests used 80 threads   and 80 machines\, and the probability of crashing does appear to   increase with number of threads.

- important new info​: Leon suggested that we compile a Perl 5.18.2 with   patch 105700 removed and report back​: we've done that\, and the bug   DOES NOT APPEAR.

  We've run over 1000 "dump all machines" cycles under conditions of   high load. Under those conditions every run with the system Perl   5.18.2 (or any other 5.18.2 build\, eg vanilla) hits the "more than   120 pending signals" error within 100 runs.

  This is substantial evidence that somehow the (sensible) block-   sigchld-in-system patch introduced in Perl 5.16 by Leon is   interacting badly with the "more than 120 pending signals"   patch. Hypothetically\, I would say that the block-sigchld-in-system   patch is "magnifying" or "concentrating" the number of pending   signals\, triggering the "more than 120 pending signals" patch that   was already there. But that last sentence is speculation.

- To summarise the evidence so far​:

  - we have NOT SEEN the "more than 120 pending signals" bug   hitting us using any Perl 5.14 build.

  - a vanilla Perl 5.16 build DOES show this bug.

  - only one version of perl 5.18.2 does NOT SHOW this bug - this is   the above-mentioned "with patch 105700 removed".

  - every OTHER version of perl 5.18.2 DOES show this bug.

- The second strand of investigation is cutting down our exam-control   program so I can share it with you all. I have reduced 2500 lines   of Perl to ~850 lines so far\, ripping out all the "client reconfigures   itself into exam mode" logic. I attach a tarball of this code   for you now\, it contains a README explaining what I think you'll   need to test it (a mere 40-80 hosts​:-)). Please ask if anything   is not clear.

- The final strand of investigation is that I have built a version of   Perl 5.18.2 with a diagnostic patch to log a time series of numbers   of pending signals over time\, I will now build another version of   5.18.2 with that patch added AND the block-sigchild-in-system patch   removed. I will attempt to gather two sets of timeseries data and   graph them. I **expect** of course that these will show the   "magnification of pending signals" property - but of course I don't   know that yet​:-)

My "working conclusion" is still that the 120 pending signals patch is utterly arbitrary and stupid and (at least in my situation) is tripping when the Perl interpreter is in no difficulty at all\, but that Perl 5.16 and later have somehow "magnified" this effect horribly.

cheers Duncan

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

log05-strace

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

08a-share-with-perlguys.tgz

p5pRT commented 9 years ago

From @Leont

On Fri\, Nov 21\, 2014 at 4​:33 PM\, Duncan White \d\.white@​imperial\.ac\.uk wrote​:

- Leon suggested we run strace -e signal on our code\, done\, attached. Shows lots of signal blocking and lots of sigchlds. Attached\, although I'm not sure what to deduce from it in this case.

Not much more than confirming what we already suspected :-/

- Leon suggested that signals and threads may interact badly. While I appreciate that Perl threads are "deprecated"\, I should stress that my threaded Perl app has worked reliably for >6 years and is *still* working reliably when the arbitrary limit of "120 pending signals" is not breached.

I didn't mean to suggest this wasn't a serious issue\, we do take regressions seriously.

- important new info​: Leon suggested that we compile a Perl 5.18.2 with patch 105700 removed and report back​: we've done that\, and the bug DOES NOT APPEAR.

Yes\, that is very relevant.

This is substantial evidence that somehow the (sensible) block-sigchld-in-system patch introduced in Perl 5.16 by Leon is interacting badly with the "more than 120 pending signals" patch. Hypothetically\, I would say that the block-sigchld-in-system patch is "magnifying" or "concentrating" the number of pending signals\, triggering the "more than 120 pending signals" patch that was already there. But that last sentence is speculation.

Agreed. My hypothesis is that it's because of signals being delivered to other threads than the one that caused them.

- The second strand of investigation is cutting down our exam-control program so I can share it with you all. I have reduced 2500 lines of Perl to ~850 lines so far\, ripping out all the "client reconfigures itself into exam mode" logic. I attach a tarball of this code for you now\, it contains a README explaining what I think you'll need to test it (a mere 40-80 hosts​:-)). Please ask if anything is not clear.

I have attached a tiny program that shows exactly this problem. The parameters may be different on other computers but for me 25​:10 works. It prints out the number of perl-level signal handlers run\, this suggests signals are getting merged on a Perl level (they are far too low).

- The final strand of investigation is that I have built a version of Perl 5.18.2 with a diagnostic patch to log a time series of numbers of pending signals over time\, I will now build another version of 5.18.2 with that patch added AND the block-sigchild-in-system patch removed. I will attempt to gather two sets of timeseries data and graph them. I **expect** of course that these will show the "magnification of pending signals" property - but of course I don't know that yet​:-)

Could be interesting. The count is per thread\, so some kind of thread id with each such message may be useful.

Leon

p5pRT commented 9 years ago

From @Leont

test

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

Dear Leon\,

On 21/11/14 17​:44\, Leon Timmermans wrote​:

On Fri\, Nov 21\, 2014 at 4:33 PM\, Duncan White \<d.white@imperial.ac.uk> wrote:

- Leon suggested we run strace -e signal on our code\, done\, attached.
  Shows lots of signal blocking and lots of sigchlds.  Attached\,
  although I'm not sure what to deduce from it in this case.

Not much more than confirming what we already suspected :-/

sure​:-)

- Leon suggested that signals and threads may interact badly.  While
  I appreciate that Perl threads are "deprecated"\, I should stress
  that my threaded Perl app has worked reliably for >6 years and
  is *still* working reliably when the arbitrary limit of "120
  pending signals" is not breached.

I didn't mean to suggest this wasn't a serious issue\, we do take regressions seriously.

I wasn't criticising\, I just wanted to dispel the "here be dragons" feeling I got from your hint:-)

- important new info: Leon suggested that we compile a Perl 5.18.2 with
  patch 105700 removed and report back: we've done that\, and the bug
  DOES NOT APPEAR.

Yes\, that is very relevant.

  This is substantial evidence that somehow the (sensible)
  block-sigchld-in-system patch introduced in Perl 5.16 by Leon is
  interacting badly with the "more than 120 pending signals"
  patch.  Hypothetically\, I would say that the block-sigchld-in-system
  patch is "magnifying" or "concentrating" the number of pending
  signals\, triggering the "more than 120 pending signals" patch that
  was already there.  But that last sentence is speculation.

Agreed. My hypothesis is that it's because of signals being delivered to other threads than the one that caused them.

Fascinating! could well be right.

I have attached a tiny program that shows exactly this problem. The parameters may be different on other computers but for me 25​:10 works. It prints out the number of perl-level signal handlers run\, this suggests signals are getting merged on a Perl level (they are far too low).

Impressive! now that's a proper test case\, showing the problem quickly! I confirm that "25​:10" and "20​:10" trigger the problem for me with our system perl.

- The final strand of investigation is that I have built a version of
  Perl 5.18.2 with a diagnostic patch to log a time series of numbers
  of pending signals over time\, I will now build another version of
  5.18.2 with that patch added AND the block-sigchild-in-system patch
  removed.  I will attempt to gather two sets of timeseries data and
  graph them.  I **expect** of course that these will show the
  "magnification of pending signals" property - but of course I don't
  know that yet:-)

Could be interesting. The count is per thread\, so some kind of thread id with each such message may be useful.

Umm\, as yet I don't know how (in mg.c) to access the Perl thread id\, if you could tell me that I'll give it a try. I had enough difficulty realising to use PerlIO_printf etc rather than fprintf:-)

I now have some time series data\, the (utterly crap) diagnostic patch that generates it\, two time series runs for about a minute\, and the gnuplot commands to draw a graph\, and a postscript form of the graph (output.ps)\, all in the attached tarball. I should say that the perl interpreters with the diag patch are really unreliable\, they often hang when running the "server" version in the earlier tarball; to get the two timeseries I've plotted\, I had to ctrl-c and rerun about 5 times​:-)

But I think the data is sound\, and shows the "magnification effect" brilliantly! see the README in the timeseries tarball for more details.

Leon

Good luck tracking things down\, I'm off home now. I'll intermittently check email over the weekend\, and catch up properly on Monday.

cheers Duncan

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

timeseries.tgz

p5pRT commented 9 years ago

From @Leont

On Fri\, Nov 21\, 2014 at 6:44 PM\, Leon Timmermans \fawaka@gmail\.com wrote:

It prints out the number of perl-level signal handlers run\, this suggests signals are getting merged on a Perl level (they are far too low).

Except that I had already removed that\, since it didn't prove to be useful...

Leon

p5pRT commented 9 years ago

From @craigberry

On Fri\, Nov 21\, 2014 at 12:42 PM\, Duncan White \d\.white@imperial\.ac\.uk wrote:

Dear Leon\,

On 21/11/14 17​:44\, Leon Timmermans wrote​:

I have attached a tiny program that shows exactly this problem. The parameters may be different on other computers but for me 25​:10 works. It prints out the number of perl-level signal handlers run\, this suggests signals are getting merged on a Perl level (they are far too low).

Impressive! now that's a proper test case\, showing the problem quickly! I confirm that "25​:10" and "20​:10" trigger the problem for me with our system perl.

I worked my way up to a $count of 500 with 100 threads on both Mac OS X and VMS and it never fails. Those are two threading implementations that are completely different from each other and (as far as I know) from Linux. This was with blead\, so I guess it's possible we fixed something else since 5.18.x\, but it seems more likely that Linux thread handling is what's different. I know very little except what I've read at \<http​://www.linuxprogrammingblog.com/all-about-linux-signals?page=show>. It does say that in addition to a process-wide signal mask there is a separate signal mask per thread and that you call pthread_sigmask to set it. We aren't doing that when we block SIGCHLD around a system() call\, so it might be worth a try.

p5pRT commented 9 years ago

From @Leont

On Fri\, Nov 21\, 2014 at 7:42 PM\, Duncan White \d\.white@imperial\.ac\.uk wrote:

I wasn't criticising\, I just wanted to dispel the "here be dragons" feeling I got from your hint:-)

This whole class of issues is fairly "here be dragons" if you ask me.

Umm\, as yet I don't know how (in mg.c) to access the Perl thread id\, if you could tell me that I'll give it a try. I had enough difficulty realising to use PerlIO_printf etc rather than fprintf:-)

Getting the perl-level tid is a bit tricky I'm afraid\, but printing aTHX with "%p" would already be quite useful. Actually\, PL_curinterp == aTHX (true in the main thread) would likely be all the information we need.

I now have some time series data\, the (utterly crap) diagnostic patch that generates it\, two time series runs for about a minute\, and the gnuplot commands to draw a graph\, and a postscript form of the graph (output.ps)\, all in the attached tarball. I should say that the perl interpreters with the diag patch are really unreliable\, they often hang when running the "server" version in the earlier tarball; to get the two timeseries I've plotted\, I had to ctrl-c and rerun about 5 times:-)

You're doing buffered IO in a signal handler\, not-crashing 1 out of 5 times is not a bad score.

Good luck tracking things down\, I'm off home now. I'll intermittently check email over the weekend\, and catch up properly on Monday.

Masking the signal in the main thread appears to make the problem go away\, which would confirm my theory about the signal getting overdelivered to the main thread. That's not a generic solution though.

Leon

p5pRT commented 9 years ago

From @Leont

On Sat\, Nov 22\, 2014 at 12:03 AM\, Craig A. Berry \craig\.a\.berry@gmail\.com wrote:

I worked my way up to a $count of 500 with 100 threads on both Mac OS X and VMS and it never fails. Those are two threading implementations that are completely different from each other and (as far as I know) from Linux. This was with blead\, so I guess it's possible we fixed something else since 5.18.x\, but it seems more likely that Linux thread handling is what's different.

If they don't prefer delivery to the main thread\, then they will not see this issue (assuming my hypothesis is correct). POSIX leaves that behavior unspecified.

Actually this might offer a clean way out. The original problem existed on *BSD and Darwin but not on Linux. If my script is also non-problematic on *BSD\, the solution may be to only mask on bsdish systems.

I know very little except what I've read at \< http​://www.linuxprogrammingblog.com/all-about-linux-signals?page=show>. It does say that in addition to a process-wide signal mask there is a separate signal mask per thread and that you call pthread_sigmask to set it. We aren't doing that when we block SIGCHLD around a system() call\, so it might be worth a try.

There are both per-process and per-thread signal queues but nothing else is both. The signal disposition (handler) is per-process\, the signal masks are per-thread. Did I mention here be dragons?

Leon
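
The statement that signal masks are per-thread can be checked with a short C sketch. This is an illustration only, not a proposed patch: the function names are invented here, and it assumes a POSIX threads implementation (build with something like `cc -pthread`). A worker thread blocks SIGCHLD for itself using pthread_sigmask, and the calling thread verifies that its own mask did not change.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Thread body: block SIGCHLD in this thread only, then report its mask. */
static void *block_chld_in_worker(void *arg) {
    int *blocked_in_worker = arg;
    sigset_t set, now;

    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    pthread_sigmask(SIG_BLOCK, &set, NULL);

    /* With a NULL set the call only queries the current mask. */
    pthread_sigmask(SIG_BLOCK, NULL, &now);
    *blocked_in_worker = sigismember(&now, SIGCHLD);
    return NULL;
}

/*
 * Hypothetical check: returns 1 if the worker ended up with SIGCHLD
 * blocked while the caller's own mask was left unchanged, 0 if not,
 * and -1 if the thread could not be created.
 */
int mask_is_per_thread(void) {
    pthread_t tid;
    sigset_t before, after;
    int blocked_in_worker = 0;

    pthread_sigmask(SIG_BLOCK, NULL, &before);
    if (pthread_create(&tid, NULL, block_chld_in_worker,
                       &blocked_in_worker) != 0)
        return -1;
    pthread_join(tid, NULL);
    pthread_sigmask(SIG_BLOCK, NULL, &after);

    return blocked_in_worker == 1
        && sigismember(&before, SIGCHLD) == sigismember(&after, SIGCHLD);
}
```

The same per-thread property is what makes "mask the signal in the main thread" behave differently from masking it in the thread that calls system(): each thread's mask decides only whether that thread is eligible to receive the signal.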

p5pRT commented 9 years ago

From d.white@imperial.ac.uk

Dear Leon\, all\,

On 21/11/14 23​:11\, Leon Timmermans wrote​:

On Fri\, Nov 21\, 2014 at 7:42 PM\, Duncan White \<d.white@imperial.ac.uk> wrote:

I wasn't criticising\, I just wanted to dispel the "here be dragons"
feeling I got from your hint:-)

This whole class of issues is fairly "here be dragons" if you ask me.

Fair enough​:-) Guess I was pretty lucky (only 7 years or more) up to now then​:-)

Getting the perl-level tid is a bit tricky I'm afraid\, but printing aTHX with "%p" would already be quite useful. Actually\, PL_curinterp == aTHX (true in the main thread) would likely be all the information we need.

I'm afraid I didn't have time to try adding this into the diag patch.

I now have some time series data\, the (utterly crap) diagnostic patch
that generates it\, two time series runs for about a minute\, and the
gnuplot commands to draw a graph\, and a postscript form of the graph
(output.ps)\, all in the attached tarball.  I should say that the perl
interpreters with the diag patch are really unreliable\, they often hang
when running the "server" version in the earlier tarball; to get the
two timeseries I've plotted\, I had to ctrl-c and rerun about 5 times:-)

You're doing buffered IO in a signal handler\, not-crashing 1 out of 5 times is not a bad score.

Absolutely​:-)

Good luck tracking things down\, I'm off home now.  I'll intermittently
check email over the weekend\, and catch up properly on Monday.

Masking the signal in the main thread appears to make the problem go away\, which would confirm my theory about the signal getting overdelivered to the main thread. That's not a generic solution though.

Leon

I wanted to check - I wasn't really sure whether you guys were needing anything more from me\, I feel I've given you all the info (including the diag timeseries patch) to reproduce it - with your brilliant short example making that easy.

I have a couple of small new things to add​: I explained this perl bug to a friend of mine\, we spent some time simplifying your short example Perl program\, I can reproduce the "too many pending signals (120)" error with the following ridiculously short version\, 4 lines long\, only 1 thread created​:

#! /usr/bin/env perl
use threads;
$SIG{CHLD} = sub { };
threads->create( sub { system 'sleep 0' for 1..300; } )->join;

On my machine\, this reliably crashes almost instantly. I may play with my two timeseries producing Perl interpreters (one with the sigchld-blocking-in-system patch in\, one without it) running the above code\, because this is so simple we might learn something new.

Another version of this\, only a few lines longer\, made the signal handler increase a count and print it out in the child thread​:

#! /usr/bin/env perl
use threads;
my $count=0;
$SIG{CHLD} = sub { $count++ };
threads->create(
  sub {
    foreach my $i (1..300)
    {
      system 'sleep 0';
      print "count=$count\n";
    }
  }
)->join;
print "$count sigchilds handled\n";

At first\, I forgot to make $count "​: shared" (and use threads​::shared)\, and thus ended up with two $count's\, one in the main program\, the other in the explicitly created thread. Interestingly\, printing the count in the thread's sub shows it gradually increasing\, i.e. the signal handler was being invoked in the explicit thread. After the thread exits (with final $count between 10 and 30\, say)\, the main program's $count is still 0\, which then increments to 1 (somewhere around the thread death/join time).

Of course\, when I made it : shared\, the only change was that the final value was the final-inner value + 1\, i.e. if the last thing printed inside the explicit thread was 22\, then it printed "23 sigchilds handled" at the end.

I wondered whether this tells us anything new about signal handling in Perl.

The final thing to say is that I have worked round my production system's problem by writing my own Perl my_system() function\, and calling that instead of system(). That's just a temporary fix to make stuff work for me\, of course. I still think you guys need to diagnose and then fix Perl so others don't experience this again.

cheers Duncan