Hanging in ARCH_FORK with CPUPROFILE

GoogleCodeExporter commented 9 years ago

I have found two ways to reproduce the problem.
The first occurs at random and intermittently when running in release mode.
The second seems to occur every time I run internal tools linked against 
libprofiler under gdb/cgdb.
I have been unable to produce a simplified reproducer that can be shared.

What steps will reproduce the problem?
1. compile code in debug mode, linked against libprofiler.so
2. run executable in cgdb
3. wait
4. interrupt execution and observe that:
    a. all but one thread are waiting in poll, or epoll, or pthread_cond_wait, or etc.
    b. one thread is stuck in a fork system call, on the ARCH_FORK line
    c. CPU is at 100%

What is the expected output? What do you see instead?
The program is expected to finish normally.
The program hangs 'forever' in a call to fork(), on the ARCH_FORK() macro, with 
$rax = -ERESTARTNOINTR.

What version of the product are you using? On what operating system?
2.2.1 / 2.4
RHEL6

Please provide any additional information below.
I have a quick (incomplete) fix (attached) for this, using pthread_atfork and 
pthread_sigmask to block SIGPROF before a fork and then unblock it 
afterwards. From my testing, this always prevents the hanging issue.
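
A minimal sketch of the approach described above. This is not the attached cpu_profiler_nohang.cpp, whose contents are not reproduced here. It registers pthread_atfork handlers that block SIGPROF in the forking thread before fork() and restore the previous mask afterwards, in both parent and child. The names InstallSigprofForkGuard, SigprofBlocked, and ForkOnce are hypothetical, and the guard only covers forks that actually run pthread_atfork handlers.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

namespace {

// Mask saved by the prepare handler. Signal masks are per-thread and the
// atfork handlers run on the thread that calls fork(), hence thread_local.
thread_local sigset_t g_saved_mask;

// Prepare handler: runs in the forking thread just before fork().
void BlockSigprofBeforeFork() {
  sigset_t block;
  sigemptyset(&block);
  sigaddset(&block, SIGPROF);
  int rc = pthread_sigmask(SIG_BLOCK, &block, &g_saved_mask);
  assert(rc == 0);
  (void)rc;
}

// Parent and child handler: restore the mask that was in effect before.
void RestoreMaskAfterFork() {
  pthread_sigmask(SIG_SETMASK, &g_saved_mask, nullptr);
}

}  // namespace

// Hypothetical helper: call once at startup. Returns 0 on success.
int InstallSigprofForkGuard() {
  return pthread_atfork(BlockSigprofBeforeFork, RestoreMaskAfterFork,
                        RestoreMaskAfterFork);
}

// Hypothetical helper: true if SIGPROF is blocked in the calling thread.
bool SigprofBlocked() {
  sigset_t cur;
  pthread_sigmask(SIG_SETMASK, nullptr, &cur);  // query only, mask unchanged
  return sigismember(&cur, SIGPROF) == 1;
}

// Demo: fork a child that exits at once; returns its exit status or -1.
int ForkOnce() {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) _exit(0);
  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling InstallSigprofForkGuard() once early in main() is enough; after any later fork() the calling thread's mask should be back to what it was, which SigprofBlocked() can confirm.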

I have communicated my fix with Developer Services at my job and they have 
indicated that it would be preferred if this solution could be patched into the 
gperftools source code.

While this is probably sufficient for the use case at my job, it feels 
incomplete for the purposes of patching into the gperftools codebase.

Original issue reported on code.google.com by Sam.J.Ja...@gmail.com on 20 Jul 2015 at 5:18

Attachments:

[hang in ARCH_FORK.png](https://storage.googleapis.com/google-code-attachments/gperftools/issue-701/comment-0/hang in ARCH_FORK.png)
cpu_profiler_nohang.cpp

GoogleCodeExporter commented 9 years ago

Thanks for the bug report.

I would like to understand it a bit more. I.e. it's great that blocking SIGPROF 
during fork helps your case, but I'm really curious why not having it causes 
fork to spin. Is that because the signal always triggers during fork? But then how 
is that possible?

Can you please submit some test program that causes this behavior? Or maybe 
elaborate more on your findings?

Original comment by alkondratenko on 21 Jul 2015 at 2:44

GoogleCodeExporter commented 9 years ago

The signal does not always trigger during fork when run in release mode. 
However, as far as I can tell, it does always trigger with GDB/CGDB.

From my understanding, this errno is handled by the kernel by re-attempting the 
interrupted syscall (reset $rax and move the instruction pointer back). Why 
this gets trapped in a spin is beyond me though.
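
The restart described above is invisible from user space: ERESTARTNOINTR is a kernel-internal code, so fork() should either succeed or fail with an ordinary errno. The sketch below only sets up the conditions discussed in this thread (a SIGPROF interval timer firing while the process forks) and counts the forks that succeed. It is not a known reproducer for the hang. ForkUnderSigprof, its parameters, and the tick counter are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

namespace {

// Number of SIGPROF deliveries seen by the handler.
volatile sig_atomic_t g_ticks = 0;

void OnSigprof(int) { g_ticks = g_ticks + 1; }

}  // namespace

// Arms ITIMER_PROF to fire every interval_us microseconds of CPU time,
// forks `forks` children that exit immediately, then disarms the timer.
// Returns the number of forks that succeeded, or -1 on setup failure.
int ForkUnderSigprof(int forks, int interval_us) {
  assert(interval_us > 0 && interval_us < 1000000);

  struct sigaction sa = {};
  sa.sa_handler = OnSigprof;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  if (sigaction(SIGPROF, &sa, nullptr) != 0) return -1;

  itimerval timer = {};
  timer.it_interval.tv_usec = interval_us;
  timer.it_value.tv_usec = interval_us;
  if (setitimer(ITIMER_PROF, &timer, nullptr) != 0) return -1;

  int ok = 0;
  for (int i = 0; i < forks; ++i) {
    // If SIGPROF arrives mid-fork, the kernel restarts the syscall itself;
    // the caller only ever sees the final result.
    pid_t pid = fork();
    if (pid == 0) _exit(0);
    if (pid < 0) continue;
    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
    }
    ++ok;
  }

  itimerval off = {};
  setitimer(ITIMER_PROF, &off, nullptr);  // zero value disarms the timer
  return ok;
}
```

For example, ForkUnderSigprof(10, 10000) forks ten times under a 10 ms profiling timer; on a healthy system every fork is expected to complete.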

As I mentioned, I have as-of-yet been unable to create a reproducer case, but I 
will keep looking into it.

Original comment by Sam.J.Ja...@gmail.com on 21 Jul 2015 at 3:10

GoogleCodeExporter commented 9 years ago

Hello, I am still unable to produce a reproducer that can be shared outside of 
my company. One thing about this mainprog is that it links against >400 .so 
shared libraries. I do not know if this has any relation to the hanging, but if 
it does, it may explain why I have not been able to create a reproducer that 
can be shared.

I understand if this is not enough information/not reasonably 
reproducible/testable for you. If this is not something that can be looked 
at/handled in the short term, please let me know so that I can communicate this 
with Developer Services and move forward with my hacky fix internally.

Please let me know if there's any other information that I can give you.

Thank you.

Original comment by Sam.J.Ja...@gmail.com on 23 Jul 2015 at 7:43

GoogleCodeExporter commented 9 years ago

We actually had a number of other reports of systems getting weirdly stuck on 
RHEL6 boxes. I was also thinking more about your report (which is a lot more 
helpful than others btw).

Here's my theory. You have a larger app that runs multiple threads while 
occasionally doing system() for something. So there's some significant chance 
that the thread that does the fork may receive SIGPROF. And let's assume for now 
that some specific RHEL6 kernel, or maybe all of them, has that weird handling 
of ERESTARTNOINTR.

I would like you to confirm a few things for me:

* have you tried a different OS or kernel? Have you seen this problem on 
non-RHEL6?

* what is your exact version of libc and kernel? I.e. in case I could try 
getting those exact versions to try to reproduce this.

* please confirm that you are not actively running cpuprofiler, just malloc 
with the profiler linked in. We have a known issue, 406, where apparently we set 
up the timer (but I thought not the signal) even if profiling is not enabled. 
Perhaps fixing that would be a better workaround for your case.

Thanks.

Original comment by alkondratenko on 24 Jul 2015 at 3:52

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Thanks for your quick reply!

* I have not tried this on other OS/kernels.
* Kernel Version: 2.6.32-504.23.4.el6.x86_64, LibC Version: 2.12
* On whether cpuprofiler is actively running:
    - In debug mode, this problem occurs both with and without defining CPUPROFILE. From my understanding, setitimer, which is called in StartTimer from RegisterThread, arms the timer that generates SIGPROF, and the timer keeps running until it is set to zero. So even if cpuprofiler doesn't handle the signals, they still get sent.
    - In release mode, this problem only occurs when cpuprofiler is turned on, but is not reliably reproducible.
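
The setitimer behaviour described above can be checked from inside the process. The following is a small sketch, not gperftools code: it arms ITIMER_PROF, queries it with getitimer, and disarms it by writing a zero value. ProfTimerArmed, ArmProfTimer, and DisarmProfTimer are hypothetical names, and the demo ignores SIGPROF so that the ticks are harmless.

```cpp
#include <cassert>
#include <csignal>
#include <sys/time.h>

// True if ITIMER_PROF currently has time remaining, i.e. it is armed.
bool ProfTimerArmed() {
  itimerval cur = {};
  if (getitimer(ITIMER_PROF, &cur) != 0) return false;
  return cur.it_value.tv_sec != 0 || cur.it_value.tv_usec != 0;
}

// Arms ITIMER_PROF to fire every interval_us microseconds of CPU time.
// SIGPROF's default action terminates the process, so this demo sets the
// disposition to SIG_IGN first; a profiler would install a handler instead.
bool ArmProfTimer(int interval_us) {
  assert(interval_us > 0 && interval_us < 1000000);
  if (signal(SIGPROF, SIG_IGN) == SIG_ERR) return false;
  itimerval t = {};
  t.it_interval.tv_usec = interval_us;
  t.it_value.tv_usec = interval_us;
  return setitimer(ITIMER_PROF, &t, nullptr) == 0;
}

// Writing an all-zero value stops the timer; no further SIGPROF is sent.
bool DisarmProfTimer() {
  itimerval zero = {};
  return setitimer(ITIMER_PROF, &zero, nullptr) == 0;
}
```

Calling ProfTimerArmed() at the top of main() in a binary linked against libprofiler, with CPUPROFILE unset, would show whether a profiling timer is already running before any profiling was requested.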

Let me know if you have any other questions.

Original comment by Sam.J.Ja...@gmail.com on 24 Jul 2015 at 6:58

GoogleCodeExporter commented 9 years ago

Can you do a few runs on another OS, like RHEL 7?

Also, can you confirm whether your app is actively utilizing multiple threads 
while calling fork?

Original comment by alkondratenko on 24 Jul 2015 at 8:14

GoogleCodeExporter commented 9 years ago

I talked with the dev services guys; we do not have any non-RHEL6 machines that 
have the main codebase on them for use. Since I have been unable to make a 
shareable reproducer, I will not be able to test this on other OSes.

Yes, there are multiple running threads when the hanging is triggered.

Original comment by Sam.J.Ja...@gmail.com on 24 Jul 2015 at 9:15

GoogleCodeExporter commented 9 years ago

Thanks for the update. I plan to take a closer look at your case in 3 weeks.

BTW can you please report the exact package version of glibc on your box? 
rpm -qi glibc will report it for you.
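
As a complement to rpm -qi, the running kernel release and the glibc version can also be read from inside the process. This is a sketch assuming a glibc-based Linux system; DescribeRuntime is a hypothetical name. It reports only the upstream glibc version (for example 2.12), not the distribution package release, so it does not replace the rpm query.

```cpp
#include <cassert>
#include <gnu/libc-version.h>
#include <string>
#include <sys/utsname.h>

// Returns a one-line description such as "kernel <release>, glibc <version>".
std::string DescribeRuntime() {
  utsname u = {};
  // uname() fills in the release string of the running kernel.
  std::string kernel = (uname(&u) == 0) ? u.release : "unknown";

  // gnu_get_libc_version() is glibc-specific and returns the version of the
  // C library the process is actually running against.
  const char* libc = gnu_get_libc_version();
  assert(libc != nullptr);

  return "kernel " + kernel + ", glibc " + libc;
}
```

Printing DescribeRuntime() at startup of the affected tool would record both values next to any hang report.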

Original comment by alkondratenko on 25 Jul 2015 at 5:11

GoogleCodeExporter commented 9 years ago

Here's the info that looked relevant from running it:

Name        : glibc
Version     : 2.12
Vendor      : Red Hat, Inc.
Release     : 1.149.el6_6.9
Source RPM  : glibc-2.12-1.149.el6_6.9.src.rpm

Original comment by Sam.J.Ja...@gmail.com on 4 Aug 2015 at 1:55
