Open GoogleCodeExporter opened 9 years ago
Thanks for the bug report.
I would like to understand it a bit more. I.e. it's great that blocking SIGPROF
during fork helps your case, but I'm really curious why not blocking it causes
fork to spin. Is that because the signal always triggers during fork? But then
how is that possible?
Can you please submit a test program that causes this behavior? Or maybe
elaborate more on your findings?
Original comment by alkondratenko
on 21 Jul 2015 at 2:44
The signal does not always trigger during fork when run in release mode.
However, as far as I can tell it does always trigger with GDB/CGDB.
From my understanding, this errno is handled by the kernel by re-attempting the
interrupted syscall (reset $rax and move the instruction pointer back). Why
this gets trapped in a spin is beyond me though.
As I mentioned, I have as-of-yet been unable to create a reproducer case, but I
will keep looking into it.
Original comment by Sam.J.Ja...@gmail.com
on 21 Jul 2015 at 3:10
Hello, I am still unable to produce a reproducer that can be shared outside of
my company. One thing about this mainprog is that it links against >400 .so
shared libraries. I do not know if this has any relation to the hanging, but if
it does, it may explain why I have not been able to create a reproducer that
can be shared.
I understand if this is not enough information/not reasonably
reproducible/testable for you. If this is not something that can be looked
at/handled in the short term, please let me know so that I can communicate this
with Developer Services and move forward with my hacky fix internally.
Please let me know if there's any other information that I can give you.
Thank you.
Original comment by Sam.J.Ja...@gmail.com
on 23 Jul 2015 at 7:43
We actually had a number of other reports of systems getting weirdly stuck on
RHEL6 boxes. I was also thinking more about your report (which is a lot more
helpful than others btw).
Here's my theory. You have a larger app that runs multiple threads while
occasionally doing system() for something. So there's a significant chance
that the thread that does fork may receive SIGPROF. And let's assume for now
that some specific RHEL6 kernel, or maybe all of them, has that weird handling
of ERESTARTNOINTR.
I would like you to confirm a few things for me:
* have you tried a different OS or kernel? Have you seen this problem on
non-RHEL6?
* what are your exact versions of libc and kernel? I.e. in case I could try
getting those exact versions to try to reproduce this.
* please confirm that you are not actively running cpuprofiler, just malloc
with the profiler linked in. We have known issue 406 where apparently we set up
the timer (but I thought not the signal) even if profiling is not enabled.
Perhaps fixing that would be a better workaround for your case.
Thanks.
Original comment by alkondratenko
on 24 Jul 2015 at 3:52
Thanks for your quick reply!
* I have not tried this on other OS/kernels.
* Kernel Version: 2.6.32-504.23.4.el6.x86_64, LibC Version: 2.12
*
- In debug mode, this problem occurs both with and without defining CPUPROFILE. From my understanding, setitimer, which is called in StartTimer from RegisterThread, arms a timer that keeps sending SIGPROF until it is set back to zero. So even if cpuprofiler doesn't handle the signals, they still get sent.
- In release mode, this problem only occurs when cpuprofiler is turned on, but is not reliably reproducible.
Let me know if you have any other questions.
Original comment by Sam.J.Ja...@gmail.com
on 24 Jul 2015 at 6:58
Can you do a few runs on another OS, like RHEL 7?
Also, can you confirm whether your app is actively using multiple threads while
calling fork?
Original comment by alkondratenko
on 24 Jul 2015 at 8:14
I talked with the dev services guys; we do not have any non-RHEL6 machines that
have the main codebase on them for use. Since I have been unable to make a
shareable reproducer, I will not be able to test this on other OSes.
Yes, there are multiple running threads when the hanging is triggered.
Original comment by Sam.J.Ja...@gmail.com
on 24 Jul 2015 at 9:15
Thanks for the update. I plan to take a closer look at your case in 3 weeks.
BTW can you please report the exact package version of glibc on your box?
rpm -qi glibc will report it for you.
Original comment by alkondratenko
on 25 Jul 2015 at 5:11
Here's the info that looked relevant from running it:
Name : glibc
Version : 2.12
Vendor : Red Hat, Inc.
Release : 1.149.el6_6.9
Source RPM : glibc-2.12-1.149.el6_6.9.src.rpm
Original comment by Sam.J.Ja...@gmail.com
on 4 Aug 2015 at 1:55
Original issue reported on code.google.com by
Sam.J.Ja...@gmail.com
on 20 Jul 2015 at 5:18