@nicowilliams, to be honest I think we shouldn't get into the dark zone of undocumented glibc features, at least for now. I have no good use-case, no test-case, and no demand for this optimization.
I've tried the idea of a separate helper thread which calls vfork(), and the results are pretty good: a 26% slow-down compared to a pure vfork(); 13% of it comes just from pthread_create() + pthread_join().
# direct call to "tiny2" by vfork()
famzah@vbox64:~/svn/github/popen-noshell/performance_tests/wrapper$ ./run-tests.sh
real 0m4.646s
real 0m4.619s
real 0m4.601s
# vfork() via a separate helper thread
famzah@vbox64:~/svn/github/popen-noshell/performance_tests/threads$ ./run-tests.sh
real 0m5.779s
real 0m5.835s
real 0m5.854s
# Only pthread_create() + pthread_join(), without vfork():
real 0m0.764s
real 0m0.757s
real 0m0.763s
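For context, a minimal sketch of the helper-thread approach benchmarked above might look like this (vfork_exec_routine and spawn_via_helper_thread are hypothetical names, not taken from the actual test code):

#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Helper thread: vfork() + exec the command, then reap the child. */
static void *vfork_exec_routine(void *arg) {
    char **argv = arg;
    pid_t pid = vfork();
    if (pid == 0) {
        /* Child: stick to async-signal-safe calls until exec. */
        execv(argv[0], argv);
        _exit(127);             /* exec failed */
    }
    if (pid > 0)
        waitpid(pid, NULL, 0);  /* vfork() already paused us until exec/_exit */
    return NULL;
}

static int spawn_via_helper_thread(char **argv) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, vfork_exec_routine, argv) != 0)
        return -1;
    return pthread_join(tid, NULL);
}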
> @nicowilliams, to be honest I think we shouldn't get into the dark zone of undocumented glibc features, at least for now. I have no good use-case, no test-case, and no demand for this optimization.
Sure.
> I've tried the idea of a separate helper thread which calls vfork(), and the results are pretty good: a 26% slow-down compared to a pure vfork(); 13% of it comes just from pthread_create() + pthread_join().
Don't pthread_join() though! Detach that thread!
To my surprise, creating the threads "detached" slows things down! Both using the pthread_attr_t *attr argument to create the threads directly "detached", and using pthread_detach(), give the same benchmark results, which are slower than the original code that uses pthread_join():
famzah@vbox64:~/svn/github/popen-noshell/performance_tests/threads$ ./run-tests.sh
real 0m7.131s
real 0m7.190s
real 0m7.140s
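For reference, a minimal sketch of the detached variant measured above (reusing the hypothetical vfork_exec_routine from the earlier sketch):

#include <pthread.h>

/* Create the vfork() helper thread already detached, so nobody joins it. */
static int spawn_detached(char **argv) {
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    rc = pthread_create(&tid, &attr, vfork_exec_routine, argv);
    pthread_attr_destroy(&attr);
    return rc;  /* no pthread_join(): thread resources are reclaimed automatically */
}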
> To my surprise, creating the threads "detached" slows things down! Both using the pthread_attr_t *attr argument to create the threads directly "detached", and using pthread_detach(), give the same benchmark results, which are slower than the original code that uses pthread_join():
I am as surprised as you! Perhaps cleaning up detached threads requires a lot of overhead? OK, I give up :( Thanks for trying though!
I intend to find the time to finish an implementation of avfork() that uses a task queue. Assuming that pthread_cond_wait() and pthread_cond_signal() do not slow things down ridiculously, I expect that to be reasonably fast. But who knows! With these numbers you're showing, there may be no way to make it not suck other than to implement avfork() directly in glibc (which I'm not willing to do).
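For illustration only, a bare-bones sketch of such a task queue with one long-lived worker thread (hypothetical names, no error handling; this is not the actual avfork implementation):

#include <pthread.h>
#include <stddef.h>

struct task { struct task *next; char **argv; };

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static struct task *q_head;

/* Worker: sleeps in pthread_cond_wait() until a task arrives, then spawns it. */
static void *spawn_worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (q_head == NULL)
            pthread_cond_wait(&q_cond, &q_lock);
        struct task *t = q_head;
        q_head = t->next;
        pthread_mutex_unlock(&q_lock);
        vfork_exec_routine(t->argv);  /* hypothetical routine from the sketch above */
    }
    return NULL;
}

/* Producer: enqueue a task and wake the worker. */
static void submit(struct task *t) {
    pthread_mutex_lock(&q_lock);
    t->next = q_head;
    q_head = t;
    pthread_mutex_unlock(&q_lock);
    pthread_cond_signal(&q_cond);
}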
The only "special" condition on my test machine is that it has one virtual CPU core, because this allows me to benchmark multi-threaded apps more easily. Maybe running those tests on a single CPU core could be the reason for this unexpected slow down.
I am definitely not as familiar with clone and vfork as you guys are, but perhaps you can consider using CLONE_CLEAR_SIGHAND to let the kernel reset the signal handler table and disable any sort of signal handling by either the user or glibc itself, and use the syscall source code to manually invoke syscalls, so that you wouldn't have to use CLONE_VFORK to stop the parent at all, and wouldn't need to worry about cancellation and whatever internal global state of glibc.
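For what it's worth, CLONE_CLEAR_SIGHAND is a clone3() flag available since Linux 5.5; a minimal fork()-style sketch (assuming kernel headers that provide struct clone_args and a glibc that defines SYS_clone3):

#define _GNU_SOURCE
#include <linux/sched.h>   /* struct clone_args, CLONE_CLEAR_SIGHAND (Linux 5.5+) */
#include <sys/syscall.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* fork()-like clone3() that also resets every signal handler to SIG_DFL
 * in the child, so no user or glibc handler can run there. */
static pid_t fork_with_clear_sighand(void) {
    struct clone_args args;
    memset(&args, 0, sizeof(args));
    args.flags = CLONE_CLEAR_SIGHAND;
    args.exit_signal = SIGCHLD;  /* deliver SIGCHLD on child exit, like fork() */
    return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}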
On reuse of the stack, IMHO a close-on-exec pipe can be used to deal with it: create the close-on-exec pipe before clone(), and close its write end in the parent. Then use select to wait for POLLIN_SET (actually waiting for EPOLLHUP), or use poll or epoll to wait for EPOLLHUP.
This can potentially be done on another thread, or by the main program itself if it also utilizes a polling interface.
popen can also write an error message to the pipe on error.
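A minimal sketch of that pipe trick (hypothetical helper; error handling omitted):

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Blocks until the child exec()s (its O_CLOEXEC write end closes, raising
 * POLLHUP) or writes an error report, which shows up as POLLIN first. */
static short wait_for_exec(int pipefd[2]) {
    close(pipefd[1]);  /* parent keeps only the read end */
    struct pollfd pfd = { .fd = pipefd[0], .events = POLLIN };
    poll(&pfd, 1, -1);
    return pfd.revents;
}

/* Usage: int fds[2]; pipe2(fds, O_CLOEXEC); spawn the child so it inherits
 * fds[1]; then call wait_for_exec(fds) in the parent (or a helper thread). */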
Hi NobodyXu. I tried to grasp what you said but I'm too much out of context now. Furthermore, the standard posix_spawn() call should already be available on any up-to-date Linux distro, as it has been 4 years since it got merged into glibc. My tests show that posix_spawn() is as fast as my library. Therefore, developers should switch to posix_spawn(), which is maintained mainstream and should be more bug-free.
Sorry, I didn't know that you had given up on this topic. :D
Hi famzah
Sorry to bother you again, but I have successfully implemented @nicowilliams's idea of avfork and achieved a more responsive aspawn (source code; benchmarking is done via google/benchmark):
$ ll -h bench_aspawn_responsiveness.out
-rwxrwxr-x 1 nobodyxu nobodyxu 254K Oct 2 15:02 bench_aspawn_responsiveness.out*
$ uname -a
Linux pop-os 5.4.0-7642-generic #46~1598628707~20.04~040157c-Ubuntu SMP Fri Aug 28 18:02:16 UTC x86_64 x86_64 x86_64 GNU/Linux
$ ./a.out
2020-10-02T15:02:45+10:00
Running ./bench_aspawn_responsiveness.out
Run on (12 X 4100 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 9216 KiB (x1)
Load Average: 0.31, 0.36, 0.32
---------------------------------------------------------------------
Benchmark                         Time             CPU     Iterations
---------------------------------------------------------------------
BM_aspawn_no_reuse            18009 ns        17942 ns          38943
BM_aspawn/threads:1           14500 ns        14446 ns          48339
BM_vfork_with_shared_stack    46545 ns        16554 ns          44027
BM_fork                       54583 ns        54527 ns          12810
BM_posix_spawn               125061 ns        29091 ns          24483
The column "Time" is measured in terms of system clock, while "CPU" is measured in terms of per-process CPU time.
The aspawn API:
struct Stack_t {
    void *addr;
    size_t size;
};

typedef int (*aspawn_fn)(void *arg, int write_end_fd, void *old_sigset, void *user_data, size_t user_data_len);

/**
 * @return fd of the read end of the CLOEXEC pipe on success, otherwise (-errno).
 *
 * aspawn disables thread cancellation, then reverts it before returning.
 *
 * aspawn also masks all signals in the parent and resets the signal handlers in the child process.
 * Before aspawn returns in the parent, it reverts the signal mask.
 *
 * In the function fn, you can only use the syscalls declared in syscall/syscall.h.
 * Use of any glibc function, or of any function that modifies global/thread-local variables, is undefined behavior.
 */
int aspawn(pid_t *pid, struct Stack_t *cached_stack, size_t reserved_stack_sz,
           aspawn_fn fn, void *arg, void *user_data, size_t user_data_len);
By returning the read end of the CLOEXEC pipefd, the user of this library is able to receive error messages and to check whether the child process has finished using cached_stack, so that aspawn can reuse cached_stack.
It also allows the user to pass arbitrary data on the stack via user_data and user_data_len, which gets copied onto the top of the stack; thus the user does not have to allocate it separately on the heap, or risk mistakenly overwriting an object used in the child process.
To use a syscall, you need to include syscall/syscall.h, which defines the syscall routines used by the child process, including find_exe, psys_execve and psys_execveat.
The user will be able to reuse the stack by polling the fd returned by aspawn and waiting for it to HUP.
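A usage sketch of that flow (the header name, child function and stack size here are hypothetical, not taken from the actual repository):

#include <poll.h>
#include <unistd.h>
#include "aspawn.h"  /* hypothetical header exposing aspawn(), aspawn_fn, struct Stack_t */

/* Spawn a child, then wait for POLLHUP on the returned read end before
 * handing cached_stack to the next aspawn() call. */
static void spawn_and_reuse(struct Stack_t *cached_stack, aspawn_fn child_fn) {
    pid_t pid;
    int fd = aspawn(&pid, cached_stack, 1024, child_fn, NULL, NULL, 0);
    if (fd >= 0) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);  /* POLLHUP once the child exec()s or exits */
        close(fd);          /* cached_stack is now safe to reuse */
    }
}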
Compared to posix_spawn, aspawn has 3 advantages:
- aspawn allows the user to do anything in the child process before exec;
- aspawn can reuse the stack, while posix_spawn can't;
- aspawn doesn't block the parent thread.
The only downside is that aspawn_fn has to use syscall/syscall.h.
Other than that, I don't see any downsides to my approach.
I added a reference to your project in the README of "popen-noshell". Keep up the good work!
@famzah Thank you and I will keep on improving it :D
Hi all.
I'm sorry if this is the wrong thread for my question, but I'll give it a try.
What if I have popen on an old libc, and after that I must wait for input on the pipe, catch it, and check whether the child is done; if not, wait for input again. This is because the child blocks when the pipe fd's buffer is full.
The only thing I could find to check whether the pipe's fd buffer already has data is ioctl() (FIONREAD). But on quite old systems it returns -1.
How can I hack around it?
Thanks.
If you don't mind blocking, you can simply read on that fd.
If you do mind blocking, you can utilize select or mark the pipe as O_NONBLOCK.
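A minimal sketch of the non-blocking variant (assuming fd is the pipe's read end):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns >0 on data, 0 on EOF (the child closed its end, i.e. it is done),
 * and -1 when there is nothing to read yet. */
static ssize_t try_read_pipe(int fd, char *buf, size_t len) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
    ssize_t n = read(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -1;  /* no data yet: retry later, or select()/poll() first */
    return n;
}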
A note to my future self, if I get back to researching this again: