famzah / popen-noshell

A much faster popen() and system() implementation for Linux

Portability option to use vfork()? (or posix_spawn()) #11

Closed. nicowilliams closed this issue 7 years ago.

famzah commented 7 years ago

A note for my future self, if I get back to researching this again:

famzah commented 7 years ago

@nicowilliams, to be honest I think we shouldn't get into the dark zone of undocumented glibc features, at least for now. I have no good use case, no test case, and no demand for this optimization.

I've tried the idea with a separate helper thread which calls vfork(), and the results are pretty good: a 26% slow-down compared to a pure vfork(); 13% of that comes just from pthread_create() + pthread_join().

# direct call to "tiny2" by vfork()
famzah@vbox64:~/svn/github/popen-noshell/performance_tests/wrapper$ ./run-tests.sh 
real    0m4.646s
real    0m4.619s
real    0m4.601s

famzah@vbox64:~/svn/github/popen-noshell/performance_tests/threads$ ./run-tests.sh 
real    0m5.779s
real    0m5.835s
real    0m5.854s

# only pthread_create() + pthread_join(), without vfork():
real    0m0.764s
real    0m0.757s
real    0m0.763s
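
For reference, a minimal sketch of the helper-thread idea being benchmarked here, with hypothetical names (spawn_via_thread and friends are mine; the actual code in performance_tests/threads differs in details):

#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>

struct spawn_req { char *const *argv; pid_t pid; };

static void *spawner(void *p) {
    struct spawn_req *req = p;
    pid_t pid = vfork();              /* suspends only this helper thread */
    if (pid == 0) {
        execvp(req->argv[0], req->argv);
        _exit(127);                   /* only _exit() is safe after a failed exec */
    }
    req->pid = pid;
    return NULL;
}

pid_t spawn_via_thread(char *const argv[]) {
    struct spawn_req req = { argv, -1 };
    pthread_t t;
    if (pthread_create(&t, NULL, spawner, &req) != 0)
        return -1;
    pthread_join(t, NULL);            /* part of the 13% overhead measured above */
    return req.pid;
}
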
nicowilliams commented 7 years ago

> @nicowilliams, to be honest I think we shouldn't get into the dark zone of undocumented glibc features, at least for now. I have no good use case, no test case, and no demand for this optimization.

Sure.

> I've tried the idea with a separate helper thread which calls vfork(), and the results are pretty good: a 26% slow-down compared to a pure vfork(); 13% of that comes just from pthread_create() + pthread_join().

Don't pthread_join() though! Detach that thread!

famzah commented 7 years ago

To my surprise, creating the threads "detached" slows things down! Both using the pthread_attr_t *attr argument to create the threads directly "detached" and calling pthread_detach() give the same benchmark results, which are slower than the original code that uses pthread_join():

famzah@vbox64:~/svn/github/popen-noshell/performance_tests/threads$ ./run-tests.sh 
real    0m7.131s
real    0m7.190s
real    0m7.140s
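
For the record, the two detach variants compared, reusing the hypothetical spawner() and struct spawn_req from the sketch above (pick one; note that with a detached thread the parent needs some channel other than pthread_join() to learn the child pid):

#include <pthread.h>

void spawn_detached(struct spawn_req *req) {
    pthread_t t;
    pthread_attr_t attr;

    /* variant 1: create the thread detached via the attr argument */
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&t, &attr, spawner, req);
    pthread_attr_destroy(&attr);

    /* variant 2: create joinable, then detach */
    pthread_create(&t, NULL, spawner, req);
    pthread_detach(t);
}
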
nicowilliams commented 7 years ago

> To my surprise, creating the threads "detached" slows things down! Both using the pthread_attr_t *attr argument to create the threads directly "detached" and calling pthread_detach() give the same benchmark results, which are slower than the original code that uses pthread_join():

I am as surprised as you! Perhaps cleaning up detached threads requires a lot of overhead? OK, I give up :( Thanks for trying though!

I intend to find the time to finish an implementation of avfork() that uses a task queue. Assuming that pthread_cond_wait() and pthread_cond_signal() do not slow things down ridiculously, I expect that to be reasonably fast. But who knows! With these numbers you're showing, there may be no way to make it not suck other than to implement avfork() directly in glibc (which I'm not willing to do).
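
A minimal single-slot sketch of that task-queue idea as I read it (all names are assumptions, not an actual avfork() design): one long-lived worker thread does the vfork() on demand, so callers pay only for the condition-variable handoff.

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;
static char *const *q_argv;            /* one-slot "queue", for brevity */
static pid_t q_pid;
static bool q_pending, q_done;

static void *spawn_worker(void *unused) {  /* started once at program init */
    (void)unused;
    pthread_mutex_lock(&q_lock);
    for (;;) {
        while (!q_pending)
            pthread_cond_wait(&q_cond, &q_lock);
        pid_t pid = vfork();               /* suspends only this worker thread */
        if (pid == 0) {
            execvp(q_argv[0], q_argv);     /* strictly, POSIX allows only execve/_exit here */
            _exit(127);
        }
        q_pid = pid;
        q_pending = false;
        q_done = true;
        pthread_cond_broadcast(&q_cond);
    }
    return NULL;
}

pid_t avfork_submit(char *const argv[]) {
    pthread_mutex_lock(&q_lock);
    q_argv = argv;
    q_pending = true;
    q_done = false;
    pthread_cond_broadcast(&q_cond);
    while (!q_done)
        pthread_cond_wait(&q_cond, &q_lock);
    pid_t pid = q_pid;
    pthread_mutex_unlock(&q_lock);
    return pid;
}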

famzah commented 7 years ago

The only "special" thing about my test machine is that it has a single virtual CPU core, because this lets me benchmark multi-threaded apps more easily. Maybe running the tests on a single CPU core is the reason for this unexpected slowdown.

NobodyXu commented 4 years ago

I am definitely not as familiar with clone and vfork as you guys are, but perhaps you could consider using CLONE_CLEAR_SIGHAND to let the kernel reset the signal handler table and disable any signal handling installed by either the user or glibc itself, and invoke the syscalls manually from hand-written syscall stubs. Then you wouldn't have to use CLONE_VFORK to stop the parent at all, and wouldn't need to worry about cancellation or whatever internal global state glibc keeps.
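
Roughly what I mean, as a sketch (assumes Linux >= 5.5, where clone3() and CLONE_CLEAR_SIGHAND exist; glibc has no clone3() wrapper, so it goes through syscall()):

#define _GNU_SOURCE
#include <linux/sched.h>   /* struct clone_args, CLONE_CLEAR_SIGHAND */
#include <signal.h>        /* SIGCHLD */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* fork-like clone3() call; the child starts with every signal handler
 * reset to SIG_DFL, so no user or glibc handler can run inside it */
static pid_t fork_clear_sighand(void) {
    struct clone_args args;
    memset(&args, 0, sizeof(args));
    args.flags = CLONE_CLEAR_SIGHAND;
    args.exit_signal = SIGCHLD;        /* so the parent can wait() as usual */
    return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}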

NobodyXu commented 4 years ago

On reuse of the stack, IMHO a close-on-exec pipe can be used to deal with it:

Create the close-on-exec pipe before clone() and close its write end in the parent. Then use select to wait for POLLIN_SET (actually waiting for EPOLLHUP), or use poll or epoll to wait for (E)POLLHUP.

This can potentially be done on another thread, or by the main program itself if it also utilizes a polling interface.

popen can also write an error message to the pipe on error.
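
A minimal sketch of that pipe trick, with plain fork() and hypothetical names for brevity: the child's write end closes automatically on a successful exec, so the parent sees EOF/POLLHUP, and any byte that does arrive is an error report.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

pid_t spawn_with_status_pipe(char *const argv[], int *exec_errno) {
    int pfd[2];
    if (pipe2(pfd, O_CLOEXEC) == -1)
        return -1;
    pid_t pid = fork();                 /* clone()/vfork() in the real thing */
    if (pid == 0) {
        close(pfd[0]);                  /* child keeps only the write end */
        execvp(argv[0], argv);
        int err = errno;                /* exec failed: report errno */
        write(pfd[1], &err, sizeof err);
        _exit(127);
    }
    close(pfd[1]);                      /* parent keeps only the read end */
    if (pid == -1) {
        close(pfd[0]);
        return -1;
    }
    struct pollfd p = { .fd = pfd[0], .events = POLLIN };
    poll(&p, 1, -1);                    /* wakes on error data or on POLLHUP */
    *exec_errno = 0;
    read(pfd[0], exec_errno, sizeof *exec_errno);  /* stays 0 on EOF: exec succeeded */
    close(pfd[0]);
    return pid;
}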

famzah commented 4 years ago

Hi NobodyXu. I tried to grasp what you said, but I'm too far out of context now. Furthermore, the standard posix_spawn() function should already be available on any up-to-date Linux distro, as it has been 4 years since it was merged into glibc. My tests show that posix_spawn() is as fast as my library. Therefore, developers should switch to posix_spawn(), which is maintained mainstream and should be more bug-free.
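
For reference, a minimal posix_spawn() example (just an illustration of the call, not code from this repository); glibc implements it with a vfork-style clone internally, which is where the speed comes from:

#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

int main(void) {
    pid_t pid;
    char *argv[] = { "echo", "hello", NULL };
    int rc = posix_spawnp(&pid, "echo", NULL, NULL, argv, environ);
    if (rc != 0) {                 /* posix_spawn() returns an errno value */
        fprintf(stderr, "posix_spawnp: %s\n", strerror(rc));
        return 1;
    }
    int status;
    waitpid(pid, &status, 0);
    return 0;
}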

NobodyXu commented 4 years ago

Sorry, I didn't know that you had given up on this topic. :D

NobodyXu commented 4 years ago

Hi famzah,

Sorry to bother you again, but I have successfully implemented @nicowilliams's idea of avfork and achieved a more responsive aspawn (source code; benchmarking is done via google/benchmark):

$ ll -h bench_aspawn_responsiveness.out
-rwxrwxr-x 1 nobodyxu nobodyxu 254K Oct  2 15:02 bench_aspawn_responsiveness.out*

$ uname -a
Linux pop-os 5.4.0-7642-generic #46~1598628707~20.04~040157c-Ubuntu SMP Fri Aug 28 18:02:16 UTC  x86_64 x86_64 x86_64 GNU/Linux

$ ./bench_aspawn_responsiveness.out
2020-10-02T15:02:45+10:00
Running ./bench_aspawn_responsiveness.out
Run on (12 X 4100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
Load Average: 0.31, 0.36, 0.32
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_aspawn_no_reuse              18009 ns        17942 ns        38943
BM_aspawn/threads:1             14500 ns        14446 ns        48339
BM_vfork_with_shared_stack      46545 ns        16554 ns        44027
BM_fork                         54583 ns        54527 ns        12810
BM_posix_spawn                 125061 ns        29091 ns        24483

The "Time" column is measured with the wall clock, while "CPU" is per-process CPU time.

NobodyXu commented 4 years ago

Intro to aspawn

struct Stack_t {
    void *addr;
    size_t size;
};

typedef int (*aspawn_fn)(void *arg, int write_end_fd, void *old_sigset, void *user_data, size_t user_data_len);

/**
 * @return fd of the read end of the CLOEXEC pipe on success, otherwise (-errno).
 *
 * aspawn would disable thread cancellation, then revert it before returning.
 *
 * aspawn would also mask all signals in the parent and reset the signal handlers in the child process.
 * Before aspawn returns in the parent, it would revert the signal mask.
 *
 * In the function fn, you can only use the syscalls declared in syscall/syscall.h.
 * Use of any glibc function, or of any function that modifies global/thread-local variables, is undefined behavior.
 */
int aspawn(pid_t *pid, struct Stack_t *cached_stack, size_t reserved_stack_sz, 
           aspawn_fn fn, void *arg, void *user_data, size_t user_data_len);

By returning the read end of the CLOEXEC pipe, this library lets its user receive error messages and check whether the child process is done using cached_stack, so that aspawn can reuse cached_stack.

It also allows the user to pass arbitrary data on the stack via user_data and user_data_len, which gets copied onto the top of the stack, so the user does not have to allocate it separately on the heap or risk mistakenly overwriting an object used in the child process.

To make a syscall, you need to include syscall/syscall.h, which defines the syscall routines used by the child process, including find_exe, psys_execve and psys_execveat.

The user can reuse the stack by polling the fd returned by aspawn and waiting for it to hang up (POLLHUP).
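
To make the flow concrete, a hypothetical usage sketch pieced together from the declarations above (the real examples may differ; psys_execve is assumed here to mirror execve):

#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* runs in the child: may only call the psys_* wrappers from syscall/syscall.h */
static int child_fn(void *arg, int write_end_fd, void *old_sigset,
                    void *user_data, size_t user_data_len) {
    char *const *argv = arg;
    psys_execve(argv[0], argv, NULL);  /* signature assumed to mirror execve */
    return 1;                          /* reached only if the exec failed */
}

void demo(void) {
    struct Stack_t stack = { NULL, 0 };  /* let aspawn allocate the first stack */
    char *argv[] = { "/bin/true", NULL };
    pid_t pid;
    int fd = aspawn(&pid, &stack, 1024, child_fn, argv, NULL, 0);
    if (fd >= 0) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        poll(&p, 1, -1);               /* POLLHUP: child has exec'd, stack reusable */
        close(fd);
    }
}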

Advantages and disadvantages

Compared to posix_spawn, aspawn has 3 advantages:

The only downside is that aspawn_fn has to use syscall/syscall.h. Other than that, I don't see any downsides to my approach.

NobodyXu commented 4 years ago

Example code can be seen here

famzah commented 4 years ago

I added a reference to your project in the README of "popen-noshell". Keep up the good work!

NobodyXu commented 4 years ago

@famzah Thank you and I will keep on improving it :D

kotee4ko commented 3 years ago

Hi all.

I'm sorry if this is the wrong thread for my question, but I'll give it a try.

What if I have popen() on an old libc, and after that I must wait for input on the pipe, catch it, and check whether the child is done; if not, wait for input again.

This is because the child blocks when the pipe fd's buffer is full.

The only way I could find to check whether the pipe fd's buffer already has data is ioctl(), but on quite old systems it returns -1.

How can I hack around this?

Thanks.

NobodyXu commented 3 years ago

If you don't mind blocking, you can simply read on that fd.

If you do mind blocking, you can use select or mark the pipe as O_NONBLOCK.
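
For example, something like this generic sketch (not tied to this library): poll() reports readable data as POLLIN and a dead writer as POLLHUP, so you never need ioctl(FIONREAD) to avoid blocking.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* drain a pipe fd without ioctl(): block in poll(), not in read() */
void drain_pipe(int fd) {
    char buf[4096];
    for (;;) {
        struct pollfd p = { .fd = fd, .events = POLLIN };
        if (poll(&p, 1, -1) <= 0)
            break;
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0)                    /* 0 = EOF: the child closed its end */
            break;
        fwrite(buf, 1, (size_t)n, stdout);
    }
}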