easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0
152 stars 202 forks source link

Poor startup time of 'eb' #3180

Open casparvl opened 4 years ago

casparvl commented 4 years ago

I've talked to @boegel at the 5th EasyBuild user meeting about poor startup times of eb. He mentioned it was something he was looking to improve as well and ask me to share what we encountered - hence this issue.

What we see on our systems:

On Cartesius:

[casparl@int1 ~]$ time eb --version
This is EasyBuild 4.1.1 (framework: 4.1.1, easyblocks: 4.1.1) on host int1.bullx.

real    0m4.421s
user    0m1.205s
sys 0m1.666s

On Lisa:

casparl@login3:~$ time eb --version
This is EasyBuild 4.1.1 (framework: 4.1.1, easyblocks: 4.1.1) on host login3.lisa.surfsara.nl.

real    0m1.539s
user    0m1.032s
sys 0m0.184s

It seems to vary a lot, but overall performance isn't great.

My colleague @paulmelis has investigated it and made some traces. I don't have the time to organize it right now, but will simply dump those in this ticket so that you have a starting point.

Short summary of what follows below: calls to subprocess.Popen with close_fds seem to be slow on particular systems. Somehow, the range of files this calls tries to close is much larger on Cartesius than on Lisa. On Cartesius, it is definitely a substantial part of the runtime, on Lisa, I'm not sure.

Below, Paul has identified a few cases where such calls are made, which may help you track down how this can be optimized.

casparvl commented 4 years ago

Let's do some system call tracing to see where the time it spent:

paulm@int3 13:59 /dev/shm$ time strace -c python -m easybuild.main
== temporary log file in case of crash /tmp/paulm/eb-V4WOTx/easybuild-FhwyW3.log
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.35    0.183438           8     22392     20846 open
 26.79    0.108382           5     20452      8605 stat
 14.53    0.058762           4     15535         6 lstat
  2.99    0.012085           4      2930           read
  2.40    0.009702           3      3359         1 fstat
  1.39    0.005628         375        15           clone
  1.34    0.005408           3      1663           close
  1.18    0.004769           5       963           munmap
  1.06    0.004291           4      1182           mmap
  0.87    0.003525           3      1068         1 readlink
  0.83    0.003359           2      1569       144 lseek
  0.20    0.000820           4       189       151 openat
  0.20    0.000789          10        76           getdents
  0.14    0.000569           6        89           mprotect
  0.13    0.000512           3       167           fcntl
  0.13    0.000510           4       119           poll
  0.08    0.000320           2       156           brk
  0.07    0.000284          15        19           wait4
  0.06    0.000244           4        69           rt_sigaction
  0.05    0.000208           1       163       101 access
  0.05    0.000199           3        68           write
  0.04    0.000174           5        33           pipe
  0.04    0.000166           4        41           sendto
  0.02    0.000083           4        22           getcwd
  0.01    0.000051          10         5           mkdir
  0.01    0.000050           3        18        14 ioctl
  0.01    0.000025           5         5         4 connect
  0.01    0.000022           4         6           socket
  0.01    0.000022           6         4           chdir
  0.00    0.000019           6         3         1 unlink
  0.00    0.000016           4         4           pipe2
  0.00    0.000012           4         3         3 statfs
  0.00    0.000011           1        11           uname
  0.00    0.000011           2         5           sched_getaffinity
  0.00    0.000008           4         2           getuid
  0.00    0.000007           4         2           geteuid
  0.00    0.000005           5         1           bind
  0.00    0.000005           5         1           chmod
  0.00    0.000004           4         1           umask
  0.00    0.000004           4         1           getgid
  0.00    0.000004           4         1           getegid
  0.00    0.000004           4         1           arch_prctl
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           symlink
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         1           futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.404507                 72421     29877 total

real    0m6.476s
user    0m2.579s
sys 0m4.077s

Hmmm, 22k calls to open(), although the time spent there isn't that high. Looking a bit closer into what is being opened:

paulm@int3 14:00 /dev/shm$ time strace -o /dev/shm/out.nofork -tt python -m easybuild.main
paulm@int3 14:01 /dev/shm$ wc -l /dev/shm/out.nofork 
72427 /dev/shm/out.nofork

# Are these files from EB itself?
# Btw, curious to see the different paths
paulm@int3 14:02 /dev/shm$ grep 'open("/hpc/eb/RedHatEnterpriseServer7/EasyBuild/3.9.0' /dev/shm/out.nofork | wc -l
4365
paulm@int3 14:03 /dev/shm$ grep 'open("/nfs/admin/hpc/sw/RedHatEnterpriseServer7/EasyBuild/3.9.0' /dev/shm/out.nofork |wc -l
1663

# Hmm, was expecting more open call to EB files, apparently its stuff from python itself
paulm@int3 14:04 /dev/shm$ grep 'open("/nfs/admin/hpc/sw/python-2.7.9' /dev/shm/out.nofork |wc -l
2743
paulm@int3 14:04 /dev/shm$ grep 'open("/hpc/sw/python-2.7.9' /dev/shm/out.nofork |wc -l
11131

And does eb call any subprocesses when it executes?

paulm@int3 14:05 /dev/shm$ time strace -o /dev/shm/out.fork -f -c python -m easybuild.main
== temporary log file in case of crash /tmp/paulm/eb-yXQoJy/easybuild-jlqwSX.log
...
real    2m35.075s
user    0m35.437s
sys 2m30.973s

paulm@int3 14:09 /dev/shm$ wc -l out
8472305 out

# So LOTS of system calls when also counting the subprocesses, what are those calls?
paulm@int3 14:10 /dev/shm$ time strace -f -c python -m easybuild.main
strace: Process 17333 attached
strace: Process 17334 attached
strace: Process 17335 attached
strace: Process 17355 attached
strace: Process 17373 attached
strace: Process 17392 attached
strace: Process 17473 attached
strace: Process 17474 attached
strace: Process 17476 attached
strace: Process 17477 attached
strace: Process 17478 attached
strace: Process 17497 attached
strace: Process 17498 attached
strace: Process 17499 attached
strace: Process 17500 attached
strace: Process 17501 attached
strace: Process 17502 attached
strace: Process 17503 attached
strace: Process 17504 attached
strace: Process 17505 attached
strace: Process 17506 attached
strace: Process 17524 attached
strace: Process 17525 attached
strace: Process 17526 attached
strace: Process 17527 attached
strace: Process 17528 attached
strace: Process 17529 attached
strace: Process 17531 attached
strace: Process 17551 attached
strace: Process 17552 attached
== temporary log file in case of crash /tmp/paulm/eb-ElNj0p/easybuild-Bek2G3.log
strace: Process 17555 attached
strace: Process 17556 attached
strace: Process 17557 attached
strace: Process 17558 attached
strace: Process 17559 attached
strace: Process 17580 attached
strace: Process 17581 attached
strace: Process 17582 attached
strace: Process 17583 attached
strace: Process 17584 attached
strace: Process 17585 attached
strace: Process 17586 attached
strace: Process 17587 attached
strace: Process 17588 attached
strace: Process 17589 attached
Usage: eb [options] easyconfig [...]
...
ERROR: Please provide one or multiple easyconfig files, or use software build options to make EasyBuild search for easyconfigs
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.59   25.463586           3   8390674   8388550 close
  0.90    0.233666          10     23973     22135 open
  0.64    0.166430           7     22636     10264 stat
  0.22    0.056606           4     15808         6 lstat
  0.21    0.053764         853        63        20 wait4
  0.13    0.034727       11576         3           select
  0.07    0.018179           5      3646           read
  0.05    0.012877           4      3659         4 fstat
  0.03    0.008240           5      1736           mmap
  0.03    0.006570           6      1066           munmap
  0.02    0.004235          18       234       201 execve
  0.01    0.003631           2      1611       163 lseek
  0.01    0.003153           3      1125         1 readlink
  0.01    0.003124           8       379           mprotect
  0.01    0.003093          16       189       151 openat
  0.01    0.002798           4       636           rt_sigprocmask
  0.01    0.002658           7       391       243 access
  0.01    0.002582           3       738           poll
  0.01    0.001498           3       464           rt_sigaction
  0.01    0.001327           3       408           brk
  0.00    0.001195          31        39         2 futex
  0.00    0.001152          15        76           getdents
  0.00    0.001098           4       269           sendto
  0.00    0.000920           3       284         2 fcntl
  0.00    0.000881           3       306           write
  0.00    0.000342          11        30           chdir
  0.00    0.000301           6        54           pipe
  0.00    0.000295           7        45           clone
  0.00    0.000293           5        60           dup2
  0.00    0.000237           3        78           getuid
  0.00    0.000208           3        71           getgid
  0.00    0.000205           3        66        45 ioctl
  0.00    0.000205           3        71           getegid
  0.00    0.000203           3        72           geteuid
  0.00    0.000192          10        20        16 connect
  0.00    0.000180          30         6           faccessat
  0.00    0.000126           4        33           arch_prctl
  0.00    0.000122           5        27           getcwd
  0.00    0.000120           4        30           uname
  0.00    0.000110           5        21           socket
  0.00    0.000108           4        26           getrlimit
  0.00    0.000094          19         5           mkdir
  0.00    0.000089           4        20           rt_sigreturn
  0.00    0.000084          14         6         1 unlink
  0.00    0.000061          61         1           symlink
  0.00    0.000061           2        33           set_robust_list
  0.00    0.000049           5        10           getppid
  0.00    0.000042           4        10           getpid
  0.00    0.000039           4         9           set_tid_address
  0.00    0.000037           4        10           getpgrp
  0.00    0.000019           4         5         5 statfs
  0.00    0.000018           4         5           sched_getaffinity
  0.00    0.000011           3         4           pipe2
  0.00    0.000010           3         3           madvise
  0.00    0.000008           1         9         9 getsockname
  0.00    0.000006           6         1           chmod
  0.00    0.000005           5         1           bind
  0.00    0.000004           4         1           umask
------ ----------- ----------- --------- --------- ----------------
100.00   26.091874               8471256   8421818 total

real    2m58.595s
user    0m43.497s
sys 2m54.598s

Holy crap, almost 98% of the time spent on close() calls of which 8388550/8390674 = 99.97% return an error??

paulm@int3 14:13 /dev/shm$ time strace -o /dev/shm/out.fork -f -tt python -m easybuild.main
== temporary log file in case of crash /tmp/paulm/eb-GIde92/easybuild-ti2Xi6.log
...
real    3m51.837s
user    1m15.295s
sys 3m13.255s

# There's masses of close() calls to an incrementing file descriptor, 
# all the way up to 2^20!
17778 14:15:56.310182 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b0e8cd35190) = 17783
17783 14:15:56.310727 set_robust_list(0x2b0e8cd351a0, 24) = 0
17778 14:15:56.310788 close(11)         = 0
17778 14:15:56.310854 brk(NULL)         = 0xeb7000
17778 14:15:56.310881 brk(0xfac000)     = 0xfac000
17778 14:15:56.310915 read(10,  <unfinished ...>
17783 14:15:56.311415 close(6)          = 0
17783 14:15:56.311515 close(8)          = 0
17783 14:15:56.311620 close(10)         = 0
17783 14:15:56.311718 dup2(7, 1)        = 1
17783 14:15:56.311804 dup2(9, 2)        = 2
17783 14:15:56.311953 close(7)          = 0
17783 14:15:56.312050 close(9)          = 0
17783 14:15:56.312177 chdir("/dev/shm") = 0
17783 14:15:56.312334 close(3)          = 0
17783 14:15:56.312417 close(4)          = 0
17783 14:15:56.312453 close(5)          = 0
17783 14:15:56.312488 close(6)          = -1 EBADF (Bad file descriptor)
17783 14:15:56.312525 close(7)          = -1 EBADF (Bad file descriptor)
17783 14:15:56.312566 close(8)          = -1 EBADF (Bad file descriptor)
17783 14:15:56.312658 close(9)          = -1 EBADF (Bad file descriptor)
17783 14:15:56.312730 close(10)         = -1 EBADF (Bad file descriptor)
17783 14:15:56.312770 close(12)         = 0
17783 14:15:56.312806 close(13)         = -1 EBADF (Bad file descriptor)
17783 14:15:56.312842 close(14)         = -1 EBADF (Bad file descriptor)
17783 14:15:56.312878 close(15)         = -1 EBADF (Bad file descriptor)
17783 14:15:56.312914 close(16)         = -1 EBADF (Bad file descriptor)
...
17783 14:16:24.665785 close(1048571)    = -1 EBADF (Bad file descriptor)
17783 14:16:24.665811 close(1048572)    = -1 EBADF (Bad file descriptor)
17783 14:16:24.665837 close(1048573)    = -1 EBADF (Bad file descriptor)
17783 14:16:24.665863 close(1048574)    = -1 EBADF (Bad file descriptor)
17783 14:16:24.665890 close(1048575)    = -1 EBADF (Bad file descriptor)

# Yikes, why would these calls ever be made? 
paulm@int3 14:24 /dev/shm$ grep EBADF out.fork |wc -l
8388556

# Seems this whole circus is happening multiple times during an eb run
paulm@int3 14:25 /dev/shm$ grep "close(1048575)" out.fork 
17783 14:16:24.665890 close(1048575)    = -1 EBADF (Bad file descriptor)
17957 14:16:52.828486 close(1048575)    = -1 EBADF (Bad file descriptor)
17983 14:17:21.205566 close(1048575)    = -1 EBADF (Bad file descriptor)
18007 14:17:49.652933 close(1048575)    = -1 EBADF (Bad file descriptor)
18037 14:18:19.853487 close(1048575)    = -1 EBADF (Bad file descriptor)
18272 14:18:48.468929 close(1048575)    = -1 EBADF (Bad file descriptor)
18310 14:19:17.670173 close(1048575)    = -1 EBADF (Bad file descriptor)
18345 14:19:46.179046 close(1048575)    = -1 EBADF (Bad file descriptor)

Apparently, this isn't completely weird and the appears to be a common way to close all inherited file descriptors in a child process. See e.g. here and here. The latter issue has an interesting comment about a piece of Python code using os.closerange(). Does EB use that call (or reference RLIMIT_NOFILE)?

paulm@int3 14:30 /dev/shm$ find /hpc/eb/RedHatEnterpriseServer7/EasyBuild/3.9.0/ -name \*.py | xargs -iX grep -sH closerange X
paulm@int3 14:30 /dev/shm$ 
paulm@int3 14:31 /dev/shm$ find /hpc/eb/RedHatEnterpriseServer7/EasyBuild/3.9.0/ -name \*.py | xargs -iX grep -sH RLIMIT_NOFILE X
paulm@int3 14:31 /dev/shm$ 

Apparently not. So where's the close() calls coming from?

Took some trial and error to get both strace output and Python trace output to the same stream, as using -m trace -m easybuild.main doesn't work.

paulm@int3 14:54 /dev/shm$ cat traceeb.py 
#!/usr/bin/env python
import sys, runpy

def trace_calls(frame, event, arg):
    if event != 'call':
        return
    co = frame.f_code
    func_name = co.co_name
    line_no = frame.f_lineno
    filename = co.co_filename
    print 'Call to %s on line %s of %s' % (func_name, line_no, filename)

sys.settrace(trace_calls)

import easybuild.main

easybuild.main.main()

paulm@int3 14:55 /dev/shm$ strace -f -tt ./traceeb.py  > out.trace 2>&1 

Did some inspection on out.trace and the close() calls come from subprocess.Popen._close_fds(), which loops from 3 to MAXFD. So looks like it is standard practice when a Popen object is done which has been created with the close_fds=True option:

paulm@int3 15:06 /dev/shm$ cat t_open_closefds.py
#!/usr/bin/env python
import subprocess

p = subprocess.Popen(['/bin/ls'], close_fds=True)
p.communicate()

paulm@int3 15:07 /dev/shm$ strace -f ./t_open_closefds.py 2>&1 | grep EBAD | head -n 10
[pid 24923] close(3)                    = -1 EBADF (Bad file descriptor)
[pid 24923] close(5)                    = -1 EBADF (Bad file descriptor)
[pid 24923] close(6)                    = -1 EBADF (Bad file descriptor)
[pid 24923] close(7)                    = -1 EBADF (Bad file descriptor)
[pid 24923] close(8)                    = -1 EBADF (Bad file descriptor)
[pid 24923] close(9)                    = -1 EBADF (Bad file descriptor)
[pid 24923] close(10)                   = -1 EBADF (Bad file descriptor)
[pid 24923] close(11)                   = -1 EBADF (Bad file descriptor)
[pid 24923] close(12)                   = -1 EBADF (Bad file descriptor)
[pid 24923] close(13)                   = -1 EBADF (Bad file descriptor)

So what's the time taken to close all those file descriptors?

paulm@int3 15:07 /dev/shm$ cat ./t_open.py 
#!/usr/bin/env python
import subprocess

p = subprocess.Popen(['/bin/ls'])
p.communicate()
paulm@int3 15:08 /dev/shm$ cat ./t_open_closefds.py 
#!/usr/bin/env python
import subprocess

p = subprocess.Popen(['/bin/ls'], close_fds=True)
p.communicate()

paulm@int3 15:08 /dev/shm$ time ./t_open.py 
out  out.fork  out.nofork  out.trace  t_open_closefds.py  t_open.py  trace  traceeb.py

real    0m0.062s
user    0m0.015s
sys 0m0.030s
paulm@int3 15:08 /dev/shm$ time ./t_open_closefds.py 
out  out.fork  out.nofork  out.trace  t_open_closefds.py  t_open.py  trace  traceeb.py

real    0m0.427s
user    0m0.145s
sys 0m0.274s

Amazing, that's 6.9x slower. And how many calls when running eb?

paulm@int3 15:09 /dev/shm$ grep _close_fds out.trace 
[pid 23534] 14:50:33.407434 write(1, "Call to _close_fds on line 1181 "..., 83) = 83
[pid 23560] 14:51:04.989916 write(1, "Call to _close_fds on line 1181 "..., 83) = 83
[pid 23587] 14:51:36.370561 write(1, "Call to _close_fds on line 1181 "..., 83) = 83
[pid 23628] 14:52:07.992522 write(1, "Call to _close_fds on line 1181 "..., 83) = 83
[pid 23676] 14:52:58.400439 write(1, "Call to _close_fds on line 1181 "..., 83) = 83
[pid 23713] 14:53:29.894816 write(1, "Call to _close_fds on line 1181 "..., 83) = 83
[pid 23957] 14:54:16.855624 write(1, "Call to _close_fds on line 1181 "..., 83) = 83

That amounts to almost 3 seconds of run time out of almost 5 seconds, spent on closing file descriptors (that are not open).

paulm@int3 15:12 /dev/shm$ find /hpc/eb/RedHatEnterpriseServer7/EasyBuild/3.9.0/ -name \*.py | xargs -iX grep -sH close_fds X
/hpc/eb/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/vsc_base-2.8.4-py2.7.egg/vsc/utils/run.py:            'close_fds': True,
/hpc/eb/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/easybuild_framework-3.9.0-py2.7.egg/easybuild/tools/run.py:                                stdin=subprocess.PIPE, close_fds=True, executable=exec_cmd)
/hpc/eb/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/easybuild_framework-3.9.0-py2.7.egg/easybuild/tools/run.py:        p = Popen(cmd, shell=True, stdout=PIPE, stderr=STDOUT, stdin=PIPE, close_fds=True, executable="/bin/bash")

vsc_base is apparently some base package by UGent (https://github.com/hpcugent/vsc-base).

Let's see if we can get a callstack at the point the _close_fds() call is made:

paulm@int3 15:24 /dev/shm$ cat ./signalclosefds.py 
#!/usr/bin/env python
import sys, runpy

def trace_calls(frame, event, arg):
    if event != 'call':
        return
    co = frame.f_code
    func_name = co.co_name
    line_no = frame.f_lineno
    filename = co.co_filename
    if func_name == '_close_fds':
        raise ValueError('Doh!')
    print 'Call to %s on line %s of %s' % (func_name, line_no, filename)

sys.settrace(trace_calls)

import easybuild.main

easybuild.main.main()

paulm@int3 15:25 /dev/shm$ ./signalclosefds.py
Call to load_setitem on line 1193 of /hpc/sw/python-2.7.9/lib/python2.7/pickle.py
Call to load_build on line 1211 of /hpc/sw/python-2.7.9/lib/python2.7/pickle.py
Call to load_stop on line 1254 of /hpc/sw/python-2.7.9/lib/python2.7/pickle.py
Call to __init__ on line 83 of /hpc/sw/python-2.7.9/lib/python2.7/pickle.py
Traceback (most recent call last):
  File "./signalclosefds.py", line 17, in <module>
    import easybuild.main
  File "/nfs/admin/hpc/sw/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/easybuild_framework-3.9.0-py2.7.egg/easybuild/main.py", line 46, in <module>
    from easybuild.tools.build_log import EasyBuildError, print_error, print_msg, stop_logging
  File "/nfs/admin/hpc/sw/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/easybuild_framework-3.9.0-py2.7.egg/easybuild/tools/build_log.py", line 44, in <module>
    from easybuild.tools.version import VERSION, this_is_easybuild
  File "/nfs/admin/hpc/sw/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/easybuild_framework-3.9.0-py2.7.egg/easybuild/tools/version.py", line 72, in <module>
    git_rev = get_git_revision()
  File "/nfs/admin/hpc/sw/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/easybuild_framework-3.9.0-py2.7.egg/easybuild/tools/version.py", line 58, in get_git_revision
    import git
  File "/hpc/sw/python-2.7.9/lib/python2.7/site-packages/git/__init__.py", line 82, in <module>
    refresh()
  File "/hpc/sw/python-2.7.9/lib/python2.7/site-packages/git/__init__.py", line 73, in refresh
    if not Git.refresh(path=path):
  File "/hpc/sw/python-2.7.9/lib/python2.7/site-packages/git/cmd.py", line 230, in refresh
    cls().version()
  File "/hpc/sw/python-2.7.9/lib/python2.7/site-packages/git/cmd.py", line 551, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/hpc/sw/python-2.7.9/lib/python2.7/site-packages/git/cmd.py", line 1010, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/hpc/sw/python-2.7.9/lib/python2.7/site-packages/git/cmd.py", line 732, in execute
    **subprocess_kwargs
  File "/hpc/sw/python-2.7.9/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/hpc/sw/python-2.7.9/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
ValueError: Doh!

Surprise, surprise, the git module also uses close_fds:

paulm@int3 15:26 /dev/shm$ grep -C 10 close_fds /hpc/sw/python-2.7.9/lib/python2.7/site-packages/git/cmd.py
                  command, cwd, universal_newlines, shell)
        try:
            proc = Popen(command,
                         env=env,
                         cwd=cwd,
                         bufsize=-1,
                         stdin=istream,
                         stderr=PIPE,
                         stdout=stdout_sink,
                         shell=shell is not None and shell or self.USE_SHELL,
                         close_fds=is_posix,  # unsupported on windows
                         universal_newlines=universal_newlines,
                         creationflags=PROC_CREATIONFLAGS,
                         **subprocess_kwargs
                         )
        except cmd_not_found_exception as err:
            raise GitCommandNotFound(command, err)

        if as_process:
            return self.AutoInterrupt(proc, command)

paulm@int3 15:28 /dev/shm$ strace -f python -c 'import git' 2>&1 | grep EBAD
...
[pid 28970] close(16954)                = -1 EBADF (Bad file descriptor)
[pid 28970] close(16955)                = -1 EBADF (Bad file descriptor)
...

paulm@int3 15:28 /dev/shm$ time python -c 'import git'

real    0m0.981s
user    0m0.349s
sys 0m0.573s

The git module is used by easybuid/tools/version.py to determine the EB git revision being run. And that's more than a second spent on that piece of code to put the version info in the logs ;-)

paulm@int3 15:40 /dev/shm$ time PYTHONPATH=/nfs/admin/hpc/sw/RedHatEnterpriseServer7/EasyBuild/3.9.0/lib/python2.7/site-packages/easybuild_framework-3.9.0-py2.7.egg/easybuild/tools python -c 'import version; print version.this_is_easybuild()'
This is EasyBuild 3.9.0 (framework: 3.9.0, easyblocks: 0.0.UNKNOWN.EASYBLOCKS) on host int3.bullx.

real    0m1.337s
user    0m0.482s
sys 0m0.791s

By the way, no such difference when using close_fds=True on Lisa:

paulm@login3:~$ time ./t_open.py 

real    0m0.034s
user    0m0.012s
sys 0m0.016s

paulm@login3:~$ time ./t_open_closefds.py 

real    0m0.033s
user    0m0.008s
sys 0m0.020s

paulm@login3:~$ strace -f ./t_open_closefds.py 2>&1 | grep EBAD | tail -n 10
[pid 162761] close(4086)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4087)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4088)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4089)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4090)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4091)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4092)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4093)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4094)                = -1 EBADF (Bad file descriptor)
[pid 162761] close(4095)                = -1 EBADF (Bad file descriptor)

So here fds only up to 4095 are closed, hence the time difference.

paulmelis commented 4 years ago

I was just re-testing the t_open.py and t_open_closefds.py scripts on Cartesius and the difference appears to be gone with Python 3.7.5:

paulm@tcn901 16:07 ~$ time python ./t_open.py > /dev/null

real    0m0.071s
user    0m0.041s
sys 0m0.023s
paulm@tcn901 16:07 ~$ time python ./t_open_closefds.py > /dev/null

real    0m0.071s
user    0m0.046s
sys 0m0.018s

paulm@tcn901 16:10 ~$ python
Python 3.7.5 (default, Nov 27 2019, 09:05:07) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.sysconf("SC_OPEN_MAX")
1048576

Looking at the Python 3.4 release notes the issue was apparently well-known for a long time. In 3.4 new file descriptors where made non-inheritable (PEP046) effectively solving this issue.

So, when I did those tests above (about 7 months ago) did we indeed use a Python version before 3.4 (for EB)? That seems pretty unlikely, given that 3.4 was already released in March 2014.

Oh wait, was EB running on Python 2.x back then?

paulmelis commented 4 years ago

Seems the close FDs issue indeed doesn't occur anymore with EB running on 3.7.5 on Cartesius:

paulm@tcn901 16:18 ~$ EB_VERBOSE=1 eb --show-config
>> Considering 'python'...
>> 'python' version: 3.7.5, which matches Python 3 version requirement (>= 3.5)
>> Selected Python command: python (/sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/Python/3.7.5-foss-2018b/bin/python)
>> python -m easybuild.main --show-config

paulm@tcn901 16:17 ~$ time strace -o /dev/shm/out.fork -f -c python -m easybuild.main
...
paulm@tcn901 16:17 ~$ head -n 20 /dev/shm/out.fork
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 38.18    1.023438       12040        85           epoll_wait
 32.67    0.875701       22454        39        38 restart_syscall
  8.41    0.225525           7     30665     27267 open
  4.83    0.129411         128      1011         1 poll
  4.82    0.129218           6     23150     14253 stat
  2.49    0.066611         438       152        36 wait4
  1.52    0.040830       10208         4           select
  1.15    0.030714           4      8719         1 read
  1.08    0.028911           2     13436       174 lstat
  0.65    0.017410           3      6360         6 close
  0.53    0.014312           4      3405           mmap
  0.51    0.013550           8      1626         4 write
  0.47    0.012688           8      1580       550 openat
  0.26    0.007010           3      2799        33 fstat
  0.24    0.006560           5      1449           mprotect
  0.24    0.006496           8       860           pread64
  0.23    0.006187           6      1116           munmap
  0.20    0.005306           8       673           getdents
casparvl commented 4 years ago

@paulmelis How are you running with python 3? Our OS does not have a system Python 3... Do you have a local installation of python 3 / some module loaded?

[casparl@tcn900 ~]$ EB_VERBOSE=1 eb --version
>> Considering 'python'...
>> 'python' version: 2.7.5, which matches Python 2 version requirement (>= 2.6)
>> Selected Python command: python (/usr/bin/python)
>> python -m easybuild.main --version
This is EasyBuild 4.1.1 (framework: 4.1.1, easyblocks: 4.1.1) on host tcn900.bullx.
casparvl commented 4 years ago

Oh @boegel you're going to like this: Easybuild runs faster with Python 3. Nice argument to push people away from Python 2.7 ;-)

[casparl@tcn900 ~]$ time EB_VERBOSE=1 eb --version
>> Considering 'python'...
>> 'python' version: 2.7.5, which matches Python 2 version requirement (>= 2.6)
>> Selected Python command: python (/usr/bin/python)
>> python -m easybuild.main --version
This is EasyBuild 4.1.1 (framework: 4.1.1, easyblocks: 4.1.1) on host tcn900.bullx.

real    0m3.494s
user    0m2.266s
sys 0m1.181s
[casparl@tcn900 ~]$ module load Python/3.6.6-foss-2018b
[casparl@tcn900 ~]$ time EB_VERBOSE=1 eb --version
>> Considering 'python'...
>> 'python' version: 3.6.6, which matches Python 3 version requirement (>= 3.5)
>> Selected Python command: python (/sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/Python/3.6.6-foss-2018b/bin/python)
>> python -m easybuild.main --version
This is EasyBuild 4.1.1 (framework: 4.1.1, easyblocks: 4.1.1) on host tcn900.bullx.

real    0m1.299s
user    0m0.717s
sys 0m0.179s

Still not really fast though ;-)

paulmelis commented 4 years ago

@paulmelis How are you running with python 3? Our OS does not have a system Python 3... Do you have a local installation of python 3 / some module loaded?

Well, usually I do have a Python module loaded, as the system version is useless... Which is the reason I loaded it anyway for doing those tests.

But in this case that indeed influences what EB uses:

paulm@int3 17:44 ~$ m list
Currently Loaded Modulefiles:
 1) surfsara   2) 2019   3) EasyBuild/4.1.1   4) compilerwrappers   5) eb/4.1.1  
paulm@int3 17:44 ~$ EB_VERBOSE=1 eb 
>> Considering 'python'...
>> 'python' version: 2.7.5, which matches Python 2 version requirement (>= 2.6)
>> Selected Python command: python (/usr/bin/python)
>> python -m easybuild.main 
== temporary log file in case of crash /scratch-shared/paulm/eb-w7a9PR/easybuild-wv9N4C.log
Usage: eb [options] easyconfig [...]

...
paulm@int3 17:44 ~$ m load Python
paulm@int3 17:44 ~$ m list
Currently Loaded Modulefiles:
 1) surfsara                      11) XZ/5.2.4-GCCcore-7.3.0                      21) bzip2/1.0.6-GCCcore-7.3.0      
 2) 2019                          12) libxml2/2.9.8-GCCcore-7.3.0                 22) ncurses/6.1-GCCcore-7.3.0      
 3) EasyBuild/4.1.1               13) libpciaccess/0.14-GCCcore-7.3.0             23) libreadline/7.0-GCCcore-7.3.0  
 4) compilerwrappers              14) hwloc/1.11.10-GCCcore-7.3.0                 24) Tcl/8.6.8-GCCcore-7.3.0        
 5) eb/4.1.1                      15) OpenMPI/3.1.1-GCC-7.3.0-2.30                25) SQLite/3.24.0-GCCcore-7.3.0    
 6) GCCcore/7.3.0                 16) OpenBLAS/0.3.1-GCC-7.3.0-2.30               26) GMP/6.1.2-GCCcore-7.3.0        
 7) zlib/1.2.11-GCCcore-7.3.0     17) gompi/2018b                                 27) libffi/3.2.1-GCCcore-7.3.0     
 8) binutils/2.30-GCCcore-7.3.0   18) FFTW/3.3.8-gompi-2018b                      28) Python/3.7.5-foss-2018b        
 9) GCC/7.3.0-2.30                19) ScaLAPACK/2.0.2-gompi-2018b-OpenBLAS-0.3.1  
10) numactl/2.0.11-GCCcore-7.3.0  20) foss/2018b                                  
paulm@int3 17:44 ~$ EB_VERBOSE=1 eb 
>> Considering 'python'...
>> 'python' version: 3.7.5, which matches Python 3 version requirement (>= 3.5)
>> Selected Python command: python (/sw/arch/RedHatEnterpriseServer7/EB_production/2019/software/Python/3.7.5-foss-2018b/bin/python)
>> python -m easybuild.main 
== temporary log file in case of crash /scratch-shared/paulm/eb-b1ghq66m/easybuild-bjq4wozx.log

WARNING: Found one or more non-allowed loaded (EasyBuild-generated) modules in current environment:
* GCCcore/7.3.0
* GCCcore/7.3.0
...
paulmelis commented 4 years ago

I don't think I need to remind folks that Python 2.7 has reached the end of its life, and won't be maintained anymore, right?

paulmelis commented 4 years ago

real 0m3.494s user 0m2.266s sys 0m1.181s

real 0m1.299s user 0m0.717s sys 0m0.179s

Still not really fast though ;-)

Nice to see the system time go down from 34% to 14%

casparvl commented 4 years ago

Ok, so its no longer the close FDs issue. But the main time is still spent in I/O I think, if I understand the processes in the trace correctly. I've tried to check some lsof output, but I don't see a huge amount of files being opened (just some *.so's that you expect, related to python).

paulmelis commented 4 years ago

The 38% of time spent in epoll_wait is curious, that suggests a lot of waiting on something :)

paulm@int3 18:56 ~$ python -m cProfile -o log -s cumulative -m easybuild.main

paulm@int3 18:56 ~$ ls -l log
-rw-r-xr--+ 1 paulm paulm 365588 Jan 29 18:54 log

paulm@int3 18:57 ~$ cat printstats.py 
#!/usr/bin/env python
import sys, pstats
from pstats import SortKey

p = pstats.Stats(sys.argv[1])
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats()

paulm@int3 18:57 ~$ ./printstats.py log 
Wed Jan 29 18:54:26 2020    log

         705900 function calls (698584 primitive calls) in 2.996 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    312/1    0.003    0.000    2.996    2.996 {built-in method builtins.exec}
        1    0.000    0.000    2.996    2.996 runpy.py:195(run_module)
        1    0.000    0.000    2.985    2.985 runpy.py:62(_run_code)
        1    0.000    0.000    2.985    2.985 main.py:37(<module>)
        1    0.000    0.000    1.817    1.817 main.py:183(main)
   414/27    0.003    0.000    1.264    0.047 <frozen importlib._bootstrap>:978(_find_and_load)
   414/27    0.002    0.000    1.263    0.047 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
   361/30    0.002    0.000    1.254    0.042 <frozen importlib._bootstrap>:663(_load_unlocked)
   283/30    0.001    0.000    1.252    0.042 <frozen importlib._bootstrap_external>:722(exec_module)
       32    0.001    0.000    1.229    0.038 modules.py:1149(run_module)
       32    0.002    0.000    1.228    0.038 modules.py:1102(run_module)
       32    0.002    0.000    1.226    0.038 modules.py:723(run_module)
   517/56    0.000    0.000    1.222    0.022 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
        1    0.001    0.001    1.133    1.133 modules.py:822(check_loaded_modules)
       28    0.000    0.000    1.100    0.039 modules.py:637(show)
       32    0.000    0.000    1.092    0.034 subprocess.py:920(communicate)
       32    0.004    0.000    1.091    0.034 subprocess.py:1662(_communicate)
      426    0.002    0.000    1.078    0.003 selectors.py:402(select)
      426    1.076    0.003    1.076    0.003 {method 'poll' of 'select.poll' objects}
        1    0.000    0.000    0.642    0.642 easyblock.py:40(<module>)
   117/73    0.000    0.000    0.631    0.009 {built-in method builtins.__import__}
        5    0.000    0.000    0.617    0.123 __init__.py:25(<module>)
        1    0.000    0.000    0.538    0.538 easyconfig.py:39(<module>)
        1    0.000    0.000    0.473    0.473 options.py:1319(set_up_configuration)
      284    0.004    0.000    0.433    0.002 <frozen importlib._bootstrap_external>:793(get_code)
        1    0.000    0.000    0.404    0.404 build_log.py:33(<module>)
1511/1322    0.001    0.000    0.387    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
        1    0.000    0.000    0.366    0.366 constants.py:32(<module>)
        1    0.000    0.000    0.342    0.342 systemtools.py:30(<module>)
        1    0.000    0.000    0.321    0.321 filetools.py:40(<module>)
        1    0.000    0.000    0.316    0.316 fancylogger.py:75(<module>)
        1    0.000    0.000    0.305    0.305 __init__.py:41(<module>)
      401    0.004    0.000    0.262    0.001 <frozen importlib._bootstrap>:882(_find_spec)
  314/313    0.001    0.000    0.262    0.001 <frozen importlib._bootstrap>:576(module_from_spec)
      398    0.000    0.000    0.256    0.001 <frozen importlib._bootstrap_external>:1272(find_spec)
      398    0.006    0.000    0.256    0.001 <frozen importlib._bootstrap_external>:1240(_get_spec)
    28/27    0.000    0.000    0.251    0.009 <frozen importlib._bootstrap_external>:1040(create_module)
...
paulmelis commented 4 years ago

Btw, to really do this kind of performance profiling you would need to install and run EB on a local filesystem first and optimize there.

paulmelis commented 4 years ago

I haven't looked closely into the EB code, but glancing briefly at the profiling trace above a lot of time is spent loading Python modules (probably .eb files getting loaded as modules). I guess these aren't cached as .pyc files as is done normally, but maybe that can be an interesting avenue to check? If loading the modules is indeed a time-consuming part.

boegel commented 4 years ago

Thanks a lot for the detailed analysis @paulmelis & @casparvl!

It doesn't surprise me much that startup time on top of Python 3 is a lot better than it is with Python 2.7, I'm sure the Python core developers have done quite a bit of work on that.

EasyBuild 3.x is indeed Python 2 only, so there's no way those number were collected with Python 3.x.

EasyBuild will only parse easyconfig files (*.eb) when it needs to, i.e. when they're specified as easyconfigs to process, or when it needs to parse them to resolve dependencies, so that's not part of startup time at all.

The EasyBuild framework currently has 224 Python files, most of which will be loaded & parsed during startup. We also have 278 Python files in easyblocks, but I don't expect those to be loaded/parsed at startup, only as needed (not entirely sure about that though).

I'll try to find time to deep dive into this myself too, but a big factor of the slow startup probably stems from having EasyBuild installed on a shared filesystem, where loading a significant amount of small files is relatively expensive...

akesandgren commented 4 years ago

If i try the "python -m cProfile -o log -s cumulative -m easybuild.main" command i get Usage: cProfile.py [-o output_file_path] [-s sort] scriptfile [arg] ...

cProfile.py: error: no such option: -m

paulmelis commented 4 years ago

If i try the "python -m cProfile -o log -s cumulative -m easybuild.main" command i get Usage: cProfile.py [-o output_file_path] [-s sort] scriptfile [arg] ...

cProfile.py: error: no such option: -m

It's probably because I tested that with Python 3.x

akesandgren commented 4 years ago

Doh...

paulmelis commented 4 years ago

Doh...

Sorry, I just don't use Python 2.x anymore. But the same type of profiling is in 2.x, no doubt.

paulmelis commented 4 years ago

EasyBuild 3.x is indeed Python 2 only, so there's no way those number were collected with Python 3.x.

I think all my tests above where done with EB 4.x Edit: the original tests where with 3.x (this is the stuff that Caspar copied from an internal issue), but the later ones are with 4.x

boegel commented 4 years ago

@akesandgren You can't use -m twice, try this instead:

python -O -m cProfile -o log -s cumulative easybuild/main.py

(make sure you also use -O like eb does, that's important for good performance)

paulmelis commented 4 years ago

make sure you also use -O like eb does, that's important for good performance

Are there really that many asserts and the like in the EB sources? AFAIK the -O option merely disables such debug statements, it doesn't do any code optimization.

bartoldeman commented 2 years ago

3610 made this problem worse, since it tries $python_cmd -c "import easybuild.main" for python_cmd in ${EB_PYTHON} ${EB_INSTALLPYTHON} 'python' 'python3' 'python2' which is rather expensive, before even starting EB. Using just $python_cmd -c "import easybuild" is much faster, but not sure if this is sufficient to test?

boegel commented 2 years ago

@bartoldeman Testing with easybuild.framework has the same advantage as only testing with easybuild, while still checking whether EasyBuild framework is available (just checking with import easybuild doesn't), cfr. #3998