NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License
16.68k stars 13.13k forks source link

Nix build blocked by deadlocked python tests #217247

Open davidak opened 1 year ago

davidak commented 1 year ago

Describe the bug

I used nixpkgs-review to build a PR that affect many packages (https://github.com/NixOS/nixpkgs/pull/216403).

After 15 hours, the CPU was still at 100% load, but the temperature low, as if it is not doing anything.

[root@gaming:~]# top

top - 13:02:36 up 13 days, 20:17,  1 user,  load average: 32.16, 32.26, 32.18
Tasks: 354 total,   5 running, 349 sleeping,   0 stopped,   0 zombie
%Cpu(s): 95.2 us,  4.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  32027.9 total,  24085.9 free,   3354.6 used,   4587.4 buff/cache
MiB Swap:  32768.0 total,  26143.9 free,   6624.0 used.  28058.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                 
1501010 nixbld14  20   0 5432076  88508  17160 R 733.3   0.3   4247:32 python3.11                                                                                                              
1533870 nixbld23  20   0 6134444  75340  15304 R 730.3   0.2   4236:56 python3.10                                                                                                              
1622385 nixbld18  20   0 3866172 427812  33696 R  72.7   1.3 344:42.42 python3.11                                                                                                              
1622408 nixbld20  20   0 3936212 707792  34208 R  48.7   2.2 339:52.25 python3.10         

There are 4 python package builds that seem to be stuck.

They are running pytest.

root        1272  0.0  0.0 445100   616 ?        Ssl  Feb01   0:00 nix-daemon --daemon
root     1362239  1.1  0.1 2331936 61520 ?       Ssl  01:25   7:38  \_ nix-daemon 1362236
nixbld14 1480363  0.0  0.0   7544  1320 ?        Ss   01:52   0:00      \_ bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
nixbld14 1501010  639  0.2 5432076 88508 ?       Rl   01:58 4203:02      |   \_ /nix/store/r85cm22i6058g4zj76w05rzvqwil6d2g-python3-3.11.1/bin/python3.11 -m pytest -k not test_tv_inpa
nixbld23 1505345  0.0  0.0   7536  1320 ?        Ss   01:59   0:00      \_ bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
nixbld23 1533870  642  0.2 6134444 75340 ?       Rl   02:04 4192:28      |   \_ /nix/store/0pyymzxf7n0fzpaqnvwv92ab72v3jq8d-python3-3.10.9/bin/python3.10 -m pytest -k not test_tv_inpa
nixbld20 1569160  0.0  0.0   7880  1328 ?        Ss   02:12   0:00      \_ bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
nixbld20 1622408 53.8  2.1 3936212 707792 ?      R    02:31 336:25      |   \_ /nix/store/0pyymzxf7n0fzpaqnvwv92ab72v3jq8d-python3-3.10.9/bin/python3.10 -m pytest
nixbld18 1571036  0.0  0.0   7880  1328 ?        Ss   02:14   0:00      \_ bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
nixbld18 1622385 54.6  1.3 3866172 427812 ?      R    02:31 341:11          \_ /nix/store/r85cm22i6058g4zj76w05rzvqwil6d2g-python3-3.11.1/bin/python3.11 -m pytest

but the build is stuck at the same test for hours...

[4/69/77 built (12 failed), 1989 copied (17709.2/17716.9 MiB), 4213.5 MiB DL] building python3.10-cvxpy-1.2.3 (pytestCheckPhase): .........

2 processes only do futex calls and the other 2 sched_yield

:: (nixbld13) → /tmp/nix-build-python3.10-OpenSfM-unstable-2022-03-10.drv-0
  UID     PID    PPID STIME     TIME COMMAND
30013 3067079 3055397 20:48 00:00:00 bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
30013 3101891 3067079 20:53 01:32:21 /nix/store/0pyymzxf7n0fzpaqnvwv92ab72v3jq8d-python3-3.10.9/bin/python3.10 -m pytest

strace -ff -p 3101891
...
[pid 3101891] sched_yield( <unfinished ...>
[pid 3102726] <... sched_yield resumed>) = 0
[pid 3102725] sched_yield( <unfinished ...>
[pid 3101891] <... sched_yield resumed>) = 0
^C

[root@gaming:~]# strace -f -c -p 3101891
strace: Process 3101891 attached with 3 threads
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   20.935565           8   2546114           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00   20.935565           8   2546114           total
:: (nixbld14) → /tmp/nix-build-python3.10-qutip-4.7.1.drv-0
  UID     PID    PPID STIME     TIME COMMAND
30014 3067093 3055397 20:48 00:00:00 bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
30014 3139076 3067093 20:56 04:31:17 /nix/store/0pyymzxf7n0fzpaqnvwv92ab72v3jq8d-python3-3.10.9/bin/python3.10 -c import qutip.testing; qutip.testing.run()

strace -ff -p 3139076
...
[pid 3139643] futex(0x1c7fd44, FUTEX_WAIT_PRIVATE, 595632, NULL <unfinished ...>
[pid 3139645] futex(0x1c7fd44, FUTEX_WAIT_PRIVATE, 595632, NULL <unfinished ...>
[pid 3139642] futex(0x1c7fd44, FUTEX_WAIT_PRIVATE, 595632, NULL <unfinished ...>
[pid 3139646] futex(0x1c7fd44, FUTEX_WAIT_PRIVATE, 595632, NULL <unfinished ...>
[pid 3139647] futex(0x1c7fd44, FUTEX_WAKE_PRIVATE, 2147483647) = 4
[pid 3139645] <... futex resumed>)      = 0
[pid 3139643] <... futex resumed>)      = 0
[pid 3139642] <... futex resumed>)      = 0
[pid 3139646] <... futex resumed>)      = 0
[pid 3139646] futex(0x121a2e4, FUTEX_WAIT_PRIVATE, 1955808, NULL <unfinished ...>
[pid 3139642] futex(0x121a2e4, FUTEX_WAIT_PRIVATE, 1955808, NULL <unfinished ...>
[pid 3139645] futex(0x121a2e4, FUTEX_WAIT_PRIVATE, 1955808, NULL <unfinished ...>
[pid 3139643] futex(0x121a2e4, FUTEX_WAIT_PRIVATE, 1955808, NULL <unfinished ...>
[pid 3139652] futex(0x121a2e4, FUTEX_WAKE_PRIVATE, 2147483647) = 2
[pid 3139642] <... futex resumed>)      = 0
[pid 3139646] <... futex resumed>)      = 0
[pid 3139645] <... futex resumed>)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 3139643] <... futex resumed>)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 3139641] futex(0x1c7fd44, FUTEX_WAIT_PRIVATE, 595640, NULL^Cstrace: Process 3139076 detached

[root@gaming:~]# strace -f -c -p 3139076
strace: Process 3139076 attached with 16 threads
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.286391          66      4322      1070 futex
------ ----------- ----------- --------- --------- ----------------
100.00    0.286391          66      4322      1070 total
:: (nixbld16) → /tmp/nix-build-python3.11-OpenSfM-unstable-2022-03-10.drv-0
  UID     PID    PPID STIME     TIME COMMAND
30016 3067100 3055397 20:48 00:00:00 bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
30016 3100028 3067100 20:53 01:34:09 /nix/store/r85cm22i6058g4zj76w05rzvqwil6d2g-python3-3.11.1/bin/python3.11 -m pytest

strace -ff -p 3100028
...
[pid 3100638] sched_yield( <unfinished ...>
[pid 3100637] <... sched_yield resumed>) = 0
[pid 3100638] <... sched_yield resumed>) = 0
[pid 3100028] sched_yield( <unfinished ...>
[pid 3100638] sched_yield( <unfinished ...>
[pid 3100637] sched_yield( <unfinished ...>
[pid 3100638] <... sched_yield resumed>) = 0
[pid 3100028] <... sched_y^C

[root@gaming:~]# strace -f -c -p 3100028
strace: Process 3100028 attached with 3 threads
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   17.474801           8   2085520           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00   17.474801           8   2085520           total
:: (nixbld17) → /tmp/nix-build-python3.11-cvxpy-1.2.3.drv-0
  UID     PID    PPID STIME     TIME COMMAND
30017 3067103 3055397 20:48 00:00:00 bash -e /nix/store/6xg259477c90a229xwmb53pdfkn6ig3g-default-builder.sh
30017 3086650 3067103 20:51 05:15:56 /nix/store/r85cm22i6058g4zj76w05rzvqwil6d2g-python3-3.11.1/bin/python3.11 -m pytest -k not test_tv_inpainting and not test_diffcp_sdp_example and not test_huber and not test_partial_problem ./cvxpy

strace -ff -p 3086650
...
[pid 3088301] futex(0x1e54ff4, FUTEX_WAIT_PRIVATE, 715744, NULL <unfinished ...>
[pid 3088293] futex(0x1e54ff4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 3088301] <... futex resumed>)      = 0
[pid 3088301] futex(0x1e4b784, FUTEX_WAIT_PRIVATE, 715744, NULL <unfinished ...>
[pid 3088292] futex(0x1e4b784, FUTEX_WAIT_PRIVATE, 715744, NULL <unfinished ...>
[pid 3088298] futex(0x1e4b784, FUTEX_WAKE_PRIVATE, 2147483647) = 2
[pid 3088301] <... futex resumed>)      = 0
[pid 3088292] <... futex resumed>)      = 0
[pid 3088301] futex(0x1e54ff4, FUTEX_WAIT_PRIVATE, 715752, NULL <unfinished ...>
[pid 3088305] futex(0x1e54ff4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 3088301] <... futex resumed>)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 3088303] futex(0x1e4b784, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 3088297] futex(0x1e54ff4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
^C

[root@gaming:~]# strace -f -c -p 3086650
strace: Process 3086650 attached with 23 threads
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.464360          71      6463      1839 futex
------ ----------- ----------- --------- --------- ----------------
100.00    0.464360          71      6463      1839 total

Maybe this is a pytest issue, but Nix should also do something to unblock such situations!

Workaround

To unblock the situation and let Nix build the rest of the packages, you have to kill the deadlocked python test processes.

error: builder for '/nix/store/r9msjyv7p81fl9llcz6hzp8czr5ml83w-python3.11-OpenSfM-unstable-2022-03-10.drv' failed with exit code 143;
       last 10 log lines:
       > opensfm/test/test_dense.py ..                                            [ 55%]
       > opensfm/test/test_geo.py .....                                           [ 57%]
       > opensfm/test/test_geometry.py ....                                       [ 59%]
       > opensfm/test/test_io.py .........                                        [ 62%]
       > opensfm/test/test_matching.py ......                                     [ 65%]
       > opensfm/test/test_multiview.py .........                                 [ 68%]
       > opensfm/test/test_pairs_selection.py .........                           [ 72%]
       > opensfm/test/test_reconstruction_alignment.py ........                   [ 75%]
       > opensfm/test/test_reconstruction_incremental.py Terminated
       > /nix/store/b09v23lirgvci3wzszh22mbkdfj0h0yq-stdenv-linux/setup: line 1582: pop_var_context: head of shell_variables not a function context

The "Terminated" appears.

Also the strange stdenv-linux/setup error.

It's interesting that even when i killed 3 of 4, the last one is not unblocked. So they seem not to get blocked on each other. But the issue also don't appear when building the package alone!

Steps To Reproduce

  1. nixpkgs-review pr 215689 (Builds 138 packages which need 17 GB!)

Not reproducible with e.g.:

nix-build . -A python310Packages.cvxpy -A python310Packages.qutip -A python310Packages.opensfm -A python311Packages.opensfm

Expected behavior

Nix should be able to build 100+ packages without errors.

nix-env --version output nix-env (Nix) 2.11.1

Additional context

Packages that get stuck in this way:

python3.11-OpenSfM-unstable-2022-03-10 python3.10-OpenSfM-unstable-2022-03-10 python3.10-qutip-4.7.1 python3.11-cvxpy-1.2.3 python3.10-cvxpy-1.2.3

Priorities

Add :+1: to issues you find important.

davidak commented 1 year ago

I had the same issue again with https://github.com/NixOS/nixpkgs/pull/199813. 2 processes building a package each of

python310Packages.cvxpy python311Packages.cvxpy python310Packages.qutip python311Packages.qutip

When i killed one, the rest are finished in minutes. Before it was stuck for 16 hours!

davidak commented 1 year ago

I had the issue again with https://github.com/NixOS/nixpkgs/pull/216399. 2 package builds where blocked. They did only sched_yield calls.

python310Packages.opensfm python311Packages.opensfm

Build not finished after 3.5 hours! Killing one build did not help to unblock the other.

The issue is also reproducible when building one package alone!

nix-build . -A python310Packages.opensfm

The pytests get stuck at 75%:

...
opensfm/test/test_dense.py ..                                            [ 55%]
opensfm/test/test_geo.py .....                                           [ 57%]
opensfm/test/test_geometry.py ....                                       [ 59%]
opensfm/test/test_io.py .........                                        [ 62%]
opensfm/test/test_matching.py ......                                     [ 65%]
opensfm/test/test_multiview.py .........                                 [ 68%]
opensfm/test/test_pairs_selection.py .........                           [ 72%]
opensfm/test/test_reconstruction_alignment.py ........                   [ 75%]
opensfm/test/test_reconstruction_incremental.py Terminated

When i look into it with strace, it has 5 threads. Maybe they create the deadlock.

[root@gaming:~]# strace -f -c -p 2468163
strace: Process 2468163 attached with 5 threads
^Cstrace: Process 2468163 detached
strace: Process 2468290 detached
strace: Process 2468291 detached
strace: Process 2468292 detached
strace: Process 2475812 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   20.227928           7   2786017           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00   20.227928           7   2786017           total

When i started the build again, it worked successfully.

I also learned that deadlocks are a common issue in Python. So maybe we have to report them to the affected packages.

https://pythonspeed.com/articles/python-multiprocessing/ https://rachelbythebay.com/w/2011/06/07/forked/ https://github.com/python/cpython/issues/50970 https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods https://github.com/orgs/python/projects/12/views/1

roberth commented 1 year ago

A futex can only be used to communicate with child processes until they exec, and Nix always execs the builder, so Nix can not be involved in this deadlock. I'll move this issue to nixpkgs.

davidak commented 1 year ago

@roberth Nix could still handle such situations better, e.g. detecting deadlocks and cancel builds after a timeout.

roberth commented 1 year ago

It can already cancel a build after log lines stop appearing, with max-silent-time.

I don't think detecting the deadlock itself is feasible without significant overhead.