apache / buildstream

BuildStream, the software integration tool
https://buildstream.build/
Apache License 2.0
82 stars 28 forks source link

Terminating jobs doesn't work #1613

Open abderrahim opened 2 years ago

abderrahim commented 2 years ago

Interrupt jobs doesn't work, I noticed this with fetch jobs before (where it seems that bst hanged but git is still downloading in the background) and today I noticed it with build jobs too. Here are the logs: notice that it says STATUS Build terminating for all four build jobs, yet they continue until they end with SUCCESS Caching artifact.

User interrupted with ^C

Choose one of the following options:
  (c)ontinue  - Continue queueing jobs as much as possible
  (q)uit      - Exit after all ongoing jobs complete
  (t)erminate - Terminate any ongoing jobs and exit

Pressing ^C again will terminate jobs and exit

Choice: [continue]: t

Terminating all jobs at user request

[09:46:24][--:--:--][db2e4e4e][    main:sdk/vala-tarball.bst          ] STATUS  Build terminating
[09:46:24][--:--:--][b503e9e5][    main:sdk/gobject-introspection.bst ] STATUS  Build terminating
[09:46:24][--:--:--][a39842c0][    main:freedesktop-sdk.bst:components/libgudev.bst] STATUS  Build terminating
[09:46:24][--:--:--][ecec38b2][    main:sdk/enchant-2.bst             ] STATUS  Build terminating
[09:46:30][00:00:12][db2e4e4e][   build:sdk/vala-tarball.bst          ] SUCCESS Integrating sandbox
[09:46:30][--:--:--][db2e4e4e][   build:sdk/vala-tarball.bst          ] START   Staging sources
[09:46:30][00:00:11][ecec38b2][   build:sdk/enchant-2.bst             ] SUCCESS Integrating sandbox
[09:46:30][--:--:--][ecec38b2][   build:sdk/enchant-2.bst             ] START   Staging sources
[09:46:30][00:00:00][db2e4e4e][   build:sdk/vala-tarball.bst          ] SUCCESS Staging sources
[09:46:30][--:--:--][db2e4e4e][   build:sdk/vala-tarball.bst          ] START   Running commands

    export NOCONFIGURE=1;

    if [ -x ./configure ]; then true;
    elif [ -x ./autogen ]; then ./autogen;
    elif [ -x ./autogen.sh ]; then ./autogen.sh;
    elif [ -x ./bootstrap ]; then ./bootstrap;
    elif [ -x ./bootstrap.sh ]; then ./bootstrap.sh;
    else autoreconf -ivf .;
    fi
    ./configure --prefix=/usr \
    --exec-prefix=/usr \
    --bindir=/usr/bin \
    --sbindir=/usr/bin \
    --sysconfdir=/etc \
    --datadir=/usr/share \
    --includedir=/usr/include \
    --libdir=/usr/lib/x86_64-linux-gnu \
    --libexecdir=/usr/libexec \
    --localstatedir=/var \
    --sharedstatedir=/usr/com \
    Message contains 42 additional lines

[09:46:30][00:00:00][ecec38b2][   build:sdk/enchant-2.bst             ] SUCCESS Staging sources
[09:46:30][--:--:--][ecec38b2][   build:sdk/enchant-2.bst             ] START   Running commands

    export NOCONFIGURE=1;

    if [ -x ./configure ]; then true;
    elif [ -x ./autogen ]; then ./autogen;
    elif [ -x ./autogen.sh ]; then ./autogen.sh;
    elif [ -x ./bootstrap ]; then ./bootstrap;
    elif [ -x ./bootstrap.sh ]; then ./bootstrap.sh;
    else autoreconf -ivf .;
    fi
    ./configure --prefix=/usr \
    --exec-prefix=/usr \
    --bindir=/usr/bin \
    --sbindir=/usr/bin \
    --sysconfdir=/etc \
    --datadir=/usr/share \
    --includedir=/usr/include \
    --libdir=/usr/lib/x86_64-linux-gnu \
    --libexecdir=/usr/libexec \
    --localstatedir=/var \
    --sharedstatedir=/usr/com \
    Message contains 42 additional lines

[09:46:30][00:00:13][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] SUCCESS Integrating sandbox
[09:46:30][--:--:--][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] START   Staging sources
[09:46:30][00:00:00][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] SUCCESS Staging sources
[09:46:30][--:--:--][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] START   Running commands

    meson . _builddir --prefix=/usr \
    --bindir=/usr/bin \
    --sbindir=/usr/bin \
    --sysconfdir=/etc \
    --datadir=/usr/share \
    --includedir=/usr/include \
    --libdir=/usr/lib/x86_64-linux-gnu \
    --libexecdir=/usr/libexec \
    --localstatedir=/var \
    --sharedstatedir=/usr/com \
    --mandir=/usr/share/man \
    --infodir=/usr/share/info  --buildtype=plain -Dauto_features=enabled -Db_pie=true -Ddefault_library=shared -Db_lto=true -Db_lto_threads=${LTOJOBS} -Dtests=disabled -Dvapi=enabled -Dintrospection=enabled
    ninja -v -j ${NINJAJOBS} -C _builddir
    env DESTDIR="/buildstream-install" ninja -C _builddir install
    OPTS=()
    if ! "true"; then
      OPTS+=("-n")
    fi
    if ! "true"; then
      OPTS+=("-p")
    Message contains 13 additional lines

[09:46:31][00:00:12][b503e9e5][   build:sdk/gobject-introspection.bst ] SUCCESS Integrating sandbox
[09:46:31][--:--:--][b503e9e5][   build:sdk/gobject-introspection.bst ] START   Staging sources
[09:46:31][00:00:00][b503e9e5][   build:sdk/gobject-introspection.bst ] SUCCESS Staging sources
[09:46:31][--:--:--][b503e9e5][   build:sdk/gobject-introspection.bst ] START   Running commands

    meson . _builddir --prefix=/usr \
    --bindir=/usr/bin \
    --sbindir=/usr/bin \
    --sysconfdir=/etc \
    --datadir=/usr/share \
    --includedir=/usr/include \
    --libdir=/usr/lib/x86_64-linux-gnu \
    --libexecdir=/usr/libexec \
    --localstatedir=/var \
    --sharedstatedir=/usr/com \
    --mandir=/usr/share/man \
    --infodir=/usr/share/info  --buildtype=plain --auto-features=enabled -Dgtk_doc=true -Dcairo=disabled -Ddoctool=enabled
    ninja -j ${NINJAJOBS} -C _builddir
    env DESTDIR="/buildstream-install" ninja -C _builddir install
    OPTS=()
    if ! "true"; then
      OPTS+=("-n")
    fi
    if ! "true"; then
      OPTS+=("-p")
    Message contains 13 additional lines

[09:46:56][00:00:25][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] SUCCESS Running commands
[09:46:56][--:--:--][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] START   Caching artifact
[09:46:56][00:00:00][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] SUCCESS Caching artifact
[09:46:56][00:00:40][a39842c0][   build:freedesktop-sdk.bst:components/libgudev.bst] SUCCESS freedesktop-sdk/components-libgudev/a39842c0-build.12749.log
[09:47:13][00:00:42][ecec38b2][   build:sdk/enchant-2.bst             ] SUCCESS Running commands
[09:47:13][--:--:--][ecec38b2][   build:sdk/enchant-2.bst             ] START   Caching artifact
[09:47:13][00:00:00][ecec38b2][   build:sdk/enchant-2.bst             ] SUCCESS Caching artifact
[09:47:13][00:00:57][ecec38b2][   build:sdk/enchant-2.bst             ] SUCCESS gnome/sdk-enchant-2/ecec38b2-build.12749.log
[09:47:15][00:00:44][db2e4e4e][   build:sdk/vala-tarball.bst          ] SUCCESS Running commands
[09:47:15][--:--:--][db2e4e4e][   build:sdk/vala-tarball.bst          ] START   Caching artifact
[09:47:15][00:00:00][db2e4e4e][   build:sdk/vala-tarball.bst          ] SUCCESS Caching artifact
[09:47:15][00:00:59][db2e4e4e][   build:sdk/vala-tarball.bst          ] SUCCESS gnome/sdk-vala-tarball/db2e4e4e-build.12749.log
[09:47:16][00:00:45][b503e9e5][   build:sdk/gobject-introspection.bst ] SUCCESS Running commands
[09:47:16][--:--:--][b503e9e5][   build:sdk/gobject-introspection.bst ] START   Caching artifact
[09:47:16][00:00:00][b503e9e5][   build:sdk/gobject-introspection.bst ] SUCCESS Caching artifact
[09:47:16][00:01:01][b503e9e5][   build:sdk/gobject-introspection.bst ] SUCCESS gnome/sdk-gobject-introspection/b503e9e5-build.12749.log
[09:47:16][00:01:21][        ][    main:core activity                 ] WARNING Build Terminated

Pipeline Summary
    Total:       805
    Session:     805
    Fetch Queue: processed 7, skipped 697, failed 0 
    Build Queue: processed 2, skipped 469, failed 0 
gtristan commented 2 years ago

This must have changed with buildbox-run and buildbox-casd.

It's worth investigating if suspending jobs works to suspend building, or the uploading / downloading of an artifact (i.e. if we properly send SIGTSTP to the buildbox-casd/buildbox-run processes or not) - and consider whether we should be and if that is well supported.

abderrahim commented 2 years ago

I noticed something else today: terminating jobs works. I think this only happens in specific cases.

nanonyme commented 2 years ago

Maybe it's some race condition which results signal in being lost?

fishrockz commented 2 years ago

it definitely hangs horribly for trying to terminate ostree sources.

User interrupted with ^C

Choose one of the following options:
  (c)ontinue  - Continue queueing jobs as much as possible
  (q)uit      - Exit after all ongoing jobs complete
  (t)erminate - Terminate any ongoing jobs and exit

Pressing ^C again will terminate jobs and exit

Choice: [continue]: t

Terminating all jobs at user request

[--:--:--][20887c2e][    main:bootstrap/build/base-sdk/image-x86_64.bst] STATUS  Fetch terminating
^C[00:01:24][20887c2e][   fetch:bootstrap/build/base-sdk/image-x86_64.bst] BUG     Fetch

    An unhandled exception occured:

    Traceback (most recent call last):
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_signals.py", line 113, in terminator
        yield
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/utils.py", line 1398, in _call
        output, _ = process.communicate(timeout=1)
      File "/usr/lib64/python3.10/subprocess.py", line 1149, in communicate
        stdout, stderr = self._communicate(input, endtime, timeout)
      File "/usr/lib64/python3.10/subprocess.py", line 2026, in _communicate
        self.wait(timeout=self._remaining_time(endtime))
      File "/usr/lib64/python3.10/subprocess.py", line 1204, in wait
        return self._wait(timeout=timeout)
      File "/usr/lib64/python3.10/subprocess.py", line 1932, in _wait
        time.sleep(delay)
    buildstream._signals.TerminateException

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_common.py", line 441, in wrapper
        ret = self._cache[fun]
    AttributeError: 'Process' object has no attribute '_cache'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_pslinux.py", line 1661, in wrapper
        return fun(self, *args, **kwargs)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_common.py", line 444, in wrapper
        return fun(self)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_pslinux.py", line 1703, in _parse_stat_file
        with open_binary("%s/%s/stat" % (self._procfs_path, self.pid)) as f:
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_common.py", line 711, in open_binary
        return open(fname, "rb", **kwargs)
    FileNotFoundError: [Errno 2] No such file or directory: '/proc/11107/stat'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/__init__.py", line 361, in _init
        self.create_time()
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/__init__.py", line 717, in create_time
        self._create_time = self._proc.create_time()
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_pslinux.py", line 1661, in wrapper
        return fun(self, *args, **kwargs)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_pslinux.py", line 1873, in create_time
        ctime = float(self._parse_stat_file()['create_time'])
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/_pslinux.py", line 1668, in wrapper
        raise NoSuchProcess(self.pid, self._name)
    psutil.NoSuchProcess: process no longer exists (pid=11107)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_scheduler/jobs/job.py", line 441, in child_action
        result = self.child_process()  # pylint: disable=assignment-from-no-return
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_scheduler/jobs/elementjob.py", line 92, in child_process
        return self._action_cb(self._element)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_scheduler/queues/fetchqueue.py", line 77, in _fetch_not_original
        element._fetch(fetch_original=False)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/element.py", line 2185, in _fetch
        self.__sources.fetch()
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_elementsources.py", line 225, in fetch
        self.fetch_sources()
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_elementsources.py", line 254, in fetch_sources
        self._fetch_source(source)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_elementsources.py", line 435, in _fetch_source
        source._fetch()
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/source.py", line 802, in _fetch
        self.__do_fetch()
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/source.py", line 1289, in __do_fetch
        new_source.fetch(**kwargs)
      File "/home/will/projects/buildsystems/venvbuild/lib/python3.10/site-packages/bst_plugins_experimental/sources/ostree.py", line 159, in fetch
        self.call(
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/plugin.py", line 732, in call
        exit_code, _ = self.__call(
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/plugin.py", line 954, in __call
        exit_code, output = utils._call(args, cwd=cwd, env=env, stdin=stdin, stdout=stdout, stderr=stderr)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/utils.py", line 1386, in _call
        with _signals.suspendable(suspend_proc, resume_proc), _signals.terminator(kill_proc), subprocess.Popen(
      File "/usr/lib64/python3.10/contextlib.py", line 153, in __exit__
        self.gen.throw(typ, value, traceback)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/_signals.py", line 115, in terminator
        terminate_func()
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/utils.py", line 1374, in kill_proc
        _kill_process_tree(process.pid)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/buildstream/utils.py", line 1293, in _kill_process_tree
        proc = psutil.Process(pid)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/__init__.py", line 332, in __init__
        self._init(pid)
      File "/home/will/projects/buildsystems/venvbuild/lib64/python3.10/site-packages/psutil/__init__.py", line 373, in _init
        raise NoSuchProcess(pid, msg='process PID not found')
    psutil.NoSuchProcess: process PID not found (pid=11107)

[00:01:24][        ][    main:core activity                 ] WARNING Build Terminated

Pipeline Summary
    Total:       1
    Session:     1
    Fetch Queue: processed 0, skipped 1, failed 0 
    Build Queue: processed 0, skipped 0, failed 0 

This was just trying to build the bottom bootstrap in FD-SDK at the fetch stage

Which IIUC is not using buildbox..

As the boot strap is quite big this would have hung for a long time so i had to go and kill ostree with htop.

abderrahim commented 2 years ago

I noticed something about git too (mentioned in the issue description) so it's definitely not buildbox related. This most likely something with the threaded scheduler.

nanonyme commented 2 years ago

Looks like process already exited when it was tried to be terminated? Oh yeah, right, the trace comes because it was externally killed.

fishrockz commented 2 years ago

I am having issues with terminating still, this bug still looks open.