borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/
Other
11.13k stars 741 forks source link

test runs hang frequently at the end #4350

Closed ThomasWaldmann closed 5 years ago

ThomasWaldmann commented 5 years ago

happens in 1.1-maint branch (e.g. 1.1.9), not sure about other branches.

tox runs on travis-ci and also in the vagrant machines frequently, but not always hang at the end (short before the test results are output). high cpu, no output.

could be related to coverage extension maybe?

also check if some version upgrade of testing tools / libs is causing this.

happens since ~4 weeks or so.

ThomasWaldmann commented 5 years ago

After it hung, I hit Ctrl-C once:

$ py.test --fulltrace -n {env:XDISTN:16} -rs --benchmark-skip --pyargs {posargs:borg.testsuite}
................................s..s.................s...............................s......................s.......... [  8%]
..............sssssssss.......ssssssssss.......x...........sssssssss..ssssssssssssssssssss....ssssssssss.sssssss.ss.... [ 17%]
ssssss.ssss..sssssssss...ssssssssss...............................................ssssssss.s..sssssssssssssssssssssssss [ 25%]
sssss.sss.sssss.ss.ss.ss................................................sssssssss.....s..............................s. [ 34%]
ssss.ssss.....s.s.s..s.s...s..s..s.s.s.s.....................s...s.s.s................................................. [ 43%]
....................................................................................................................... [ 51%]
...............................ss.ss.sssss............................................................................. [ 60%]
....................................................................................................................... [ 69%]
.s....s......................................................................................sssssss.s.ss.............. [ 77%]
....................................................................................................................... [ 86%]
.........................................................s......s....s...................s............................. [ 95%]
...........................................x.......
^CERROR: KEYBOARDINTERRUPT

=================================================== short test summary info ===================================================
SKIPPED [3] .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py:509: cannot properly setup and execute test without birthtime
SKIPPED [1] .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py:181: needs a 16MB fs mounted on /tmp/borg-mount
SKIPPED [2] .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py:1238: Linux capabilities test, requires fakeroot >= 1.20.2
SKIPPED [1] .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py:1259: xattr not supported on this system or on this version offakeroot
SKIPPED [3] .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py:525: cannot properly setup and execute test without birthtime
SKIPPED [150] /home/tw/w/borg/.tox/py36/lib/python3.6/site-packages/_pytest/unittest.py:93: no borg.exe available
SKIPPED [49] /home/tw/w/borg/.tox/py36/lib/python3.6/site-packages/pytest_benchmark/plugin.py:396: Benchmarks are skipped (--benchmark-skip was used).
SKIPPED [1] .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py:3345: only works locally
SKIPPED [1] .tox/py36/lib/python3.6/site-packages/borg/testsuite/platform.py:113: requires übel user
SKIPPED [1] /home/tw/w/borg/.tox/py36/lib/python3.6/site-packages/_pytest/unittest.py:93: OS X only test
SKIPPED [1] .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py:3341: only works locally
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

config = <_pytest.config.Config object at 0x7f98888897b8>, doit = <function _main at 0x7f9888d64400>

    def wrap_session(config, doit):
        """Skeleton command line program"""
        session = Session(config)
        session.exitstatus = EXIT_OK
        initstate = 0
        try:
            try:
                config._do_configure()
                initstate = 1
                config.hook.pytest_sessionstart(session=session)
                initstate = 2
>               session.exitstatus = doit(config, session) or 0

.tox/py36/lib/python3.6/site-packages/_pytest/main.py:209: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = <_pytest.config.Config object at 0x7f98888897b8>, session = <Session borg>

    def _main(config, session):
        """ default command line protocol for initialization, session,
        running tests and reporting. """
        config.hook.pytest_collection(session=session)
>       config.hook.pytest_runtestloop(session=session)

.tox/py36/lib/python3.6/site-packages/_pytest/main.py:249: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtestloop'>, args = (), kwargs = {'session': <Session borg>}, notincall = set()

    def __call__(self, *args, **kwargs):
        if args:
            raise TypeError("hook calling supports only keyword arguments")
        assert not self.is_historic()
        if self.spec and self.spec.argnames:
            notincall = (
                set(self.spec.argnames) - set(["__multicall__"]) - set(kwargs.keys())
            )
            if notincall:
                warnings.warn(
                    "Argument(s) {} which are declared in the hookspec "
                    "can not be found in this hook call".format(tuple(notincall)),
                    stacklevel=2,
                )
>       return self._hookexec(self, self.get_hookimpls(), kwargs)

.tox/py36/lib/python3.6/site-packages/pluggy/hooks.py:289: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f9888d15710>, hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/home/tw/w/borg/.tox/py36/lib/python3.6/site-packag...9884d0ca58>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f9884d0cc18>>]
kwargs = {'session': <Session borg>}

    def _hookexec(self, hook, methods, kwargs):
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook, methods, kwargs)

.tox/py36/lib/python3.6/site-packages/pluggy/manager.py:68: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/home/tw/w/borg/.tox/py36/lib/python3.6/site-packag...9884d0ca58>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f9884d0cc18>>]
kwargs = {'session': <Session borg>}

    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
        methods,
        kwargs,
>       firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
    )

.tox/py36/lib/python3.6/site-packages/pluggy/manager.py:62: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f9884d0ca58>

    def pytest_runtestloop(self):
        self.sched = self.config.hook.pytest_xdist_make_scheduler(
            config=self.config, log=self.log
        )
        assert self.sched is not None

        self.shouldstop = False
        while not self.session_finished:
>           self.loop_once()

.tox/py36/lib/python3.6/site-packages/xdist/dsession.py:115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f9884d0ca58>

    def loop_once(self):
        """Process one callback from one of the workers."""
        while 1:
            if not self._active_nodes:
                # If everything has died stop looping
                self.triggershutdown()
                raise RuntimeError("Unexpectedly no active workers available")
            try:
>               eventcall = self.queue.get(timeout=2.0)

.tox/py36/lib/python3.6/site-packages/xdist/dsession.py:129: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <queue.Queue object at 0x7f9884bc9438>, block = True, timeout = 2.0

    def get(self, block=True, timeout=None):
        '''Remove and return an item from the queue.

        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until an item is available. If 'timeout' is
        a non-negative number, it blocks at most 'timeout' seconds and raises
        the Empty exception if no item was available within that time.
        Otherwise ('block' is false), return an item if one is immediately
        available, else raise the Empty exception ('timeout' is ignored
        in that case).
        '''
        with self.not_empty:
            if not block:
                if not self._qsize():
                    raise Empty
            elif timeout is None:
                while not self._qsize():
                    self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = time() + timeout
                while not self._qsize():
                    remaining = endtime - time()
                    if remaining <= 0.0:
                        raise Empty
>                   self.not_empty.wait(remaining)

/usr/lib/python3.6/queue.py:173: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7f9884cc9300>, 0)>, timeout = 1.9999981840001055

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E                   KeyboardInterrupt

/usr/lib/python3.6/threading.py:299: KeyboardInterrupt
ThomasWaldmann commented 5 years ago

note: pytest-timeout package can make it time out instead of hang. pytest --timeout=900

fantasya-pbem commented 5 years ago

Note: In Travis CI my recent PR #4477 build on Python 3.6 had this issue too (only 3.6).

ThomasWaldmann commented 5 years ago

@fantasya-pbem it happens very frequently, but root cause is unknown yet.

ThomasWaldmann commented 5 years ago

I recently tried downgrading misc. packages (pytest, pluggy, pytest-xdist), but it didn't help.

ThomasWaldmann commented 5 years ago

added -v and used XDISTN with 1, this resulted in:

[gw0] [ 29%] PASSED .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py::RemoteArchiverTestCase::test_export_tar_strip_components_links 
.tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py::RemoteArchiverTestCase::test_extract_capabilities 
[gw0] [ 29%] SKIPPED .tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py::RemoteArchiverTestCase::test_extract_capabilities 
.tox/py36/lib/python3.6/site-packages/borg/testsuite/archiver.py::RemoteArchiverTestCase::test_extract_hardlinks 

It seems to always (? or at least mostly) hang at RemoteArchiverTestCase::test_extract_hardlinks.

Seen it for py35 and py36.

ThomasWaldmann commented 5 years ago

Had a look at that test, looks mostly harmless.

For now, I just merged #4481 into master and we'll see better where it is hanging.

ThomasWaldmann commented 5 years ago

master branch at 6dd14914b339ebabc42fb4856834c1a35ea2aa26,

pytest -v -k "Remote and test_extract_hardlinks"

always hangs in test_extract_hardlinks. Ctrl-C once gives a traceback.

ThomasWaldmann commented 5 years ago

after splitting the hardlinks extraction test into 2, it became clear the issue only happens for partial extraction.

ThomasWaldmann commented 5 years ago

it's a bug. the chunks needed for hardlink slaves are not preloaded (if the hardlink master is not also extracted), but the code later expects that they are, which makes it hang.

some details while trying to fix this:

most chunks preloaded in: Archiver.do_extract via archive.iter_items(filter, preload=True) (but this does not preload unmatched hardlink master chunks [bug])

code in extract_helper fixes up item.chunks list for hardlink slaves without an already extracted hardlink master.

expects preloaded chunks: Archive.extract_item when calling self.pipeline.fetch_many(ids, is_preloaded=True).

ThomasWaldmann commented 5 years ago

master fix: #4539

backport for 1.1-maint: #4542