Closed by asomers 8 years ago
Is this with parallel execution enabled?
Has this just started happening? Note that 0.12 has been out for a while already, so if this is a new issue, I suspect it was triggered by an environment change.
At $WORK, it started happening as soon as we upgraded to Kyua 0.12. On the FreeBSD Jenkins server, which I linked to, it happened as far back as Feb 2, and there's no earlier history. Here at $WORK, we didn't explicitly set parallelism in kyua.conf. It defaults to serial execution, right?
Correct; the default is still serial.
OK, so if it's with the upgrade then my theory on an environment change is unfounded.
I assume this happens when running the full FreeBSD test suite, right?
You also mention "under an hour" but in my experience FreeBSD test runs are faster than that. Is this within a VM maybe?
More details about the upgrade: we're running a custom version of FreeBSD that was forked from stable/10 at version 277174. The crash happened when we tried a general ports upgrade, which included upgrading Kyua from 0.11 to 0.12, but also included upgrading many other ports. Given kyua's small list of dependencies and the location of the stack trace, I'm skeptical that any of those other ports could be responsible. But for completeness's sake, here's the full list of kyua's dependent ports and their versions:
kyua 0.11 -> 0.12_1
sqlite3 3.8.10.1 -> 3.10.2_1
atf-0.21 -> 0.21 (no change)
lutok-0.4_6 -> 0.4_6 (no change)
pkgconf-0.9.7 -> 0.9.12_1
The crash happens during a full run of the FreeBSD test suite, plus our proprietary tests. In particular, the test run includes everything from https://svnweb.freebsd.org/base/projects/zfsd/head/tests/sys/cddl/zfs/tests/ . Altogether, these test runs last about 5 hours. The crash can happen during any test. Usually it happens during one of the ZFS tests, simply because they take the most time. But it's also happened during the pw(8) tests. We're running on bare metal on a 2-socket Haswell system.
Since this is happening in the FreeBSD Jenkins I'd like to revert the upgrade to 0.12 from the FreeBSD port until this gets fixed.
Reverting the port sounds reasonable. I may be able to look at it later today, but if you have an earlier chance, feel free to proceed.
This is very strange. I've been looking at the code and I cannot tell why erase() on that line would fail; what it's doing is not complicated and I cannot see why the iterator would be invalid. Also, I cannot reproduce this, so it'll be hard to track down.
It could be that the pid_t to int conversion is messing things up, though I don't think so because pid_t is defined as int32_t on amd64.
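To make the two suspicions above concrete, here is a minimal sketch, not the actual code in drivers/run_tests.cpp. The pid_table type and the forget_pid function are hypothetical names made up for illustration. It shows a compile-time check that pid_t fits in an int, and a lookup that refuses to call erase() on an end() iterator, which would be undefined behavior.

```cpp
#include <sys/types.h>

#include <map>
#include <string>

// Hypothetical stand-in for a table of running tests keyed by PID.
// This is an assumption for illustration, not Kyua's real data structure.
using pid_table = std::map<int, std::string>;

// The narrowing concern only matters if pid_t is wider than int.
static_assert(sizeof(pid_t) <= sizeof(int),
              "pid_t would not fit in an int on this platform");

// Removes the entry for 'pid'. Returns false if the PID is unknown instead
// of passing end() to erase(), which is undefined behavior.
bool forget_pid(pid_table& table, const pid_t pid) {
    const pid_table::iterator iter = table.find(static_cast<int>(pid));
    if (iter == table.end())
        return false;
    table.erase(iter);
    return true;
}
```

If a crash of this kind came from erasing an iterator that find() never matched, a guard like the one in forget_pid would turn the segfault into a reportable error. That is only one possible explanation, not a diagnosis of this bug.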
I'm also looking at running a full test run with Valgrind to see if that brings some light (of any form).
Valgrind reported nothing interesting. I also tried an AddressSanitizer-enabled build and, while this reports a bunch of issues in test programs (most of which come from atf-c++), the main kyua binary didn't report any.
@jmmv Thank you for researching it, but I would like to go ahead and roll the port back. Is that OK with you?
Yes. Sorry I didn't get to doing this earlier...
No worries, thank you!
Excellent. I have finally been able to reproduce this locally: I created a fake test suite composed of symlinks to the Kyua tests 100 times, and then ran kyua test on it. The crash comes after about 10 minutes in this case, with the same stack trace as above.
Thanks @jmmv, I will try and get this commit imported to test it on our local repo.
Do you have any thoughts on when you will do a release with this in it?
If you can test it, great!
Regarding a new release: soon, likely. I would like to look at a couple more bug reports (like the junit one) before doing one but won't add any new features into it.
Kyua 0.12 on FreeBSD HEAD or 10.1 segfaults in drivers/run_tests.cpp at line 297. It happens in a different test program each time, but it usually segfaults after about an hour.
Here's an example of the console output from a crash:
(original source) https://jenkins.freebsd.org/job/FreeBSD_HEAD/157/console
And here's a stack trace:
I can provide a core file, if needed. And I can also test patches or help diagnose the problem, if you can suggest anything.