freebsd / kyua

Port/Package build and test system
https://github.com/freebsd/kyua/wiki
BSD 3-Clause "New" or "Revised" License
149 stars 42 forks source link

segfault in drivers::run_tests::drive at line 297 #156

Closed asomers closed 8 years ago

asomers commented 8 years ago

Kyua 0.12 on FreeBSD HEAD or 10.1 segfaults in drivers/run_tests.cpp at line 297. It happens in a different test program each time, but it usually segfaults after about an hour.

Here's an example of the console output from a crash:

[192.168.10.2] out: local/kyua/integration/cmd_list_test:one_arg__subdir  ->  passed  [0.160s]
[192.168.10.2] out: local/kyua/integration/cmd_list_test:one_arg__test_case  ->  passed  [0.104s]
[192.168.10.2] out: local/kyua/integration/cmd_list_test:one_arg__test_program  ->  passed  [0.084s]
[192.168.10.2] out: local/kyua/integration/cmd_list_test:only_load_used_test_programs  ->  *** Fatal signal 11 received
[192.168.10.2] out: *** Log file is /root/.kyua/logs/kyua.20160210-175558.log
[192.168.10.2] out: *** Please report this problem to kyua-discuss@googlegroups.com detailing what you were doing before the crash happened; if possible, include the log file mentioned above
[192.168.10.2] out: Segmentation fault (core dumped)
[192.168.10.2] out: 

(original source) https://jenkins.freebsd.org/job/FreeBSD_HEAD/157/console

And here's a stack trace:

#0  0x0000000000412d55 in std::__1::__tree_min<std::__1::__tree_node_base<void*>*> (__x=0x1) at /usr/include/c++/v1/__tree:134
#1  std::__1::__tree_next<std::__1::__tree_node_base<void*>*> (
    __x=0x7fffffffb7f0) at /usr/include/c++/v1/__tree:158
#2  0x0000000000472272 in std::__1::__tree_iterator<std::__1::__value_type<int, long>, std::__1::__tree_node<std::__1::__value_type<int, long>, void*>*, long>::operator++ (this=0x7fffffffb478) at /usr/include/c++/v1/__tree:656
#3  std::__1::__tree<std::__1::__value_type<int, long>, std::__1::__map_value_compare<int, std::__1::__value_type<int, long>, std::__1::less<int>, true>, std::__1::allocator<std::__1::__value_type<int, long> > >::erase (
    this=0x7fffffffb7e8, __p=...) at /usr/include/c++/v1/__tree:1978
#4  0x0000000000470619 in std::__1::__map_const_iterator<std::__1::__tree_const_iterator<std::__1::__value_type<int, long>, std::__1::__tree_node<std::__1::__value_type<int, long>, void*>*, long> >::__map_const_iterator (
    this=0x7fffffffb7e8, this=0x7fffffffb7e8, __i=[43985152] 81, 
    __p=[43985152] 81) at /usr/include/c++/v1/map:1067
#5  drivers::run_tests::drive (kyuafile_path=..., build_root=..., 
    store_path=..., filters=empty std::__1::set, user_config=..., hooks=...)
    at drivers/run_tests.cpp:297
#6  0x0000000000457605 in cli::cmd_test::run (this=0x802822880, ui=
    0x7fffffffe4d8, cmdline=..., user_config=...) at cli/cmd_test.cpp:158
#7  0x0000000000413354 in utils::cmdline::base_command<utils::config::tree>::main (this=0x802822880, ui=0x7fffffffe4d8, 
    args=std::__1::vector (length=1, capacity=1) = {...}, data=...)
    at ./utils/cmdline/base_command.ipp:96
#8  0x000000000040d57a in (anonymous namespace)::run_subcommand (
    ui=0x7fffffffe4d8, command=0x802822880, 
    args=std::__1::vector (length=1, capacity=1) = {...}, user_config=...)
    at cli/main.cpp:139
#9  0x000000000040c30b in (anonymous namespace)::safe_main (ui=0x7fffffffe4d8, 
    argc=3, argv=0x7fffffffeb18, mock_command=...) at cli/main.cpp:228
#10 0x0000000000408ed2 in cli::main (ui=0x7fffffffe4d8, argc=3, 
    argv=0x7fffffffeb18, mock_command=...) at cli/main.cpp:280
#11 0x000000000040cd74 in cli::main (argc=3, argv=0x7fffffffeb18)
    at cli/main.cpp:353
#12 0x0000000000408742 in main (argc=3, argv=0x7fffffffeb18) at main.cpp:49

I can provide a core file, if needed. And I can also test patches or help diagnose the problem, if you can suggest anything.

jmmv commented 8 years ago

Is this with parallel execution enabled?

Has this just started happening? Note that 0.12 has been out for a while already, so if this is a new issue, I suspect it was triggered by an environment change.

asomers commented 8 years ago

At $WORK, it started happening as soon as we upgraded to Kyua 0.12. On the FreeBSD Jenkins server, which I linked to, it happened as far back as Feb 2, and there's no earlier history. Here at $WORK, we didn't explicitly set parallelism in kyua.conf. It defaults to serial execution, right?

jmmv commented 8 years ago

Correct; the default is still serial.

OK, so if it's with the upgrade then my theory on an environment change is unfounded.

I assume this happens when running the full FreeBSD test suite, right?

You also mention "under an hour" but in my experience FreeBSD test runs are faster than that. Is this within a VM maybe?

asomers commented 8 years ago

More details about the upgrade: we're running a custom version of FreeBSD that was forked from stable/10 at version 277174. The crash happened when we tried a general ports upgrade, which included upgrading Kyua from 0.11 to 0.12, but also included upgrading many other ports. Given kyua's small list of dependencies and the location of the stack trace, I'm skeptical that any of those other ports could be responsible. But for completeness's sake, here's the full list of kyua's dependent ports and their versions: kyua 0.11 -> 0.12_1 sqlite3 3.8.10.1 -> 3.10.2_1 atf-0.21 -> 0.21 (no change) lutok-0.4_6 -> 0.4_6 (no change) pkgconf-0.9.7 > 0.9.12_1

The crash happens during a full test of the FreeBSD test suite, plus our proprietary tests. In particular, the test run includes everything from https://svnweb.freebsd.org/base/projects/zfsd/head/tests/sys/cddl/zfs/tests/ . All together, these test runs last about 5 hours. The crash can happen during any test. Usually it happens during one of the ZFS tests, simply because they take the most time. But it's also happened during the pw(8) tests. We're running on bare metal on a 2 socket Haswell system.

brd commented 8 years ago

Since this is happening in the FreeBSD Jenkins I'd like to revert the upgrade to 0.12 from the FreeBSD port until this gets fixed.

jmmv commented 8 years ago

Reverting the port sounds reasonable. I may be able to look at it later today, but if you have an earlier chance, feel free to proceed.

jmmv commented 8 years ago

This is very strange. I've been looking at the code and I cannot tell why erase() on that line would fail; what's doing is not complicated and I cannot see why the iterator would be invalid. Also, I cannot reproduce this either so it'll be hard to track down.

It'd be that the pid_t to int conversion is messing things up, though I don't think so because pid_t is defined as int32_t on amd64.

I'm also looking at running a full test run with Valgrind to see if that brings some light (of any form).

jmmv commented 8 years ago

Valgrind reported nothing interesting. Also tried an AddressSanitizer-enabled build and, while this reports a bunch of issues in test programs (most of which come from atf-c++), the main kyua binary didn't report any.

brd commented 8 years ago

@jmmv Thank you for researching it, but I would like to go ahead and roll the port back. Is that OK with you?

https://people.freebsd.org/~brd/patches/kyua-rollback.diff

jmmv commented 8 years ago

Yes. Sorry I didn't get to doing this earlier...

brd commented 8 years ago

No worries, thank you!

jmmv commented 8 years ago

Excellent. I have been finally able to reproduce this locally: created a fake test suite composed of symlinks to the Kyua tests 100 times, and then ran kyua test on it. The crash comes after about 10 minutes in this case, with the same stack trace as above.

brd commented 8 years ago

Thanks @jmmv, I will try and get this commit imported to test it on our local repo.

Do you have any thoughts on when you will do a release with this in it?

jmmv commented 8 years ago

If you can test it, great!

Regarding a new release: soon, likely. I would like to look at a couple more bug reports (like the junit one) before doing one but won't add any new features into it.