braddr / d-tester

Automated testing for github projects.
http://d.puremagic.com/test-results/
Can't reproduce Segfault on Darwin_64_32 #63

Closed marler8997 closed 6 years ago

marler8997 commented 6 years ago

I'm getting a segfault on the auto-tester (https://auto-tester.puremagic.com/show-run.ghtml?projectid=1&runid=2988767&isPull=true) on Darwin_64_32. I've moved to a macbook pro to try to reproduce so I could fix the issue but have been unsuccessful. I've made multiple attempts of going through the logs trying to reproduce the steps the auto-tester might be taking but I still can't reproduce the segfault. Is there a script that the auto-tester is running that I could use to reproduce?

braddr commented 6 years ago

Look in the braddr/at-client repository for the entirety of the tester host code. It looks like it might be a 32-bit-specific issue, as the Darwin_64_64 target passed. The host used for that build is an older version of OS X. You can see the versions of the tools used in the output of the build dmd step.

marler8997 commented 6 years ago

Yes, I noticed that too. I'll look through that at-client code. In the meantime I pushed a temporary commit that enables some extra logging in the build, so maybe that will be good enough to root-cause the issue without having to reproduce it locally.

marler8997 commented 6 years ago

Oh no, with my small change to logging it's not segfaulting anymore!!! NO!!!

marler8997 commented 6 years ago

I wasn't able to reproduce on my macbook. I've been changing one line at a time, pushing the change, and waiting for the autotester to run Darwin_64_32. It's a slow process. Is there a way to speed this up? Is there a way to configure a particular PR to only run certain platforms or have priority?

marler8997 commented 6 years ago

This is crazy. I've been changing one thing at a time trying to figure out how to work around this segfault. I have no idea what's causing it, and the most random changes will either avoid it or cause it to come back!

wilzbach commented 6 years ago

Is there a way to speed this up? Is there a way to configure a particular PR to only run certain platforms or have priority?

IIRC PRs against stable have a higher priority in the queue.

Remove the win{32,64}.mak files: the Windows builds will then fail in the first stage when trying to apply the patch. To abort on the other platforms, here's a hack. Add this to the top of the root posix.mak:

ifneq (Darwin,$(shell uname -s))
 $(error Not OSX. Abort.)
endif

We have/had an odd DScanner segfault at Phobos recently, gaining the stacktrace was pretty helpful. Have a look at: https://github.com/dlang/phobos/blob/master/posix.mak#L558

It boils down to:

gdb -q -ex run -ex bt -batch --args <program-with-args>
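
Since gdb may not be installed on an OS X host, a rough sketch of a wrapper that falls back to lldb could look like the following. The function name `backtrace_command` and the `DEBUGGER` override are made up for illustration, and the lldb options (`--batch`, `-o`, `-k`) are what I believe current lldb accepts, so check `lldb --help` on the actual machine first. The function only prints the command so it can be inspected before running.

```shell
#!/bin/sh
# Print a debugger command line that runs "$@" and dumps a backtrace on a crash.
# DEBUGGER (hypothetical override) forces a choice; otherwise gdb is tried, then lldb.
backtrace_command() {
    debugger="${DEBUGGER:-}"
    if [ -z "$debugger" ]; then
        if command -v gdb >/dev/null 2>&1; then
            debugger=gdb
        elif command -v lldb >/dev/null 2>&1; then
            debugger=lldb
        else
            echo "no debugger found" >&2
            return 1
        fi
    fi
    case "$debugger" in
        # Same invocation as the gdb one-liner above.
        gdb)  echo "gdb -q -ex run -ex bt -batch --args $*" ;;
        # Assumed lldb equivalent: -k commands run only if the program crashes.
        lldb) echo "lldb --batch -o run -k bt -- $*" ;;
        *)    echo "unknown debugger: $debugger" >&2; return 1 ;;
    esac
}
```

For example, `DEBUGGER=lldb backtrace_command ./dmd foo.d` prints `lldb --batch -o run -k bt -- ./dmd foo.d`. Arguments containing spaces would need extra quoting before the printed line is pasted into a shell.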

marler8997 commented 6 years ago

Thanks for the helpful suggestions, I'll play with these

wilzbach commented 6 years ago

FYI when you reproduce this locally you probably should grab 2.068.2 archives:

==== Toolchain Information ====
uname -a: Darwin D-Autotester.local 13.4.0 Darwin Kernel Version 13.4.0: Mon Jan 11 18:17:34 PST 2016; root:xnu-2422.115.15~1/RELEASE_X86_64 x86_64
MAKE(make): GNU Make 3.81 Copyright (C) 2006 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. This program built for i386-apple-darwin11.3.0
SHELL(/bin/sh): GNU bash, version 3.2.53(1)-release (x86_64-apple-darwin13) Copyright (C) 2007 Free Software Foundation, Inc.
HOST_DMD(/Users/braddr/sandbox/at-client/release-build/install/osx/bin/dmd): DMD64 D Compiler v2.068.2 Copyright (c) 1999-2015 by Digital Mars written by Walter Bright
HOST_CXX(c++): Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn) Target: x86_64-apple-darwin13.4.0 Thread model: posix
@(#)PROGRAM:ld  PROJECT:ld64-241.9
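
To make sure a local reproduction really uses the same host compiler, one option is to compare the version in the compiler banner against the one in the log above. This is only a sketch: the function names are hypothetical, and it assumes the banner has the `D Compiler v<version>` shape shown in the HOST_DMD line.

```shell
#!/bin/sh
# Extract the version (e.g. "2.068.2") from a banner such as
# "DMD64 D Compiler v2.068.2 Copyright (c) ...".
dmd_version_from_banner() {
    printf '%s\n' "$1" | sed -n 's/.*D Compiler v\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed only if the banner reports the wanted host compiler version.
require_host_dmd() {
    wanted="$1"
    banner="$2"
    actual=$(dmd_version_from_banner "$banner")
    if [ "$actual" = "$wanted" ]; then
        echo "host compiler matches: $actual"
    else
        echo "host compiler mismatch: wanted $wanted, got ${actual:-unknown}" >&2
        return 1
    fi
}
```

A possible use is `require_host_dmd 2.068.2 "$(dmd --version | head -n 1)"` before starting the build.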

marler8997 commented 6 years ago

Hmmm, does the autotester stop running tests if one of the platforms fail?

wilzbach commented 6 years ago

Hmmm, does the autotester stop running tests if one of the platforms fail?

Nope, on the contrary, it will try to restart failing tests whenever free resources are available. Also it will automatically invalidate all jobs whenever a new commit gets merged at dmd, druntime or phobos.

marler8997 commented 6 years ago

hmmm, can't seem to find gdb or ddb

marler8997 commented 6 years ago

Adding find / -name gdb to one of the targets :)

Ugh, this is so tedious

wilzbach commented 6 years ago

hmmm, can't seem to find gdb or ddb

I seem to remember that a couple of tests required a debugger to catch the stack trace, but apparently that's Linux only: https://github.com/dlang/dmd/blob/master/test/d_do_test.d#L681

marler8997 commented 6 years ago

If I push a change that causes other platforms to fail, it seems that the autotester stops testing the PR altogether...

marler8997 commented 6 years ago

@braddr @wilzbach

I've finally been able to reproduce the bug on my macbook.

It appears the segfault only occurs if I use the 2.068.2 compiler. The stack frame for the segfault is as follows:

Obj::term(char const*) + 2285
obj_end(Library*, File*) + 37
tryMain(unsigned long, char const**) + 12066
_start + 203
start + 33

If I upgrade the compiler to the next version 2.069.0 then the problem goes away. Would we be able to upgrade the autotester to this version? If not, maybe we can patch version 2.068.2 with whatever fix went into the next version of dmd? I'm not sure if dlang supports patches to older versions or not.
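
For anyone else comparing host compiler versions, here is a sketch of how the check could be scripted. It relies on the shell convention that a child killed by a signal is reported as 128 plus the signal number, which gives 139 for SIGSEGV. The function names are hypothetical. Note that make reports its own error status when a recipe crashes, so the wrapper has to run the freshly built dmd directly rather than make.

```shell
#!/bin/sh
# Succeed if the given exit status means "killed by SIGSEGV" (128 + 11).
is_segfault_status() {
    [ "$1" -eq 139 ]
}

# Run a command quietly and print either "segfault" or its exit status.
check_segfault() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    if is_segfault_status "$status"; then
        echo "segfault"
    else
        echo "exit $status"
    fi
}
```

After building dmd once with each host compiler, something like `check_segfault ./dmd -c test.d` (paths are placeholders) shows which build crashes.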

braddr commented 6 years ago

Not without changing the official minimum compiler for building dmd. That's not my call. Take it back to the community and leadership team. I'm going to resolve this issue as it's not a problem with the tester.

marler8997 commented 6 years ago

Sorry, I'm new to the autotester and how things work, so I have a few questions. What exactly do you mean when you say "official minimum" compiler? Is that determined by an email between you and the leadership team, or is it published somewhere? In other words, if the leadership team is ready to change the official minimum compiler, how does that decision get published and make its way to you, the implementor?

The next question I have is how the minimum compiler is defined. Is it defined as a specific set of bits or maybe a major/minor release version? For example, maybe it's defined as 2.068, so the minimum could mean any version 2.068.x, or maybe it's a specific set of bits on the download page for each platform? Do we support "patches" to the minimum compiler, or are its bits forever locked?

Thanks in advance for your help and patience.

marler8997 commented 6 years ago

Since the same segfault appears to be happening on 2 PRs now, it's making me think that it's unlikely this bug has been in the autotester for a long time. @braddr , do you recall how long the autotester has been using its current version of the 2.068.2 compiler? Has it been updated or modified recently?

braddr commented 6 years ago

many years.

marler8997 commented 6 years ago

Did you have a second to answer the other questions as well @braddr ?

braddr commented 6 years ago

I don't know where it's documented these days, but it's the version required to be able to bootstrap dmd. See also a recent thread on the forums about this exact topic. ANY change of version matters here, regardless of which component of the version number changes. I'm in no way saying it can't be changed, just that changing it on the auto-tester fleet is the very last and probably easiest step.

marler8997 commented 6 years ago

Cool, thanks for the info. It sounds like dmd-cxx will likely be the long-term bootstrap compiler in the future. Anyway, I've got a workaround for my PR, but this segfault is affecting another PR, so we'll see if they can find a workaround as well, or if they'll have to wait till we bump the minimum version again.