Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

Travis CI Instability #1455

Open WardF opened 5 years ago

WardF commented 5 years ago

We are seeing a lot of apparently transient breakage in Travis CI. This is causing problems managing pull requests: manually restarting failing tests is time-consuming, and there should be no need to do it in the first place, especially since the failures are seldom reproducible locally.

The issues tend to be centered on remote tests. I posit the issue is either an instability in the Travis infrastructure, instability in the remote test server, or instability in the Docker configuration used for the Travis tests. I'm working on trying to narrow this down, so that we can get the 4.7.1 release out the door and move on with the C++ and Fortran releases.

WardF commented 5 years ago

It's worth noting that the issues don't appear to show up in the Jenkins tests run by @edhartnett, nor do they show up on the private Jenkins install I've set up.

edwardhartnett commented 5 years ago

I use --disable-dap-remote-tests in many of my CI builds on Jenkins. If I have more than a few jobs trying to run DAP tests at the same time, I frequently get a timeout on tst_remote.sh.

So perhaps try that on a bunch of the Travis test cases, and let only one of them actually do DAP remote tests. If that fixes it, you can expand it to 2 or 3.
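A minimal sketch of that suggestion, assuming each CI job can set its own environment variable. `RUN_DAP_REMOTE` is a hypothetical name chosen for illustration and is not part of netCDF-C or Travis; only `--disable-dap-remote-tests` comes from the comment above. The script only prints the configure command it would run, so it can be tried outside a source tree.

```shell
#!/bin/sh
# RUN_DAP_REMOTE is a hypothetical per-job variable: set it to "yes" in the
# one job that should exercise the remote DAP server, leave it unset elsewhere.
RUN_DAP_REMOTE="${RUN_DAP_REMOTE:-no}"

CONFIGURE_ARGS=""
if [ "$RUN_DAP_REMOTE" != "yes" ]; then
    # Every other job skips the remote tests to avoid loading the test server.
    CONFIGURE_ARGS="--disable-dap-remote-tests"
fi

# Print the command rather than running it; a real job would invoke it directly.
echo "./configure $CONFIGURE_ARGS"
```

With this arrangement the number of jobs hitting the remote server is controlled by how many jobs set the variable, so it can be raised from one to two or three if the failures stop.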

(Or is there a way to lengthen the timeout of the tests in tst_remote.sh? Or is it just one test in tst_remote.sh that is getting a particularly big chunk of data?)

DennisHeimbigner commented 5 years ago

You can increase the timeout by creating a file called .dodsrc using e.g.

    cat >.dodsrc <<EOF
    HTTP.TIMEOUT=dddd
    EOF

where dddd is the timeout in seconds.
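As a concrete sketch of the .dodsrc approach, the snippet below writes the file with a specific value. The 120-second figure is purely illustrative, and writing the file into the current directory assumes the tests are run from there; neither detail comes from this thread.

```shell
#!/bin/sh
# Illustrative value only: choose a timeout (in seconds) that suits the CI environment.
TIMEOUT_SECONDS=120

# Write the rc file into the current directory (assumed to be where the tests run).
cat > .dodsrc <<EOF
HTTP.TIMEOUT=${TIMEOUT_SECONDS}
EOF

# Show the result so the value ends up in the CI log.
cat .dodsrc
```

Echoing the file into the log makes it easier to confirm later which timeout a given CI run actually used.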

WardF commented 5 years ago

If the issue turns out to be a timeout, it would make sense to address it as described, although we are only running a small number of concurrent tests. I'll be really dismayed if 4 people running 'make check' at once is enough to cause problems with our test servers.

WardF commented 5 years ago

Thanks @DennisHeimbigner. It occurred to me, "Hey, actually, I thought I disabled DAP testing in Travis," so now I'm reviewing how the test scripts parse the options passed to Docker at test time.