ctdk / goiardi

A Chef server written in Go, able to run entirely in memory, with optional persistence with saving the in-memory data to disk or using MySQL or Postgres as the data storage backend. Docs: http://goiardi.readthedocs.io/en/latest/index.html
http://goiardi.gl
Apache License 2.0
280 stars 39 forks source link

[go1.5] Intermittent / Strange Test Failures in Indexer Package #42

Closed theckman closed 8 years ago

theckman commented 8 years ago

Today while working a change, I noticed that I couldn't get my tests to pass. That included a fresh checkout from the master branch. I reached out on Twitter, and managed to dig in a little bit. I figured I'd just open this so you won't need to look on Twitter to find the details.

I'm running Mac OS X 10.10.5 with the following Go runtime: go version go1.5.1 darwin/amd64. My testing of these issues did include recompiling Go 1.5.1, with no change in behavior.

Here is the Twitter thread for reference:

The first issue I hit was this test failure here:

Based on the stacktrace and the code it points to, it seems like there is a sync.RWMutex.RLock() function call that doesn't get the lock within 10 minutes:

I believe that code gets transitively called from here:

After I hit that issue, I then ran go get -u github.com/ctdk/goiardi and the next test failed like this:

It looks like none of the mocked information made it to the search index. I'm only opening one issue because these two things seem like they may be related (to the untrained eye).

theckman commented 8 years ago

I just had a thought about an issue I had with goiardi the other day on my workstation. Internally at my company, we have a knife plugin for uploading the different things to our Chef server. I tried to run the sync script and I believe it failed during this API call:

Chef::CookbookVersion.load(<cookbook>).manifest

I recently started wanting to keep state between restarts, so I started using the IndexFile. Before that, I hadn't had any issues. I wonder if there's some weird bug that's manifesting itself on my system that's showing itself in the tests? Again, untrained eye so I could just misunderstand how the pieces come together.

I'm not sure if it's helpful to mention this, but after upgrading to go1.5 I noticed that unit tests in my personal projects became quite a bit more racy.

ctdk commented 8 years ago

Another strong possibility is that something I did for getting the postgres search to work is screwing this up. I've been running chef-pedant against it this evening, but it's getting weird broken pipe errors that I need to deal with.

The version thing is an interesting notion, though; 0.10.0 was build with go 1.4.2, and it was passing all of the tests fine at that point (except for some minor tests mentioned in the very next commit). That would explain an awful lot, actually - I'll try falling back and see if it makes a difference. I think circleci is using 1.4.2 as well still, which would explain why the only failures I was getting were the annoying search ones I mentioned on Twitter. Interesting...

ctdk commented 8 years ago

Well, I feel a little less crazy. I reverted back to go 1.4.2, got goiardi rebuilt, and the tests pass (along with the formerly breaking chef-pedant errors), so I think it's something weird about 1.5 going on.

Unfortunately now my go install's all messed up and I can't get it to actually get any changes to the goiardi modules to actually be reflected in the binary when rebuilding. I think it's just that my environment's messed up from downgrading or something, but that does mean that I couldn't actually get this PR tested yet. :-/ Still, it's something.

theckman commented 8 years ago

I have a "fix" for you. Run the tests with the -p 1 flag. I think the issue is due to GOMAXPROCS being increased by default on go1.5+. Fixing the tests to avoid needing the -p flag sounds like it could be a giant rabbit hole.

Thinking about it more, if it's causing tests to break does that mean similar failure conditions can exist in runtime operation?

Log:

ctdk commented 8 years ago

Interesting. I've run goiardi with GOMAXPROCS > 1 before without ill effect, but that could be pure luck. This definitely bears investigating. Time to un-break my golang installation before anything else though.

ctdk commented 8 years ago

Something's definitely a little different between go 1.4 and go 1.5 here. The tests pass fine with go 1.4. but the in-mem index breaks weirdly with 1.5. (chef-pedant also revealed problems with the tests that send very large amounts of data to the server with 1.5, so that's another pretty serious issue.)

Fun stuff, but thanks so much for finding this.

theckman commented 8 years ago

I always seem to find the fun problems. :) I'll just install the tagged version of Goiard for running, and keep master(ish) in my GOPATH for library usage.

Would you like me to make any changes to the title of the issue to indicate it's a go1.5 issue?

ctdk commented 8 years ago

Sure, marking it as 1.5 specific seems like a good idea.

theckman commented 8 years ago

Alrighty, I've added it as a prefix to the title.

ctdk commented 8 years ago

One item of progress: I finally duplicated that search error with the improper number of results returned that had been bedeviling me in circleci. I just had to up GOMAXPROCS from 1.

theckman commented 8 years ago

That's great news. I have a weird feeling that the issues we are seeing are directly related to those random failures in CircleCI. They just manifest differently based on the order of operations happening internally within Goiardi. Once the code the test asserts is fixed, I hope it'll fix the other weirdness we are seeing.

ctdk commented 8 years ago

I've been neglecting the in-memory search a bit anyway, since I've been focused (as much as I've been able to do much with goiardi, which hasn't been as much as I'd necessarily like) on the postgres search. That's pretty nice, actually, although it's not fire-and-forget. I learned the hard way that it needs more maintenance than expected.

Fortunately that maintenance is just periodic reindexing, which isn't too bad at all.

ctdk commented 8 years ago

I learned why the search test failed with GOMAXPROCS > 1 - it's because since more than one goroutine can run at once, and the function that saves each item to the index is run in a goroutine to keep from having to make the caller sit around and wait for the indexing to finish in the normal case, the test keeps running. Since the test happens so quickly after the objects are indexed, the index hasn't been fully updated by the time the tests start running.

A simple sleep causes them to pass, but that's obviously not the right way to do it. It shouldn't be too hard though.

ctdk commented 8 years ago

So, uh, as of the latest master the go tests at least seem to be passing in master with 1.5 when run normally. Sometimes (but of course inconsistently) the indexer tests hang when run with -p 1, but they never seem to hang when run without that flag. To top it all of, -race doesn't show any problems, and it never hangs when either -race or -v are provided. Still investigating (and I haven't taken chef-pedant to it yet; I imagine there's still problems there), but it's progress.

ctdk commented 8 years ago

OK, the hang in the tests is caused by there being two calls to i.m.RLock() under FileIndex's Save() function - the first being inside Save(), the second inside FileIndex's GobEncode(). For whatever reason this was never a problem before, but in go1.5.1 this sometimes causes a deadlock between Save() and SaveItem(). I've removed it; give it a whirl now and see how it does. Still running chef-pedant against 1.5.1 to see about the broken pipe errors though.

theckman commented 8 years ago

:+1: let me pull master and confirm

theckman commented 8 years ago

I can confirm that the tests now pass. Would you like to close this issue and mark it as resolved?

ctdk commented 8 years ago

Certainly.