First pass at upgrading BOSH to use v2.0 manifests

sethlindberg commented 8 years ago

So far the only upgrade advice is "remove redis from bosh" -- there are probably other steps needed to get this fully working, but as it stands this works with a fresh, deployed install of Cloud Foundry.

jahio commented 8 years ago

@sethlindberg Can you elaborate on why we're dropping redis here, and link to and/or include that screenshot from earlier today where the bosh maintainers in slack told you to drop redis? Just for posterity and for those who find this via "teh googlz" in the future, so it's clear why this PR nukes redis.

Bonus question: if we're dropping redis in this latest release, what was it being used for and what's taking its place now?

krutten commented 8 years ago

Redis was officially dropped in BOSH v256:

https://github.com/cloudfoundry/bosh/releases/tag/stable-3232

jahio commented 8 years ago

So that future googlers don't have to go tracking it down, this is the relevant excerpt from the release notes on where and why redis was dropped:

> Switched to using delayed job instead of Resque for managing Director tasks
>
> Warning: make sure to update your Director manifest (used with bosh-init) to remove mentions of redis.

However, I think there might be a more interesting story here: _why the switch from resque to delayed_job?_ In my experience I've always seen people and their apps going the other direction (abandoning DJ for resque), so this is quite a salient event which, to my thinking, would make for good technical discussion. Obviously outside the scope of this PR/thread, but something I thought I'd mention.
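
For anyone applying the release-note warning by hand, here is a rough Python sketch of what "remove mentions of redis" could look like on a manifest that has already been loaded into a dict. The manifest layout below (`jobs`, `templates`, `properties`) is a hypothetical example, not the actual manifest from this PR, and `strip_redis` is a made-up helper name; check your own Director manifest before relying on anything like this.

```python
def strip_redis(manifest):
    """Return a cleaned copy of a manifest-like structure with redis removed.

    Assumption: redis shows up either as a dict key named "redis"
    or as a list entry of the form {"name": "redis", ...}.
    """
    def clean(node):
        if isinstance(node, dict):
            # Drop any "redis" key, recurse into everything else.
            return {k: clean(v) for k, v in node.items() if k != "redis"}
        if isinstance(node, list):
            # Drop list entries that are dicts named "redis".
            return [
                clean(item)
                for item in node
                if not (isinstance(item, dict) and item.get("name") == "redis")
            ]
        return node

    return clean(manifest)


# Hypothetical manifest fragment, for illustration only.
manifest = {
    "jobs": [
        {
            "name": "bosh",
            "templates": [
                {"name": "nats", "release": "bosh"},
                {"name": "redis", "release": "bosh"},
                {"name": "director", "release": "bosh"},
            ],
            "properties": {
                "nats": {"address": "127.0.0.1"},
                "redis": {"address": "127.0.0.1"},
                "director": {"name": "my-bosh"},
            },
        }
    ]
}

cleaned = strip_redis(manifest)
```

The helper builds new dicts and lists rather than editing in place, so the original structure stays available for diffing against the cleaned one.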

sethlindberg commented 8 years ago

Can we merge this in? It's usable now.

jahio commented 8 years ago

@sethlindberg After doing additional testing, I found issues with what looks like their delayed_job implementation: it causes timeouts during `make provision` that make the whole stack explode (basically). Here's a relevant snippet of a log where I had this happen:

E, [2016-06-07 20:59:26 #7884] [] ERROR -- DirectorJobRunner: Worker thread raised exception: Timed out sending 'get_state' to c1c8d3c1-9bc3-472c-8593-7cdb74adbf0c after 45 seconds - /var/vcap/packages/director/gem_home/ruby/2.1.0/gems/bosh-director-1.3232.2.0/lib/bosh/director/agent_client.rb:215:in `block in handle_method'
/var/vcap/packages/ruby/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'

**DirectorJobRunner: Worker thread raised exception: Timed out sending 'get_state' to c1c8d3c1-9bc3-472c-8593-7cdb74adbf0c after 45 seconds**

(Emphasis added to aid the casual skimmer :smiley:)

This doesn't really look like something you did wrong, but it does look like a bug in their stack or perhaps some form of timeout we need to investigate. Unless there's a really good reason to, I'd prefer not to just jack up the timeout threshold because it sounds like it might be masking the real problem underneath. That said, I'll leave it up to you to investigate.
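
As a starting point for that investigation, a small Python sketch like the one below could pull the timeout events out of a Director log to show which agents and methods are affected. The regular expression is inferred only from the single log line quoted above, so treat the format as an assumption; `find_agent_timeouts` is a hypothetical helper, not part of BOSH.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this thread (an assumption,
# not a documented BOSH log format).
TIMEOUT_RE = re.compile(
    r"Timed out sending '(?P<method>[^']+)' to (?P<agent>[0-9a-f-]+) "
    r"after (?P<seconds>\d+) seconds"
)


def find_agent_timeouts(lines):
    """Return (method, agent_id, seconds) for every timeout line found."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append(
                (match.group("method"), match.group("agent"), int(match.group("seconds")))
            )
    return hits


# Sample input: the two log lines from the snippet above.
sample = [
    "E, [2016-06-07 20:59:26 #7884] [] ERROR -- DirectorJobRunner: "
    "Worker thread raised exception: Timed out sending 'get_state' to "
    "c1c8d3c1-9bc3-472c-8593-7cdb74adbf0c after 45 seconds",
    "/var/vcap/packages/ruby/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'",
]

timeouts = find_agent_timeouts(sample)
# Count timeouts per agent to see whether one VM is responsible for most of them.
per_agent = Counter(agent for _, agent, _ in timeouts)
```

If the same agent ID dominates the counts, that points at one unresponsive VM rather than a general problem in the job runner.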

cloudfoundry-community-attic / aws-nat-bastion-bosh-cf

First pass at upgrading BOSH to use v2.0 manifests #37