cloudfoundry / cf-acceptance-tests

CF Acceptance tests

windows suite is flakey on deployments with more than one windows diego cell #287

Closed kkallday closed 5 years ago

kkallday commented 6 years ago

Hi team,

We've noticed that the windows suite is flakey when running against deployments with more than one diego cell. Here is a list of the tests that are known to be flakey so far:

[windows] SSH ssh records successful ssh attempts 
/tmp/build/33ac16d1/go/src/github.com/cloudfoundry/cf-acceptance-tests/windows/ssh.go:173
[windows] WCF can have multiple routable instances on the same cell 
/tmp/build/33ac16d1/go/src/github.com/cloudfoundry/cf-acceptance-tests/windows/wcf.go:49
[windows] app logs captures stderr logs with the correct tag 
/tmp/build/33ac16d1/go/src/github.com/cloudfoundry/cf-acceptance-tests/windows/running_log.go:59
[windows] Metrics shows logs and metrics 
/tmp/build/33ac16d1/go/src/github.com/cloudfoundry/cf-acceptance-tests/windows/metrics.go:51
...

This is only a partial list of the flakey tests; there may be more. To unblock our pipeline we scaled down to 1 cell (which is the configuration that the windows/greenhouse team tests in their pipeline).

We were running the tests in parallel with 7 ginkgo nodes.

Affected suites: include_windows, use_windows_test_task and use_windows_context_path

cc @ankeesler @aminjam

Thanks,

Kevin and @heycait

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/156860071

The labels on this github issue will be updated when the story is started.

dsabeti commented 6 years ago

Hey @kkallday and @heycait. Thanks for bringing this up. I just prioritized the story for the engineering team to dig into the flakiness. We'll try to get to it soon.

kkallday commented 6 years ago

Thanks @dsabeti !

njbennett commented 6 years ago

Hey folks, we have a few questions.

  1. Are you seeing this behavior on 2012, 2016, or both?
  2. Are you running CATs from master, or some other SHA?
  3. How was the environment configured? What IaaS, what ops-files?

kkallday commented 6 years ago

  1. Both
  2. Using the latest CATs
  3. GCP, external GCP load balancer, 3 windows diego cells, latest windows 2012/2016 stemcells

This test is one of the culprits, as it relies on the scheduler always taking the same action. I don't think it's deterministic: when tests are running in parallel on 3 cells, there's a chance the apps won't all be placed on the same cell.

Maybe someone from the windows team could chime in since I don't have much context on the scheduler logic.
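
For illustration, here's a minimal Ginkgo/Gomega-style sketch of the kind of assertion I mean. The `cellIDForInstance` helper is hypothetical (standing in for however the test discovers which cell an instance landed on); the point is just that requiring every instance to report the same cell only holds reliably when there is a single Windows cell.

```go
// Hypothetical sketch, not the actual CATs code: asserting that all instances
// of an app landed on the same Diego cell. With one Windows cell this always
// passes; with three cells the scheduler may spread instances out, so the
// assertion can flake.
package windows_test

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// cellIDForInstance is a hypothetical helper standing in for whatever the real
// test would use (e.g. curling an endpoint that echoes the instance's cell).
func cellIDForInstance(appName string, index int) string {
	// ...query the given app instance and return the cell it is running on...
	return ""
}

var _ = Describe("instance placement", func() {
	It("assumes all instances share a cell (flaky with >1 cell)", func() {
		appName := "nora"

		// This only holds deterministically when there is exactly one cell.
		Expect(cellIDForInstance(appName, 1)).To(Equal(cellIDForInstance(appName, 0)))
	})
})
```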

cc @aminjam @sesmith177 @ankeesler

ankeesler commented 6 years ago

Hey folks, I did some investigation into these Windows flakes today (specifically the metrics one, since that one seems to be popping up the most). Here are some notes from my findings; I'd be interested in hearing people's opinions.

I looked a little more in depth into the metrics test that is super flaky in CATs. I wondered a couple of things.
- Maybe we should check [this error channel](https://github.com/cloudfoundry/cf-acceptance-tests/blob/master/windows/metrics.go#L54) to make sure the connection to doppler isn't bunk (a rough sketch of draining that channel follows this list)
- The `curl` to the `nora` app is taking a long time, as we are seeing elsewhere. Check out the time difference between the `curl` invocation [here](https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-deployment/jobs/stable-periodic-cats/builds/12537#L5ae0c7ed:404) and the next `cf` command [here](https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-deployment/jobs/stable-periodic-cats/builds/12537#L5ae0c7ed:404). This ~40 second timeout could be screwing things up
- Seems like the app is indeed printing out that `Muahaha` message [here](https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-deployment/jobs/stable-periodic-cats/builds/12537#L5ae0c7ed:451), so the issue seems to be something going on after that (like the sending of that info to `metron_agent` or `doppler`). In the BOSH logs from one of the windows 2016 cells, I saw this message on the `metron_agent_windows` job, which looked suspicious: `Dropped 10000 v2 envelopes`
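
For what it's worth, here is a minimal sketch of what checking that error channel could look like, assuming the test streams logs via the noaa consumer (whose `Stream` returns a message channel and an error channel). The `streamAppLogs` helper and the failure handling are made up for illustration:

```go
// Sketch only: surface Doppler connection problems from the error channel
// instead of silently seeing no envelopes. streamAppLogs is a made-up helper.
package windows_test

import (
	"crypto/tls"
	"fmt"

	"github.com/cloudfoundry/noaa/consumer"
)

func streamAppLogs(dopplerEndpoint, appGuid, authToken string) {
	c := consumer.New(dopplerEndpoint, &tls.Config{InsecureSkipVerify: true}, nil)
	msgChan, errChan := c.Stream(appGuid, authToken)

	// Drain the error channel; if the connection to Doppler is bad, make that
	// visible rather than letting the assertion on msgChan time out quietly.
	go func() {
		for err := range errChan {
			if err != nil {
				fmt.Printf("doppler stream error: %v\n", err)
				// The real test could Fail(...) here or push the error onto a
				// channel that the assertion loop selects on.
			}
		}
	}()

	for msg := range msgChan {
		_ = msg // assert on the received envelopes here
	}
}
```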

kkallday commented 6 years ago

@ankeesler this report is great! I'm no longer on PAS releng but they will be tracking this issue so that we can eventually scale back up to testing with 3 nodes and have more confidence that tests are going red for legitimate reasons.

cc @davewalter and others on releng

Also curious to hear your thoughts about my comment above about the test that asserts on scheduling behavior: https://github.com/cloudfoundry/cf-acceptance-tests/issues/287#issuecomment-384729574. During your experiment did you deploy with multiple nodes? I think this test will be flakey depending on cell capacity.

sesmith177 commented 6 years ago

@kkallday @davewalter are you configuring `num_windows_cells` (see https://github.com/cloudfoundry/cf-acceptance-tests/blob/master/README.md#optional-parameters) in your CATs config? This test relies on scaling an app up to (num cells + 1) instances to guarantee that there is at least 1 cell with at least 2 instances on it.
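
As a rough illustration of that (num cells + 1) logic, here's a sketch using the cf-test-helpers `cf` wrapper and Gomega (import paths may differ by CATs version); the `cellIDForInstance` callback is hypothetical, standing in for however the real test maps instances to cells:

```go
// Sketch only: scale to (numWindowsCells + 1) instances so that, by the
// pigeonhole principle, at least one cell must be running two or more
// instances. The placement lookup is passed in as a hypothetical callback.
package windows_test

import (
	"strconv"

	"github.com/cloudfoundry-incubator/cf-test-helpers/cf"
	. "github.com/onsi/gomega"
	"github.com/onsi/gomega/gexec"
)

func scaleAndCheckColocation(appName string, numWindowsCells int, cellIDForInstance func(appName string, index int) string) {
	instances := numWindowsCells + 1

	// Scale the app to one more instance than there are Windows cells.
	// (A real test would pass an explicit timeout to Eventually.)
	Eventually(cf.Cf("scale", appName, "-i", strconv.Itoa(instances))).Should(gexec.Exit(0))

	// Count how many instances landed on each cell.
	perCell := map[string]int{}
	for i := 0; i < instances; i++ {
		perCell[cellIDForInstance(appName, i)]++
	}

	// With numWindowsCells cells and numWindowsCells+1 instances, some cell
	// must be hosting at least two instances.
	colocated := false
	for _, count := range perCell {
		if count >= 2 {
			colocated = true
		}
	}
	Expect(colocated).To(BeTrue())
}
```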

That said, we should probably move this test to our regression test suite, as its purpose is to cover a regression we saw a long time ago in windows2012r2.

heyjcollins commented 5 years ago

Hey Folks,

@kkallday @davewalter @ankeesler It's been quite a while since this was originally submitted and we've been working on CATs flake reduction... so I'm following up to see if this issue persists.

If it is still an issue, please comment and reopen.

Thanks so much,

Josh

davewalter commented 5 years ago

Thanks @heyjcollins. I am no longer on the team either so I'm not sure what the current windows cell configuration is. @ljfranklin, it might be a good idea to revisit the issues in this report to see if they are still valid.

Dave

ljfranklin commented 5 years ago

We scaled our Windows cells back up to 3 several months ago, CATs seem to be working fine now with multiple cells.