We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/156860071
The labels on this github issue will be updated when the story is started.
Hey @kkallday and @heycait. Thanks for bringing this up. I just prioritized the story for the engineering team to dig into the flakiness. We'll try to get to it soon.
Thanks @dsabeti !
Hey folks, we have a few questions.
This test is one of the culprits, as it relies on the scheduler always taking the same action. I don't think it's deterministic: when tests are running in parallel on 3 cells, there's a chance the apps won't all be placed on the same cell.
Maybe someone from the windows team could chime in since I don't have much context on the scheduler logic.
cc @aminjam @sesmith177 @ankeesler
Hey folks, I did some investigation into these Windows flakes today (specifically the metrics one, since that one seems to be popping up the most). Here are some notes from my findings; I'd be interested in hearing people's opinions.
I looked a little more in depth into the metrics test that is super flaky in CATs. I wondered about a couple of things:
- Maybe we should check [this error channel](https://github.com/cloudfoundry/cf-acceptance-tests/blob/master/windows/metrics.go#L54) to make sure the connection to doppler isn't broken (see the sketch after this list)
- The `curl` to the `nora` app is taking a long time, as we are seeing elsewhere. Check out the time difference between the `curl` invocation [here](https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-deployment/jobs/stable-periodic-cats/builds/12537#L5ae0c7ed:404) and the next `cf` command [here](https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-deployment/jobs/stable-periodic-cats/builds/12537#L5ae0c7ed:404). This ~40 second timeout could be screwing things up
- Seems like the app is indeed printing out that `Muahaha` message [here](https://release-integration.ci.cf-app.com/teams/main/pipelines/cf-deployment/jobs/stable-periodic-cats/builds/12537#L5ae0c7ed:451), so the issue seems to be something going on after that (like the sending of that info to `metron_agent` or `doppler`). In the BOSH logs from one of the Windows 2016 cells, I saw this message on the `metron_agent_windows` job, which looked suspicious: `Dropped 10000 v2 envelopes`
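To make the first point concrete, here's a rough sketch of what draining that error channel might look like. This is not the actual code in `windows/metrics.go`; the helper name and the `noaaConnection`/`appGuid`/`authToken` variables are placeholders, and it assumes the noaa consumer API and the usual Ginkgo/Gomega setup the suite already has:

```go
// Hypothetical sketch only: surface doppler connection failures instead of
// ignoring the error channel, so the Eventually on the message stream
// doesn't just time out with no explanation.
package windows_test

import (
	"fmt"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"

	"github.com/cloudfoundry/noaa/consumer"
)

func streamAndWatchErrors(noaaConnection *consumer.Consumer, appGuid, authToken string) {
	msgChan, errChan := noaaConnection.Stream(appGuid, authToken)

	// Log (or fail on) stream errors rather than dropping them on the floor.
	go func() {
		for err := range errChan {
			if err != nil {
				fmt.Fprintf(GinkgoWriter, "doppler stream error: %v\n", err)
			}
		}
	}()

	// The existing assertion on the message stream can stay as-is.
	Eventually(msgChan, 2*time.Minute).Should(Receive())
}
```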
@ankeesler this report is great! I'm no longer on PAS releng but they will be tracking this issue so that we can eventually scale back up to testing with 3 nodes and have more confidence that tests are going red for legitimate reasons.
cc @davewalter and others on releng
Also curious to hear your thoughts on my comment above about the test that asserts on scheduling behavior (https://github.com/cloudfoundry/cf-acceptance-tests/issues/287#issuecomment-384729574). During your experiment, did you deploy with multiple nodes? I think this test will be flakey depending on cell capacity.
@kkallday @davewalter are you configuring `num_windows_cells` (see https://github.com/cloudfoundry/cf-acceptance-tests/blob/master/README.md#optional-parameters) in your CATS config? This test relies on scaling an app up to (num cells + 1) instances, to guarantee that there is at least 1 cell with at least 2 instances on it (sketched below).
That said, we should probably move this test to our regression test suite as its purpose is to cover a regression we had seen a long time ago in windows2012r2
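For concreteness, here's a rough sketch of the (num cells + 1) pigeonhole that comment describes, assuming the `cf.Cf` wrapper from cf-test-helpers that the suite already uses. The helper name, timeout, and hard-coded cell count are illustrative, not the actual test code; the real test would read the cell count from the `num_windows_cells` config value:

```go
// Hypothetical sketch: with N Windows cells and N+1 instances, at least one
// cell must end up hosting two instances of the app, which is what the
// scheduling assertion depends on.
package windows_test

import (
	"strconv"
	"time"

	"github.com/cloudfoundry-incubator/cf-test-helpers/cf"
	. "github.com/onsi/gomega"
	"github.com/onsi/gomega/gexec"
)

func scaleBeyondCellCount(appName string, numWindowsCells int) {
	// numWindowsCells would come from the CATS config (num_windows_cells).
	instances := numWindowsCells + 1

	session := cf.Cf("scale", appName, "-i", strconv.Itoa(instances))
	Eventually(session, 2*time.Minute).Should(gexec.Exit(0))
}
```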
Hey Folks,
@kkallday @davewalter @ankeesler It's been quite a while since this was originally submitted, and we've been working on CATs flake reduction... so I'm following up to see if this issue persists.
If it is still an issue, please comment and reopen.
Thanks so much,
Josh
Thanks @heyjcollins. I am no longer on the team either so I'm not sure what the current windows cell configuration is. @ljfranklin, it might be a good idea to revisit the issues in this report to see if they are still valid.
Dave
We scaled our Windows cells back up to 3 several months ago, CATs seem to be working fine now with multiple cells.
Hi team,
We've noticed that the windows suite is flakey when running against deployments with more than one diego cell. Here is a list of the tests that are known to be flakey so far:
This is a collection of some flakey tests. There may be more, but to unblock our pipeline we scaled down to 1 cell (which is the configuration that the windows/greenhouse team tests in their pipeline).
We were running the tests in parallel with 7 ginkgo nodes.
Affected suites: `include_windows`, `use_windows_test_task`, and `use_windows_context_path`
cc @ankeesler @aminjam
Thanks,
Kevin and @heycait