cypress-io / cypress

Fast, easy and reliable testing for anything that runs in a browser.
https://cypress.io

Cypress sometimes stalls/hangs with no output when running in Jenkins with Docker #8206

Closed kaiyoma closed 12 months ago

kaiyoma commented 4 years ago

Current behavior:

I feel like this kind of issue has been referenced many times before, but even after trying all the solutions I've found, we still continue to run into this problem. We've been seeing this issue for many months now, with early versions of Cypress 4.x and even with the most recent version today.

Long-running (multi-hour) test suites often "stall": the output stops appearing, but nothing actually fails or crashes. We actually have to rely on the timeout mechanism in Jenkins to kill the task because Cypress is just stuck.

I have enabled debug logs and honestly I don't see anything helpful in them. I'll upload the relevant portion next, but there are no mentions of failures, crashes, or even being out of memory. We're already following all the guidance in the Cypress docs about disabling /dev/shm usage and using the correct IPC setting.

Desired behavior:

Tests should either run to completion or Cypress should fail with a clear error.

Versions

Cypress: 4.12.1
OS: Linux
Browser: Chrome 83 (headless)

kaiyoma commented 4 years ago

Here's an abridged version of the full Cypress debug log: cypress-debug-abridged.log

In our Jenkins output, this is the last message I see:

06:43:15        βœ“ should render a read & write view for Image Management (3909ms)

You can see this mocha test at the top of the logfile. But after that, no more mocha tests are triggered for some reason. Nothing runs out of memory, nothing crashes, nothing fails, nothing exits. Everything just stops responding and Cypress seems to be waiting for an event that'll never happen.

I can furnish the full debug logfile (13.2 MB) upon request.

AnkitPhalnikerTR commented 4 years ago

We are facing a similar issue: Cypress just hangs indefinitely during UI test runs, with no error and no logs. The only option is to close it and run the tests again. We are running them locally and have not integrated with CI yet.

cellog commented 4 years ago

Also seeing this: in about 1 of every 50 test runs, Cypress will hang, occasionally for 12 hours(!) if no one notices.

jennifer-shehane commented 4 years ago

The debug logs don't indicate anything that we can go off of. We'll need a way to reproduce this in order to investigate.

mblasco commented 4 years ago

I'm also experiencing the same issue, and it seems to be a recent one: I was able to reproduce it with Cypress 4.11 and also 4.12.x. I even had to write a script that cancels a test run if it's been going for more than 30 minutes and triggers it again. In our particular case we have been running 3 instances of Cypress in parallel on the same machine and it's been working for months. But now, as someone said, from time to time one of the tests hangs indefinitely, with no output indicating whether there was a failure or anything else. I will try to gather some info on what's going on and post it here. About our environment: we are not running Jenkins or Docker, just a Linux machine running tests against an external website.

kaiyoma commented 4 years ago

The debug logs don't indicate anything that we can go off of. We'll need a way to reproduce this in order to investigate.

Run multi-hour test suites in a Jenkins/Docker environment until you start to see Cypress hanging. πŸ˜„

On a serious note, it sounds like several people are seeing this problem, and we all have access to such environments, so maybe you'll have to just leverage that rather than trying to repro yourself. Is there any other logging or debugging instrumentation we should turn on that would be helpful?

anaisamp commented 4 years ago

I've also experienced this. I don't have a long-running/multi-hour test suite though, I only have 2 test files so far. The tests have always passed locally. But when running inside a Docker container, the tests hung most of the time, without any feedback, and I had to manually stop the Jenkins pipeline.

After a few days of trying things out, making these 3 changes helped me.

  1. Not using Chrome. Before, I was using --browser chrome. When I removed it, I saw a helpful output at the same stage where the tests used to hang. (Screenshot of the error output attached in the original comment.)
  2. I could see the error was related to a memory leak, so I added ipc and shm_size to the docker-compose file, as everyone was advising. (I had done this before, but it only seemed to work after not using Chrome.)

version: '3.5'
services:
  app-build:
    ipc: host
    shm_size: '2gb'

  3. I'm using fixtures, and I trimmed them to use smaller .json files.

Please note I'm just sharing what seemed to have worked for me, but I don't consider this a solution. It sounds like an issue that should be looked into.

kaiyoma commented 4 years ago

Interesting. To the best of my knowledge, we've seen the opposite trend: using Electron was worse and switching to Chrome has slightly improved things. But we still run into this hanging problem with both browsers. (This would make sense if the problem is actually with mocha, which is kind of what the logs would suggest.)

bradyolson commented 4 years ago

FWIW, we were seeing this nearly every test suite execution (>90%) in our CI runner (AWS CodeBuild with parallelization, using Chrome), with no issues running locally. Today we upgraded to v5.0.0 and the hanging/stalling has seemingly stopped

kaiyoma commented 4 years ago

@bradyolson Good to know! We're attempting to upgrade Cypress this week, so maybe in a few days I can report back with our results.

cellog commented 4 years ago

I have an update on this:

I noticed that Cypress tests were downloading a few static assets in a continuous stream. The site logo, a font, etc. They would be downloaded hundreds of times in a single test, and in some cases completely overwhelm the browser, causing the test to fail. This happened locally too.

I downgraded to version 4.9.0 from 4.12.0 and the issue went away. I hope this gives some context to help.

For some other context: this happened in end-to-end mode and in mock mode (we run our tests in both modes) AND in a new mode using mock service worker, but NOT when running the app in development or in production through a regular Chrome browser.

kaiyoma commented 4 years ago

We finally got Cypress 5 to run and hit the same exact problem. No improvement.

kaiyoma commented 4 years ago

I spent the afternoon trying to run our tests with Firefox (which I realize is in "beta") and noticed that this problem seems a lot worse with Firefox. It's the same symptoms as before: console output stops appearing, mocha events appear to stop coming in, but the debug log keeps chugging along with memory and CPU updates.

When trying to run our usual battery of ~50 test suites, we couldn't get further than 5 suites, sometimes failing at even the first one. With Chrome, we could always get at least 25 or 30 suites in before the stall happens.

kaiyoma commented 4 years ago

I've tried a new approach/workaround in our project where instead of passing all our test suites to Cypress all at once, I pass the test suites one at a time.

Before:

cypress run --browser chrome --headless --spec foo.spec.js,bar.spec.js,baz.spec.js

After:

cypress run --browser chrome --headless --spec foo.spec.js
cypress run --browser chrome --headless --spec bar.spec.js
cypress run --browser chrome --headless --spec baz.spec.js
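
For anyone who wants to script this instead of hard-coding separate commands, here's a minimal sketch using the Cypress Module API (the spec paths are illustrative; adjust them to your project layout):

const cypress = require('cypress')

// each spec gets its own `cypress run` invocation, so a stall only costs that one spec
const specs = [
  'cypress/integration/foo.spec.js',
  'cypress/integration/bar.spec.js',
  'cypress/integration/baz.spec.js',
]

async function runAll() {
  for (const spec of specs) {
    const results = await cypress.run({ spec, browser: 'chrome', headless: true })
    // status 'failed' means the run could not complete; totalFailed counts failing tests
    if (results.status === 'failed' || results.totalFailed > 0) {
      process.exitCode = 1
    }
  }
}

runAll()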

This seems to have helped a little bit, though last night we hit an error even with this approach. About 2 hours into the testing, a test suite stalled after running for only a couple minutes and Jenkins had to kill the task a couple hours later.

kaiyoma commented 4 years ago

We encountered this failure again last night with the same symptoms. Now that we're running our test suites one at a time, the logs are smaller and easier to digest. Still seems like Cypress is at fault here.

In the Jenkins output, we can see that the test suite stalled after executing for only a minute:

03:20:20    (Run Starting)
03:20:20  
03:20:20    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
03:20:20    β”‚ Cypress:    5.0.0                                                                              β”‚
03:20:20    β”‚ Browser:    Chrome 84 (headless)                                                               β”‚
03:20:20    β”‚ Specs:      1 found (topology/overlay.spec.js)                                                 β”‚
03:20:20    β”‚ Searched:   cypress/integration/topology/overlay.spec.js                                       β”‚
03:20:20    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
03:20:20  
03:20:20  
03:20:20  ────────────────────────────────────────────────────────────────────────────────────────────────────
03:20:20                                                                                                      
03:20:20    Running:  topology/overlay.spec.js                                                        (1 of 1)
03:20:20    Adding --disable-dev-shm-usage...
03:20:20    Adding --disable-gpu...
03:20:20    Adding --window-size=1920,1080...
03:20:38  
03:20:38  
03:20:38    Topology overlay tests
03:21:24      βœ“ should navigate to Topology application (40059ms)
03:21:24      βœ“ should have the right options in the select (352ms)
04:01:08  Cancelling nested steps due to timeout
04:01:08  Sending interrupt signal to process
04:01:20  Terminated

In the full debug logfile, I can see that this is the last event we get from mocha:

2020-09-03T10:21:18.281Z cypress:server:reporter got mocha event 'hook end' with args: [ { id: 'r5', title: '"before each" hook', hookName: 'before each', hookId: 'h2', body: ...

There are no more mocha events and the test never starts execution. This is also verified by the video file that Cypress writes out. I can see in the video that the sidebar never "expands" for the doomed test. Instead of seeing the list of get and assert calls, the spinner next to the test simply spins forever.

I can furnish the Jenkins log, Cypress debug log, and Cypress video upon request (in a DM or some other private channel).

mayrbenjamin92 commented 4 years ago

We have the same issue, and we are using Cypress 5.2.0 as it seems to bring in several performance fixes. We have been using Cypress since the 2.x days and I feel that this is getting worse over time.

bahunov commented 3 years ago

Any update on this issue?

kaiyoma commented 3 years ago

Any update on this issue?

+1. Also waiting for updates. We continue to run into this issue often, even with the latest version (5.2.0). I've offered to furnish full logs, but no one's taken me up on it. Kind of feels like the Cypress folks are hoping this issue will just magically disappear.

danielcaballero commented 3 years ago

We are facing the same issue. Tried to update to 5.2.0 and all kind of workarounds but this is still happening.

muratkeremozcan commented 3 years ago

Similar issues here. There has been a runner change as well as an upgrade to Cypress 5.2.0. We have parallelization as well, and we are using the Cypress-included Docker image.

Firefox tests hang indefinitely. With Electron we get the same crashes, but rarely do the tests work.

The trend seems to be tests with UI login. For the time being we have disabled them in CI.

This might help someone.

// skips the wrapped tests on the given platform: fn only runs when Cypress.platform differs
const ignoreOnEnvironment = (environment: 'linux' | 'darwin' | 'win32', fn) => {
  if (Cypress.platform !== environment) {
    fn();
  }
};

Then, in the test, wrap any describe/context/it block:

ignoreOnEnvironment('linux', () => {
  describe(....)
});

dragon788 commented 3 years ago

If you are running Chrome or Chromium it can be more useful to use --disable-dev-shm-usage as a flag rather than trying to set the --shm-size for Docker or Docker-Compose, especially when running in Docker-in-Docker on Jenkins or other CI providers.

https://developers.google.com/web/tools/puppeteer/troubleshooting#tips

https://github.com/cypress-io/cypress/issues/5336#issue-505290346

muratkeremozcan commented 3 years ago

If you are running Chrome or Chromium it can be more useful to use --disable-dev-shm-usage as a flag rather than trying to set the --shm-size for Docker or Docker-Compose, especially when running in Docker-in-Docker on Jenkins or other CI providers.

https://developers.google.com/web/tools/puppeteer/troubleshooting#tips

#5336 (comment)

This helped. Thank you!

This info can also be found in the Cypress docs. If your tests are hanging in CI, take a look.

There are plenty of questions about configuring the plugins/index.js file. Here is a more complex example, which doesn't use the new --config-file pattern but instead the getConfigurationByFile pattern.

const fs = require('fs-extra')
const path = require('path')
// tasks go here
const percyHealthCheck = require('@percy/cypress/task')
const mailosaurTasks = require('./mailosaur-tasks')
// merge all tasks
const all = Object.assign({},
  percyHealthCheck,
  mailosaurTasks
)

function getConfigurationByFile(file) {
  const pathToConfigFile = path.resolve('cypress/config', `${file}.json`)

  return fs.readJson(pathToConfigFile)
}

module.exports = (on, config) => {
  on("task", all);
  // needed to address issues related to tests hanging in the CI
  on('before:browser:launch', (browser = {}, launchOptions) => {
    if (browser.family === 'chromium' && browser.name !== 'electron') {
      launchOptions.args.push('--disable-dev-shm-usage')
    }
    return launchOptions
  });

  const file = config.env.configFile || 'dev'
  return getConfigurationByFile(file)
}
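
For reference, with this pattern the environment-specific config is chosen at run time through the configFile env value, for example (the staging name is just an illustrative placeholder):

cypress run --env configFile=staging

which makes the plugins file read cypress/config/staging.json.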

AlexanderTunick commented 3 years ago

Similar issue here, and it has been going on for months already. Cypress version: "cypress": "5.1.0"

From time to time (several times a week) the execution just stops on random tests. (Screenshot attached in the original comment.)

No errors, no running out of memory. It just stops, and Jenkins continues its execution for "dead" tests.

We checked the server, Docker, and Jenkins, and there was nothing outstanding there.

We can't reproduce it reliably; it just happens periodically.

Garreat commented 3 years ago

https://github.com/cypress-io/cypress/issues/9350#issuecomment-739045845

prashantabellad commented 3 years ago

@bahmutov @jennifer-shehane Can anyone from Cypress team update on this?

gururajhm commented 3 years ago

Guys, I'm facing the same issue too. I have been using Cypress for almost 1 year and 4 months and never faced this issue; after I updated to the latest version it keeps hanging, or sometimes hits memory issues, both locally and in CircleCI. Badly need a fix...

-G

hermape7 commented 3 years ago

I have also spotted this in CircleCI jobs. I have disabled shm usage for Chrome (launchOptions.args.push("--disable-dev-shm-usage");), but the hanging still occurs from time to time (about 1 in 30 runs).

alexf101 commented 3 years ago

We're also experiencing this issue - about two to three minutes into a test suite that's otherwise working fine, the test runner hangs indefinitely. With debug logs enabled, there's absolutely nothing of interest to see, all logs stop except process_profiler, and there's no error message. I can reproduce this 100% of the time, the only variance is at what point in the test suite it'll hang.

The test suite in question uses the visual diff plugin to take snapshots of some of our UIKit reusable components in various configurations.

When I play with the shm_size or the --disable-dev-shm-usage flag it doesn't have much effect; maybe we got a little further on average using a large shm as opposed to disabling shm.

Cypress 5 or 6 didn't make a difference either, both have this symptom.

What did help was making the test suite smaller - I split it three ways so that each suite took under a minute, and now it passes reliably.

This is a really unstable solution, as some unsuspecting developer is going to come along and add some more tests to an existing suite, and the result is going to be that the test suite in question starts hanging indefinitely and ruins their day.

I'm happy to provide DEBUG logs or anything like that if it helps find a solution - as I said, I can reproduce this hang reliably. I can't send you my Docker container though as our application code under test is proprietary.

prashantabellad commented 3 years ago

@bahmutov @jennifer-shehane Can this please be prioritized? Otherwise it's a real dampener.

tsuna commented 3 years ago

What did help was making the test suite smaller - I split it three ways so that each suite took under a minute, and now it passes reliably.

Anecdotally, this is also our workaround. We used to run all our test suites in a single go, and this would cause Cypress to lock up much more often. Now we run each test suite separately, and the problem still occurs but a lot less frequently. It's very annoying because there is a huge startup cost to spin up Cypress for each test suite, so this makes our entire end-to-end tests take 3h instead of 2h or less.

We also pass --disable-dev-shm-usage, as Docker containers typically have a tiny /dev/shm, and we run everything in them.

bahunov commented 3 years ago

What did help was making the test suite smaller - I split it three ways so that each suite took under a minute, and now it passes reliably.

Anecdotally, this is also our workaround. We used to run all our test suites in a single go, and this would cause Cypress to lock up much more often. Now we run each test suite separately, and the problem still occurs but a lot less frequently. It's very annoying because there is a huge startup cost to spin up Cypress for each test suite, so this makes our entire end-to-end tests take 3h instead of 2h or less.

We also pass --disable-dev-shm-usage, as Docker containers typically have a tiny /dev/shm, and we run everything in them.

My issue disappeared when I did the following (still using v5 with the server/route feature):

  1. Run garbage collection after each test; works in Chrome, not sure about Electron.
  2. Set numTestsKeptInMemory to 0.

dragon788 commented 3 years ago

One thing that we ran into was that previously silent failures from doing things in beforeEvent or afterEvent hooks would also cause this type of exception/hang.

There were some warnings and some handling added for when exceptions occur outside the normal flow, but a lot of the time the issue happens if you are trying to look at some state in a spot where it isn't ready yet or has already been discarded.

ankit-madan commented 3 years ago

What did help was making the test suite smaller - I split it three ways so that each suite took under a minute, and now it passes reliably.

Anecdotally, this is also our workaround. We used to run all our test suites in a single go, and this would cause Cypress to lock up much more often. Now we run each test suite separately, and the problem still occurs but a lot less frequently. It's very annoying because there is a huge startup cost to spin up Cypress for each test suite, so this makes our entire end-to-end tests take 3h instead of 2h or less. We also disable pass --disable-dev-shm-usage as Docker containers typically have a tiny /dev/shm, and we run everything in them.

My issue disappeared when I did the following (still using v5 with the server/route feature):

  1. Run garbage collection after each test; works in Chrome, not sure about Electron.
  2. Set numTestsKeptInMemory to 0.

How do you run garbage collection in Cypress?

kaiyoma commented 3 years ago

It's been a few months since I commented, so I'll add an update. We still see this problem decently often: not a ton, but enough to possibly consider Cypress alternatives. We see this problem at least once a day and "Just retrigger the task" is becoming a tiresome refrain.

We've been able to track down a couple of our recurring stalls to our AUT itself. In some cases, our UI is doing a lot of React re-rendering, data crunching, or other high-CPU activities, and the browser seems to be hogging all the CPU and starving the Cypress process. If others are seeing these stalls happening in the same tests over and over, I'd suggest looking into a high CPU problem.

However, the majority of these hangs happen in a different test every time, usually when testing a part of the UI that isn't particularly resource-intensive. I'm looking at an example from 30 minutes ago and I can see that the test stalled 90 seconds into execution, which probably rules out any kind of memory leak or OOM situation.

Any advice from the Cypress folks about things we could do on our end to gather more information for you guys?

prashantabellad commented 3 years ago

@jennifer-shehane @bahmutov any updates/comments from your side?

retypepassword commented 3 years ago

My issue disappeared when I did the following (still using v5 with the server/route feature):

  1. Run garbage collection after each test; works in Chrome, not sure about Electron.
  2. Set numTestsKeptInMemory to 0.

How do you run garbage collection in Cypress?

@ankit-madan See https://github.com/cypress-io/cypress/issues/8525, except I had to use --expose_gc instead of --expose-gc.
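
For anyone else wondering, here's a minimal sketch of that approach (adapted from the recipe discussed in #8525; it assumes a Chromium-family browser, and win.gc only exists when the flag actually takes effect):

// cypress/plugins/index.js
module.exports = (on) => {
  on('before:browser:launch', (browser = {}, launchOptions) => {
    if (browser.family === 'chromium' && browser.name !== 'electron') {
      // note the underscore variant, as mentioned above
      launchOptions.args.push('--expose_gc')
    }
    return launchOptions
  })
}

// cypress/support/index.js
afterEach(() => {
  cy.window().then((win) => {
    // only defined when the browser was launched with the expose-gc flag
    if (typeof win.gc === 'function') {
      win.gc()
    }
  })
})

Combined with numTestsKeptInMemory: 0 in cypress.json, this is roughly the setup described above.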

retypepassword commented 3 years ago

We've been experiencing this issue in CircleCI (stalls/hangs with no output) with Cypress v6 and Chrome. I have logs from socket.io:socket to add (see below) if it helps.

I've noticed this sequence:

  1. "socket connected" event fires (let's call this socket 1)
  2. automation:client:connected fires and automationClient = socket occurs.
  3. "socket connected" fires again (socket 2)
  4. ???
  5. socket 2 disconnects. The automationClient.on('disconnect', …) handler doesn't fire when socket 2 disconnects, so no error message is logged. All the other events in the Socket class are handled using socket 2 (because they use socket.on(), not automationClient.on()).
  6. Cypress hangs with no further output

For what it's worth, adding socket.on('disconnect', …) to the top level of the this.io.on('connection', (socket) => { block catches the disconnect event. I've also seen Cypress hang with no output if the socket disconnects and then reconnects.
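
A rough sketch of where that extra handler would sit in the server's connection callback (the surrounding structure is Cypress's own code; the debug logger name here is just illustrative):

this.io.on('connection', (socket) => {
  // log every socket disconnect, not just the automation client's,
  // so a silently dropped runner socket at least leaves a trace
  socket.on('disconnect', (reason) => {
    debug('socket disconnected - reason: %s', reason)
  })

  // ...existing handlers (automation:client:connected, runner:connected, etc.)
})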

────────────────────────────────────────────────────────────────────────────────────────────────────

  Running:  [redacted]_spec.ts                                                         (5 of 13)
socket connected - writing packet <------- socket connected
joining room uH9QsXHm0ty8jFKzAAAI
packet already sent in initial handshake
joined room [ 'uH9QsXHm0ty8jFKzAAAI' ]
got packet {"type":2,"nsp":"/","data":["automation:client:connected"]} <-------- automationClient = socket
emitting event ["automation:client:connected"]
dispatching an event ["automation:client:connected"]
socket connected - writing packet <-------- socket connected
joining room pCQJu9N7VhNgLysHAAAJ
packet already sent in initial handshake
joined room [ 'pCQJu9N7VhNgLysHAAAJ' ]
got packet {"type":2,"nsp":"/","data":["runner:connected"]}
emitting event ["runner:connected"]
dispatching an event ["runner:connected"]
joining room runner
joined room [ 'runner' ]
got packet {"type":2,"nsp":"/","id":0,"data":["is:automation:client:connected",{"element":"__cypress-string","string":"0.25538962826612965"}]}
emitting event ["is:automation:client:connected",{"element":"__cypress-string","string":"0.25538962826612965"}]
attaching ack callback to event
dispatching an event ["is:automation:client:connected",{"element":"__cypress-string","string":"0.25538962826612965"},null]
sending ack [true]
got packet {"type":2,"nsp":"/","data":["app:connect","wfk9l"]}
emitting event ["app:connect","wfk9l"]
dispatching an event ["app:connect","wfk9l"]
got packet {"type":2,"nsp":"/","data":["spec:changed","integration/[redacted]_spec.ts"]}
emitting event ["spec:changed","integration/[redacted]_spec.ts"]
dispatching an event ["spec:changed","integration/[redacted]_spec.ts"]
got packet {"type":2,"nsp":"/","data":["watch:test:file",{"name":"[redacted]_spec.ts","relative":"cypress/integration/[redacted]_spec.ts","absolute":"/home/********/********/cypress/integration/[redacted]_spec.ts","specType":"integration"}]}
emitting event ["watch:test:file",{"name":"[redacted]_spec.ts","relative":"cypress/integration/[redacted]_spec.ts","absolute":"/home/********/********/cypress/integration/[redacted]_spec.ts","specType":"integration"}]
dispatching an event ["watch:test:file",{"name":"[redacted]_spec.ts","relative":"cypress/integration/[redacted]_spec.ts","absolute":"/home/********/********/cypress/integration/[redacted]_spec.ts","specType":"integration"}]
closing socket - reason transport close

norvinino commented 3 years ago

@jennifer-shehane Cypress closes automatically and stops working when a test is run

pnilesh10 commented 3 years ago

We moved to v5.3.0 and started using Electron (headless mode) instead of Chrome. We haven't encountered any stalls since then.

KeKs0r commented 3 years ago

I have tried running my e2e tests on 3x bigger containers, as well as running garbage collection between tests, and neither seemed to have an impact. So I am not sure if this is only caused by resource limitations or if there are other things at play.

heikomat commented 2 years ago

We were noticing the same symptom in our tests. When running on CI, the runners often just stop working after a while.

In our case the problem was not with Cypress, but with the host machine.

Looking at /var/log/syslog and dmesg we noticed a lot of logs that looked like this:

[11512495.394975] TCP: out of memory -- consider tuning tcp_mem
[11512581.913531] TCP: out of memory -- consider tuning tcp_mem
[11512616.330274] TCP: out of memory -- consider tuning tcp_mem
[11512640.837935] TCP: out of memory -- consider tuning tcp_mem
[11512676.482550] TCP: out of memory -- consider tuning tcp_mem
[11512748.301204] TCP: out of memory -- consider tuning tcp_mem

After reading this blog post (https://dzone.com/articles/tcp-out-of-memory-consider-tuning-tcp-mem) we looked for more info on the rmem_max and wmem_max values and decided to pick the values from this article (https://www.tecchannel.de/a/tcp-ip-tuning-fuer-linux,429773,6).

It basically explains these values and ends with

To change them, put this into your /etc/sysctl.conf

# increase Linux TCP buffer limits
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152

# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 2097152
net.ipv4.tcp_wmem = 4096 65536 2097152

And run sysctl -p to apply them

So we did, and since then two things changed

  1. The "TCP: out of memory" logs completely stopped appearing
  2. The Cypress runners no longer stopped working mid-run and our tests went smoothly

bahunov commented 2 years ago

I had similar issues in the pipeline, but it turned out to be the Java backend that was eating up all of the resources. The problem got fixed when I put a resource limit on the backend service.

kaiyoma commented 2 years ago

Since I posted this originally, I'll post an update of where we're at with this issue. We're on Cypress 9 now, and we're also using the Cypress Docker image (which was not true when I first posted this). Things have definitely gotten better since the summer of 2020 and now our Cypress tests are actually quite reliable.

We do run into this stalling issue occasionally, but now when it happens, it always points to a problem with the application under test, and it's always a high CPU issue. (For a long time, we thought it could be memory-related, and tried all sorts of things to address that, but they never made a difference.) Whenever we start seeing Cypress stall for a test, we'll investigate that specific part of the UI, and we always find that it's pretty sluggish, either because it's stuck in a rendering loop or doing too much other processing.

I'll note that we're still running all our Cypress tests one at a time (invoking cypress separately for each one). I haven't tried going back to running all of them at once since we're pretty stable at the moment and I don't want to rock the boat. For now, we're pretty content since we know what the stalling issue means for us. I wish the error/failure was more direct and obvious, but we have enough expertise with Cypress now to recognize this situation.

tsuna commented 2 years ago

I'll note that we're still running all our Cypress tests one at a time (invoking cypress separately for each one). I haven't tried going back to running all of them at once since we're pretty stable at the moment and I don't want to rock the boat

Well, if things are working reliably now, maybe it's time to try to run cypress just once for all the tests, as this would save a lot of time. And if we reproduce the issue in doing so, we can always roll back and report back here, as it would still point to an issue inherent to Cypress.

martincarstens commented 2 years ago

We were having the same issue in GitHub Actions. I wanted to post this here in case it helps anyone. I haven't seen this approach mentioned so wanted to make sure it gets added to the list of possible solutions.

Long story short, we had too many it blocks. Those blocks are resource-intensive according to Cypress's best practice documentation: https://docs.cypress.io/guides/references/best-practices#Creating-tiny-tests-with-a-single-assertion. Our CI containers were running out of resources.

We used them as a way to group assertions and we loved seeing all those green checkmarks. The solution was to remove them where they were not absolutely needed. We converted most of them to context blocks, which don't show anything in stdout, but they help us during development to group assertions logically.
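
A rough illustration of the kind of consolidation this implies (test names and selectors are made up for the example; the exact refactor will differ per suite):

// before: many tiny it blocks, each holding a single assertion
describe('nav bar', () => {
  it('shows the logo', () => { cy.get('[data-cy=logo]').should('be.visible') })
  it('shows the search box', () => { cy.get('[data-cy=search]').should('be.visible') })
})

// after: related assertions grouped into one it block, so there are far fewer test boundaries
describe('nav bar', () => {
  it('renders the expected elements', () => {
    cy.get('[data-cy=logo]').should('be.visible')
    cy.get('[data-cy=search]').should('be.visible')
  })
})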

stuart-clark commented 2 years ago

In an effort to help anyone, here's a list of things folks have tried with some degree of success:

I also saw some success using timeout (or gtimeout) while wrapping our Cypress command in retry-action ... just mind parsing nested : characters unless you use @dmvict's latest fork.

🍻

Before edit:

We are now configuring our runner to run on EC2 per ec2-github-runner by @machulav. It requires a little AWS know-how, but it's a great resource.

Using ec2-github-runner in addition to @kaiyoma's suggestion to run specs one file at a time allows our tests to pass quickly and reliably.

Being on EC2 lets you ssh into the instance, look around, troubleshoot, and control and allocate CPU and/or memory.

However, even in a t2.2xlarge (32 GiB of memory and 8 vCPU) we still ran into timeouts unless we split our tests with @kaiyoma's method.

My anecdotal feedback is that it seems highly correlated with it blocks and what happens around them, but I have never seen an exception, nor did I see memory or CPU issues in our EC2 instance, even while debugging as Cypress was stalling. I noticed slowness when using beforeEach hooks, so we removed those too.

Wandalen commented 2 years ago

@stuart-clark you may use wretry.action; it's up to date.

martincarstens commented 2 years ago

@stuart-clark Are you able to share your script for executing tests one at a time? We are currently doing this: https://gist.github.com/martincarstens/70dd91176e420c6c9ca3de96cafac71d - feel free to use it or adapt for your own use.

viriatis commented 2 years ago

I added this to the jenkins-slave Dockerfile that I use to run Cypress on Jenkins, and the build doesn't get stuck anymore:

# In summary, setting CI=true causes npm to set the Npm-In-CI header to true,
# thus as a result the data gathered (by npm) assumes the package(s) are being installed via a "build farm",
# (i.e. for Continuous Integration purposes), instead of a "human".
ENV CI=1 \
# disable shared memory X11 affecting Cypress v4 and Chrome
# https://github.com/cypress-io/cypress-docker-images/issues/270
  QT_X11_NO_MITSHM=1 \
  _X11_NO_MITSHM=1 \
  _MITSHM=0 \
  # point Cypress at the /root/cache no matter what user account is used
  # see https://on.cypress.io/caching
  CYPRESS_CACHE_FOLDER=/root/.cache/Cypress

mattvb91 commented 2 years ago

Upgraded to 9.6.0 and have now started seeing this issue where all the tests succeed but Cypress just hangs and never exits or sends a success signal. It just hangs there after displaying the test results.