MarkyMarkMcDonald commented 3 years ago

We use rspec-retry to deal with flaky browser specs. Unfortunately, we're currently running into an issue where Firefox / Selenium / geckodriver crashes and does not recover (we believe it's related to Docker memory or too many open file handles). All retry attempts of the spec fail, and any later specs that use the browser also fail.

We're working towards solving the underlying issue, but is there a mechanism with knapsack to retry failed specs on another worker?

We've had a few other issues that have caused an entire worker to be inoperable (like postgres crashing and specs not waiting for it to recover), so I think this is applicable to more than just our current issue.

ArturT commented 3 years ago

Hi @MarkyMarkMcDonald

> is there a mechanism with knapsack to retry failed specs on another worker?

No.

Even if there were such a mechanism, there would be edge cases. When could we consider that all parallel jobs have completed their work properly? Let's say you have a problematic CI job where the Firefox process keeps failing, so you can't run tests there. Let's say its tests are put back in the Queue in knapsack_pro Queue Mode so that other parallel CI jobs can consume them. It's possible that all the other jobs finished processing their tests at about the same time the problematic CI job failed, so no parallel job would be available to pick up the tests from the problematic CI job.

Most likely there are ways to solve this edge case, for example by forcing CI jobs to wait for some time until all test files are acknowledged by CI jobs, so that we know all jobs executed their tests.

There would also be an edge case when tests are not acknowledged by CI jobs, and we would have to handle that as well - maybe with a timeout.

There are probably more edge cases to consider as well. These are just some off the top of my head.

Most likely there is no simple solution to this right away. We would have to collect more feedback from other users on whether they would find it useful to auto-assign tests to other jobs when a CI job can't run tests, and then try to find as simple a solution as possible to avoid edge cases.

The simplest actions for you to take for now could be:

- What CI provider do you use? Some of them, like Buildkite, allow you to automatically retry failed jobs on a new isolated machine, which should restart Firefox. Maybe there is a way to configure auto-retry of the parallel job on your CI server?
- You could try adding more resources (CPU/RAM/disk) to the CI server to see if it's more stable. Upgrade to the latest Firefox to rule out a bug in the browser.

What is your organization ID or email? You can send it to support@knapsackpro.com and I can review your account. Do you use Queue Mode or Regular Mode?

MarkyMarkMcDonald commented 3 years ago

Thanks @ArturT - sorry for the delay, this slipped my radar.

This all makes sense to me, thanks for detailing out everything 👍 .

For our specific case:

- We're using CircleCI. I don't think there's an automated retry - I think we'd have to write some custom retry logic, like "Detect if Firefox crashed based on the RSpec exceptions and have a dependent job run".
- Lack of resources has been suggested by various Stack Overflow answers when digging into the root of the crashes - we've tried tweaking the worker sizes, but it's still reproducible with the next size up. I'll take a look at switching to the latest Firefox and geckodriver, that's a good call 👍 .

I'll send in that email and try to find a few examples of runs exhibiting the crashes, thanks!

ArturT commented 3 years ago

I'm pasting an idea here that might be useful for others looking at this issue:

You could collect the test file paths of failing tests from all parallel nodes and write them to a file, for example tmp/my_failing_specs.txt, to be used as KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE=tmp/my_failing_specs.txt. You could extract the paths from the JUnit XML report. https://knapsackpro.com/faq/question/how-to-use-junit-formatter#how-to-use-junit-formatter-with-knapsack_pro-queue-mode

Use this list of test files (tmp/my_failing_specs.txt) to initialize a new CI build (or a dependent CircleCI job): https://knapsackpro.com/faq/question/how-to-run-a-specific-list-of-test-files-or-only-some-tests-from-test-file (please make sure you have updated the knapsack_pro gem version first). The list of test files in tmp/my_failing_specs.txt must be the same set of tests on all parallel nodes. Only one parallel CI node initializes the Queue with the set of tests (the one that hits our API endpoint first - we don't know which one it will be). That is why you must have the same set of test files in tmp/my_failing_specs.txt on all parallel CI nodes. You would have to collect the list of failed test files from all CI nodes and merge it into one file, tmp/my_failing_specs.txt, before you run a new CI build (or a new job with retried failed tests).
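As a rough sketch of the collect-and-merge step described above (not an official knapsack_pro script): the function below scans JUnit XML reports gathered from all parallel nodes and writes a deduplicated list of failing spec files. The function name `collect_failed_specs` and the `reports/` directory are hypothetical. It also assumes the reports look like typical rspec_junit_formatter output, where each `<testcase>` starts on its own line and carries a `file="..."` attribute, so check this against your own reports first.

```shell
# Hypothetical helper: collect_failed_specs <report_dir> <output_file>
# Assumes every *.xml file in <report_dir> is a JUnit report from one parallel node.
collect_failed_specs() {
  local report_dir="$1"
  local output="$2"

  mkdir -p "$(dirname "$output")"

  # Remember the file="..." attribute of the most recent <testcase> and print it
  # when a <failure> or <error> element follows on the same or a later line.
  awk '
    /<testcase / {
      current = ""
      if (match($0, / file="[^"]*"/)) {
        current = substr($0, RSTART + 7, RLENGTH - 8)
      }
    }
    /<(failure|error)[ >]/ { if (current != "") print current }
  ' "$report_dir"/*.xml |
    sed 's|^\./||' |   # strip the leading "./" from paths like ./spec/foo_spec.rb
    sort -u > "$output"
}
```

It could be called as `collect_failed_specs reports tmp/my_failing_specs.txt` in a step that runs after all parallel nodes have uploaded their reports. The resulting file then has to be made available unchanged to every parallel node of the rerun, because any of them may be the one that initializes the Queue.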

ArturT commented 1 year ago

story

Running failed tests on another worker/CI node is part of the broader idea of improving the Queue API. Related internal ticket: https://trello.com/c/KjXa29IJ

MarkyMarkMcDonald commented 1 year ago

We're not running into the original problem I posted about anymore (feel free to close this issue), but as an FYI - CircleCI has started experimental support for "rerun failed tests only".

We're trying this out, and here's how we combine CircleCI failure retries with Knapsack Pro:

    # Use circleci cli to find out if we need to run all tests or just failed tests
    # We are telling circleci to split the tests across 1 node to get the full list of all tests for consideration. We leave the splitting to Knapsack Pro.
    circleci tests glob "spec/**/*_spec.rb" | circleci tests run --index 0 --total 1 --command ">files.txt xargs echo" --verbose > files.txt

    # replace all spaces with newlines in files.txt file
    sed -i 's/ /\n/g' files.txt

    # tell knapsack pro to run only tests from files.txt (and still use queueing magic)
    if [[ -s "files.txt" ]]; then
      export KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE=files.txt
      # quote the whole task argument so the shell does not treat the brackets as a glob pattern
      bundle exec rake "knapsack_pro:queue:rspec[${EXCLUDE_TAGS} ${FORMATTER_OPTIONS}]"
    fi

ArturT commented 1 year ago

@MarkyMarkMcDonald Thanks for sharing the example.

ArturT commented 11 months ago

SOLUTION

Here is an example of how to rerun only failed tests on CircleCI: https://docs.knapsackpro.com/ruby/circleci/#rerun-only-failed-tests

KnapsackPro / knapsack_pro-ruby

Is it possible to retry failing specs on another worker? #157
