feature request: fail fast across nodes

mavenraven commented 1 month ago

Hi @ArturT! Thank you for creating Knapsack Pro!

@devinburnette and I are trying to enable --fail-fast in CI, but it currently will only fail the current parallel node and not the others. We were wondering about the feasibility of adding the ability to fail across all nodes, which would save build cost and time?

ArturT commented 1 month ago

Hi @mavenraven

In RSpec docs, I can see the following:

 --[no-]fail-fast[=COUNT]       Abort the run after a certain number of failures (1 by default).

Do you specify the number of failures? What's the number in your case?

When the --fail-fast flag is useful to you? Is it when someone commits changes that break too many tests, and there is no point in running such a build? Is it better to cancel it to save CI runtime costs? Are there other cases?

All nodes should be terminated if the certain number of tests fails across all nodes. If all nodes fail because they did not complete the test suite, then it would be hard to know which node executed failed tests. Perhaps it would be better to mark as failed nodes only these with failing tests and the rest of nodes could report as successful but with warning that the execution of tests has been canceled.

ArturT commented 1 month ago

A workaround solution could be to use junit or json formatter. https://docs.knapsackpro.com/ruby/rspec/#junit-formatter If a CI node fails with --fail-fast 3 then you could parse json report, count number of failed tests in it. If there are >=3 failing tests then you terminate the CI build using your CI provider API.

mavenraven commented 1 month ago

Do you specify the number of failures? What's the number in your case?

We don't specify a number (see more below).

When the --fail-fast flag is useful to you? Is it when someone commits changes that break too many tests, and there is no point in running such a build?

Yep, that's right. We don't want to continue expending CI credits once we know that a given PR can't succeed.

Is it better to cancel it to save CI runtime costs? Are there other cases?

The problem with canceling is that, at least on CircleCI, there's no way to end the workflow in a failed state. It's important to end in a failed state for our metrics. This is also why the suggested workaround won't work.

All nodes should be terminated if the certain number of tests fails across all nodes.

My understanding is that, with --fail-fast set, Knapsack Pro will break out of the loop that pulls down more tests from the queue, but it will not communicate that to the backend. Is that correct?

If all nodes fail because they did not complete the test suite, then it would be hard to know which node executed failed tests. Perhaps it would be better to mark as failed nodes only these with failing tests and the rest of nodes could report as successful but with warning that the execution of tests has been canceled

Successful but with warning would work well for our use case!

ArturT commented 1 month ago

Do you specify the number of failures? What's the number in your case?

We don't specify a number (see more below).

Ok. At least one failed test example should put the CI build into failed state then. That's a default behaviour of --fail-fast when the number is not specified.

When the --fail-fast flag is useful to you? Is it when someone commits changes that break too many tests, and there is no point in running such a build?

Yep, that's right. We don't want to continue expending CI credits once we know that a given PR can't succeed.

Is it better to cancel it to save CI runtime costs? Are there other cases?

The problem with canceling is that, at least on CircleCI, there's no way to end the workflow in a failed state. It's important to end in a failed state for our metrics. This is also why the suggested workaround won't work.

Where do you track these metrics? In CircleCI?

All nodes should be terminated if the certain number of tests fails across all nodes.

My understanding is that, with --fail-fast set, Knapsack Pro will break out of the loop that pulls down more tests from the queue, but it will not communicate that to the backend. Is that correct?

Yes. That's why other nodes might end up running the rest of the test suite. This can lead to slower build because nodes with failed tests would exit early.

ArturT commented 1 month ago

Here is a working solution to make --fail-fast working across parallel nodes on CircleCI:

Create a bin/ci_rspec_tests script and ensure the Knapsack Pro command has --fail-fast set.

#!/bin/bash

bundle exec rake "knapsack_pro:queue:rspec[--format documentation --format RspecJunitFormatter --out /tmp/test-results/rspec.xml --fail-fast]"
export TESTS_EXIT_CODE=$?

if [ "$TESTS_EXIT_CODE" -ne "0" ]; then
  echo "Consume tests from the Knapsack Pro Queue API quickly so that other CI nodes stop working sooner. This allows for a fail-fast option across parallel nodes."
  bin/knapsack_fail_fast
fi

Create a bin/knapsack_fail_fast script:

#!/usr/bin/env ruby
require_relative '../config/boot'
require 'knapsack_pro'

# uncomment the following line for your test runner
ENV['KNAPSACK_PRO_TEST_SUITE_TOKEN'] = KnapsackPro::Config::Env.test_suite_token_rspec
#ENV['KNAPSACK_PRO_TEST_SUITE_TOKEN'] = KnapsackPro::Config::Env.test_suite_token_minitest
#ENV['KNAPSACK_PRO_TEST_SUITE_TOKEN'] = KnapsackPro::Config::Env.test_suite_token_cucumber

ENV['KNAPSACK_PRO_FIXED_QUEUE_SPLIT'] = 'false'

def fetch_test_files_from_queue
  action = KnapsackPro::Client::API::V1::Queues.queue(
    fixed_queue_split: ENV['KNAPSACK_PRO_FIXED_QUEUE_SPLIT'],
    can_initialize_queue: false,
    attempt_connect_to_queue: false,
    commit_hash: KnapsackPro::Config::Env.commit_hash,
    branch: KnapsackPro::Config::Env.branch,
    node_total: KnapsackPro::Config::Env.ci_node_total,
    node_index: KnapsackPro::Config::Env.ci_node_index,
  )
  connection = KnapsackPro::Client::Connection.new(action)
  response = connection.call
  if connection.success?
    raise ArgumentError.new(response) if connection.errors?
    KnapsackPro::TestFilePresenter.paths(response['test_files'])
  else
    raise ArgumentError.new("Couldn't connect with Knapsack Pro API. Response: #{response}")
  end
end

loop do
  batch_of_test_files = fetch_test_files_from_queue
  puts batch_of_test_files.inspect
  break if batch_of_test_files.empty?
end

exit 1

Update your CircleCI yml file .circleci/config.yml:

      - run:
          name: Run tests
          command: |
            if [ "$CIRCLE_NODE_TOTAL" = "1" ]; then
              mkdir /tmp/test-results
              circleci tests glob "spec/**/*_spec.rb" | circleci tests run --command="xargs bundle exec rspec --format documentation --format RspecJunitFormatter --out /tmp/test-results/rspec.xml" --verbose --split-by=timings
            else
              # https://docs.knapsackpro.com/ruby/circleci/
              mkdir -p /tmp/test-results
              export KNAPSACK_PRO_RSPEC_SPLIT_BY_TEST_EXAMPLES=true
              export KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE=/tmp/tests_to_run.txt
              circleci tests glob "spec/**/*_spec.rb" | circleci tests run --index 0 --total 1 --command ">$KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE xargs -n1 echo" --verbose
              bin/ci_rspec_tests
            fi

The above example uses CircleCI Rerun failed tests feature: https://docs.knapsackpro.com/ruby/circleci/#rerun-only-failed-tests

If you don't use it, then you can use this:

      - run:
          name: Run tests
          command: |
            mkdir /tmp/test-results
            bin/ci_rspec_tests

mavenraven commented 1 month ago

Here is a working solution to make --fail-fast working across parallel nodes on CircleCI:

Create a bin/ci_rspec_tests script and ensure the Knapsack Pro command has --fail-fast set.

#!/bin/bash

bundle exec rake "knapsack_pro:queue:rspec[--format documentation --format RspecJunitFormatter --out /tmp/test-results/rspec.xml --fail-fast]"
export TESTS_EXIT_CODE=$?

if [ "$TESTS_EXIT_CODE" -ne "0" ]; then
  echo "Consume tests from the Knapsack Pro Queue API quickly so that other CI nodes stop working sooner. This allows for a fail-fast option across parallel nodes."
  bin/knapsack_fail_fast
fi

Create a bin/knapsack_fail_fast script:

#!/usr/bin/env ruby
require_relative '../config/boot'
require 'knapsack_pro'

# uncomment the following line for your test runner
ENV['KNAPSACK_PRO_TEST_SUITE_TOKEN'] = KnapsackPro::Config::Env.test_suite_token_rspec
#ENV['KNAPSACK_PRO_TEST_SUITE_TOKEN'] = KnapsackPro::Config::Env.test_suite_token_minitest
#ENV['KNAPSACK_PRO_TEST_SUITE_TOKEN'] = KnapsackPro::Config::Env.test_suite_token_cucumber

ENV['KNAPSACK_PRO_FIXED_QUEUE_SPLIT'] = 'false'

def fetch_test_files_from_queue
  action = KnapsackPro::Client::API::V1::Queues.queue(
    fixed_queue_split: ENV['KNAPSACK_PRO_FIXED_QUEUE_SPLIT'],
    can_initialize_queue: false,
    attempt_connect_to_queue: false,
    commit_hash: KnapsackPro::Config::Env.commit_hash,
    branch: KnapsackPro::Config::Env.branch,
    node_total: KnapsackPro::Config::Env.ci_node_total,
    node_index: KnapsackPro::Config::Env.ci_node_index,
  )
  connection = KnapsackPro::Client::Connection.new(action)
  response = connection.call
  if connection.success?
    raise ArgumentError.new(response) if connection.errors?
    KnapsackPro::TestFilePresenter.paths(response['test_files'])
  else
    raise ArgumentError.new("Couldn't connect with Knapsack Pro API. Response: #{response}")
  end
end

loop do
  batch_of_test_files = fetch_test_files_from_queue
  puts batch_of_test_files.inspect
  break if batch_of_test_files.empty?
end

exit 1

Update your CircleCI yml file .circleci/config.yml:

      - run:
          name: Run tests
          command: |
            if [ "$CIRCLE_NODE_TOTAL" = "1" ]; then
              mkdir /tmp/test-results
              circleci tests glob "spec/**/*_spec.rb" | circleci tests run --command="xargs bundle exec rspec --format documentation --format RspecJunitFormatter --out /tmp/test-results/rspec.xml" --verbose --split-by=timings
            else
              # https://docs.knapsackpro.com/ruby/circleci/
              mkdir -p /tmp/test-results
              export KNAPSACK_PRO_RSPEC_SPLIT_BY_TEST_EXAMPLES=true
              export KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE=/tmp/tests_to_run.txt
              circleci tests glob "spec/**/*_spec.rb" | circleci tests run --index 0 --total 1 --command ">$KNAPSACK_PRO_TEST_FILE_LIST_SOURCE_FILE xargs -n1 echo" --verbose
              bin/ci_rspec_tests
            fi

The above example uses CircleCI Rerun failed tests feature: https://docs.knapsackpro.com/ruby/circleci/#rerun-only-failed-tests

If you don't use it, then you can use this:

      - run:
          name: Run tests
          command: |
            mkdir /tmp/test-results
            bin/ci_rspec_tests

Thanks @ArturT! We'll give this a shot.

ArturT commented 3 weeks ago

@mavenraven Have you tested it?

mavenraven commented 3 weeks ago

@ArturT Nope, we haven't had a chance yet. I'll let you know when we do!

mavenraven commented 1 week ago

Hey @ArturT! I was able to test out the solution you provided yesterday, and the behavior is exactly what was hoped for. Thanks for that!

Ideally, we wouldn't be responsible for maintaining the logic in knapsack_fail_fast as we would need to inject the script into all of our pipelines and it could break when new versions of the gem are released. What do you think about adding something like --fail-fast-across-nodes to rake knapsack_pro:queue:rspec to encapsulate it?

Alternatively, perhaps the Knapsack Pro API could provide something like POST /queues/:queue_name/cancel and the ability to optionally set the queue name when calling POST /queues/queues, combined with a switch like --queue-name to rake knapsack_pro:queue:rspec?

Thanks again for your help!

ArturT commented 1 week ago

I'm happy it works for you. Can you continue using it for the time being?

In the meantime, we can gather more feedback from other users interested in the fail-fast feature to help shape the solution.

mavenraven commented 1 week ago

@ArturT The work I did was only a proof of concept, so we'll wait until the more holistic solution is ready. Feel free to ping me if you need any other information as you talk to other users.

ArturT commented 6 days ago

I can’t promise if or when we’ll decide to implement this feature. However, I’m tracking it in our backlog and will keep you updated if we make any progress.

In the meantime, please reconsider using the workaround provided: https://github.com/KnapsackPro/knapsack_pro-ruby/issues/268#issuecomment-2285069109

I appreciate your understanding.

KnapsackPro / knapsack_pro-ruby

feature request: fail fast across nodes #268