Shopify / ci-queue

Distribute tests over many workers using a queue
MIT License
171 stars 27 forks source link

Using a Redis server that requires SSL connection with self signed certificate exits 0 with 0 tests run #292

Open schneems opened 4 days ago

schneems commented 4 days ago

The issue is described in my debugging notes here https://github.com/heroku/heroku-buildpack-ruby/issues/1505#issuecomment-2469218290

Context

Heroku rolled out a SSL connections by default https://devcenter.heroku.com/changelog-items/2992. However all certificates are self-signed. (as the servers do not "own" the domain compute-1.amazonaws.com). Therefore we recommend people disable peer verification https://devcenter.heroku.com/articles/connecting-heroku-redis#connecting-in-ruby.

When this change happened, my CI was reporting that my runs were successful however I never noticed that they were executing 0 tests:

-----> Running test command `bundle exec rspec-queue --max-requeues=3 --timeout 180 --queue $REDIS_URL --format documentation || { cat log/test_order.log;  $(exit 1); }`...
Finished in 9.72 seconds (files took 20.02 seconds to load)
0 examples, 0 failures
-----> test command `bundle exec rspec-queue --max-requeues=3 --timeout 180 --queue $REDIS_URL --format documentation || { cat log/test_order.log;  $(exit 1); }` completed successfully

While debugging, I tried connecting to a legacy non-ssl url (REDIS_TEMPORARY_URL) and when I made that change everything worked fine.

Expected

When I try to connect to a redis instance and the connection fails for some reason that it will exit non-zero.

Actual

It reports 0 tests were executed and then does not fail

Suggestion

Ideally if the runner cannot connect to redis (for any reason) it would hard fail and ideally give more connection information. I understand that rspec-queue is marked deprecated in the readme, however I'm guessing that this behavior might also persist in other shared connection code-paths.

Beyond improving the connection failure message, if an flag or env var could be added to set the verification mode to none for use with hosted/self-signed certs, that would be helpful as I have no other way of working around this problem at this time.

casperisfine commented 3 days ago

This behavior is actually by design. ci-queue is designed so that all workers are expendable and soft failing except for actual test failures. This include not being able to reach Redis.

As such, the reporter step isn't optional, it's 100% required as it's the only thing that ensure all queued tests were ran.

This is definitely not documented enough, and you are not the first one to get bit by this.

What we could probably do to is to make this opt-in with some sort of --soft-failure-code 0 or something, as this was done at a time where our CI system didn't support registering some error code for soft failure.

if an flag or env var could be added to set the verification mode to none

That's a different feature request.

casperisfine commented 3 days ago

What we could probably do to is to make this opt-in with some sort of --soft-failure-code 0 or something

@ChrisBr any opinion?

schneems commented 3 days ago

Thanks for the extra context. In my case such a flag would meet my needs for increasing confidence. I'm running a small cluster, 16 nodes, and I don't expect any of them to fail or be unavailable.

all workers are expendable and soft failing except for actual test failures

CAP strikes again I guess. It makes sense that you would want your distributed job to allow for failure of nodes.

This include not being able to reach Redis.

Short of adding a new feature, lower hanging fruit could be to emit a warning saying when the worker is going into a soft-failure mode, ideally along with why to differentiate failure modes. Possibly to link to some docs explaining why it's not exiting 0 i.e. highlighting that it's designed with expendable test workers in mind. Basically, raise awareness of the issue.

That's a different feature request.

👍 I'm going to poke at it a bit. I'll be at RubyConf this week, so probably won't have a PR until after that. Thanks for taking a look at the issue.