The threshold was introduced to accommodate "random errors" (e.g. a full BrowserStack queue, or random Puppeteer timeouts), so that the tests would not be noisy.
However, it's not a great solution. In cases like https://circleci.com/workflow-run/fa1e2bc6-450b-4c3c-b648-40f51f0580b1, there was only one smoke test, and it "errored" because of a genuine problem. The run still passed, because a single error was within the threshold.
A better solution would catch specific errors (like the BrowserStack queue one) and handle each one appropriately. For example, we could retry a test when it hits a queue timeout, or keep a list of specific error types that are allowed to fail, and fail on all other errors.
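
As a rough sketch of that idea (not the actual implementation), errors could be classified by message and then retried, tolerated, or rethrown. Everything here is hypothetical: the `classifyError` and `runWithRetry` names, the message patterns, and the default of three attempts are assumptions for illustration, not real BrowserStack or Puppeteer output.

```typescript
// Hypothetical categories for errors thrown by a smoke test.
type ErrorKind = "retryable" | "allowed" | "fatal";

// Assumed message patterns; the real BrowserStack/Puppeteer messages
// would need to be checked before relying on these.
const RETRYABLE_PATTERNS: RegExp[] = [/queue/i]; // e.g. BrowserStack queue full
const ALLOWED_PATTERNS: RegExp[] = [/navigation timeout/i]; // tolerated flakiness

function classifyError(error: Error): ErrorKind {
  if (RETRYABLE_PATTERNS.some((p) => p.test(error.message))) return "retryable";
  if (ALLOWED_PATTERNS.some((p) => p.test(error.message))) return "allowed";
  return "fatal"; // anything we do not recognise fails the run
}

// Runs a test, retrying only on retryable errors, tolerating allowed
// errors, and rethrowing everything else.
async function runWithRetry(
  test: () => Promise<void>,
  maxAttempts = 3
): Promise<"passed" | "tolerated"> {
  for (let attempt = 1; ; attempt++) {
    try {
      await test();
      return "passed";
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      const kind = classifyError(error);
      if (kind === "allowed") return "tolerated";
      if (kind === "retryable" && attempt < maxAttempts) continue;
      throw error; // fatal, or retries exhausted
    }
  }
}
```

With something like this, a genuine failure in the only smoke test would be classified as `fatal` and fail the run, instead of being absorbed by a numeric threshold.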