I am not sure what the right solution is here, and would very much appreciate thoughts, feedback, and ideas.
Our CI can't continue as it currently stands.
There are several problems:

- test runs take forever to complete
- they randomly fail (often the fix is to rerun a test, which sucks)
- the CI environment is complex

The random failures come in multiple categories:

- builds fail because some dependency is unavailable, or fail randomly for some other reason (say, the gitbook compile)
- tests fail because they have a timing-based component or are otherwise nondeterministic

The slowness comes from:

- tests being slow by themselves. Has some cruft been accidentally introduced that has made them slower?
- the complexity of our CI environment?

Our custom CI environment (nearly) seemed a good idea once upon a time. It replaced Travis CI, which at the time seemed to be the standard. We replaced it because it was incredibly slow and frequently failed... so maybe the real problem here is less the CI environment and more our tests themselves?
Any suggestions here are very welcome!
I think our upcoming testing phase is a good time to try to tackle this. I would like for us not to have to go into another release with things as they stand.
As a good first step, I think we should identify which tests are flaky.
This could then provide a little laundry list of things to improve.
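One way to build that list would be to mine past CI results: a test that both passed and failed on the same commit is a strong flakiness candidate, since the code did not change between runs. Below is a minimal sketch of that idea. The input format, a list of `(test_name, commit_sha, passed)` tuples, is an assumption for illustration; our CI does not necessarily record results in this shape today.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return {test_name: number of commits with mixed results}, flakiest first.

    `runs` is an iterable of (test_name, commit_sha, passed) tuples.
    This record format is hypothetical, not something our CI emits as-is.
    """
    # Collect the distinct outcomes seen for each test on each commit.
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))

    # A test that both passed and failed on identical code is suspicious.
    flaky_counts = defaultdict(int)
    for (test_name, _commit_sha), seen in outcomes.items():
        if len(seen) == 2:
            flaky_counts[test_name] += 1

    # Sort by count descending, then by name for a stable order.
    ordered = sorted(flaky_counts.items(), key=lambda item: (-item[1], item[0]))
    return dict(ordered)
```

Tests that fail consistently on a commit are deliberately ignored here, as those are more likely real breakages than flakiness.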
In the short term, we could add a tag to these and rerun them a few times if they fail.
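The rerun idea could look roughly like the sketch below: a small wrapper that runs a command up to a fixed number of times and reports how many attempts it took. The function name and the default of three attempts are illustrative choices, not an existing part of our CI scripts.

```python
import subprocess


def run_with_retries(command, max_attempts=3):
    """Run `command` until it exits with status 0 or attempts run out.

    Returns (passed, attempts_used). Illustrative helper only.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            # A pass on attempt > 1 still signals flakiness and is worth logging.
            return True, attempt
    return False, max_attempts
```

Assuming the flaky tests carry a tag such as `flaky` (a hypothetical name), the main run could use `mix test --exclude flaky`, and the tagged tests could then be run separately with `run_with_retries(["mix", "test", "--only", "flaky"])`. Recording the attempt count keeps the retries from quietly hiding the problem.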
Not all changes need all tests to run. For example, if one makes changes only to air, then only the air tests and integration tests need to run. Similarly, changes to cloak would only need to run the integration, cloak, and compliance tests.
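As a rough sketch of how such a selection could work, the function below maps changed file paths to the test suites that need to run. The top-level directory names (`air/`, `cloak/`), the full set of suites, and the rule of running everything when an unrecognised path changes are all assumptions for illustration. Only the two mappings themselves come from the examples above.

```python
# Assumed complete set of suites; the real list may differ.
ALL_SUITES = {"air", "cloak", "integration", "compliance"}

# Hypothetical mapping from top-level directory to the suites it affects.
SUITES_BY_COMPONENT = {
    "air": {"air", "integration"},
    "cloak": {"cloak", "integration", "compliance"},
}


def suites_for_changes(changed_paths):
    """Return the set of test suites to run for the given changed paths."""
    suites = set()
    for path in changed_paths:
        component = path.split("/", 1)[0]
        if component not in SUITES_BY_COMPONENT:
            # Unknown area (shared config, CI scripts, ...): run everything.
            return set(ALL_SUITES)
        suites |= SUITES_BY_COMPONENT[component]
    return suites
```

The list of changed paths could come from something like `git diff --name-only` against the target branch. Falling back to the full set for unknown paths errs on the side of running too much, not too little.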
We could trade some more complexity for speed by using the built-in change tracking in ExUnit (presumably `mix test --stale`), which only reruns tests that could have been affected by changes since the last run.
We could try to switch CI setups in some way, but that is going to be rather labour-intensive with a somewhat uncertain payoff IMO.
We should have a way to run a full CI build locally. This can sometimes save a lot of time when debugging CI issues.