Closed CMCDragonkai closed 2 years ago
It's also interesting that we are hitting testing-scalability problems. We may need to simply invest in bigger computers too to be able to do the tests well.
@joshuakarp @tegefaulkes
We should get all tests that use the WorkerManager
and possibly separate them from the main tests. So right now test/keys/KeyManager.test.ts
tests with WorkerManager
as well. We might want to have a separate file tests/keys/KeyManagerWorkers.test.ts
instead, this can allow us to use a pattern to separately test them.
It is no longer feasible to expect devs to always be doing full npm test -- --runInBand
on every MR. It takes too long, parallel testing would help, but time out errors prevent that.
To ensure development tempo, we will need to solve this problem. The number of tests are only going to grow as the application gets larger.
We may need to do a sort of hierarchy. Invest in bigger CI/CD machines that can do full integration testing that uses many cores and many machines, while devs continue to focus on domain-specific testing. That way devs can work on weaker machines and centralise the computational effort.
We should also identify slow tests and a proper threshold. All tests should complete sooner than 5 seconds, any tests larger than that should be looked into broken down. When a test cannot be broken down, it needs to be isolated and separated into a separate section where they are longer tests.
Certain domains are inherently "integrative". The nodes
domain in particular integrates many parts of the system and this involves quite heavy tests. It would ideal to separate the code, so that utility functions can be checked independently of full nodes related testing. See how keys
domain separates KeyManager
and the keys/utils
. @joshuakarp
The nodes/NodeManager.test.ts
is incredibly slow. One test takes 81 seconds. Why does this test take so long?
PASS tests/nodes/NodeManager.test.ts (157.301 s)
NodeManager
✓ pings node (81434 ms)
✓ finds node (local) (3653 ms)
✓ finds node (contacts remote node) (6441 ms)
✓ cannot find node (contacts remote node) (28109 ms)
✓ knows node (true and false case) (3027 ms)
getConnectionToNode
✓ creates new connection to node (5612 ms)
✓ gets existing connection to node (5966 ms)
✓ concurrent connection creation to same target results in 1 connection (5820 ms)
Cross signing claims
✓ can successfully cross sign a claim (4895 ms)
pings node
has that crazy time mostly because it's checking the failure case too (where a node is offline). There's a few cases of this:
In general though, NodeManager.test.ts
is also slow because it hasn't been migrated to using only a single PolykeyAgent
across the larger integration tests. This will need to be done at some stage.
Can we separate these tests, that would enable concurrent execution.
In addition to this, jest has the ability to mock timers. https://jestjs.io/docs/timer-mocks and we should be setting the timeouts to be much shorter in our test cases compared to production.
Right now running domain level tests take too long at least for nodes.
On 27 October 2021 9:33:33 am AEDT, Josh @.***> wrote:
pings node
has that crazy time mostly because it's checking the failure case too (where a node is offline). There's a few cases of this:
- node is offline: this takes approximately 20 seconds for the connection attempt to timeout (and therefore, the ping to fail)
- existing connection goes offline: I believe it requires approximately 30 seconds for an existing connection to be "dropped" in the networking domain (related to the keep-alive packets?)
In general though,
NodeManager.test.ts
is also slow because it hasn't been migrated to using only a singlePolykeyAgent
across the larger integration tests. This will need to be done at some stage.-- You are receiving this because you were assigned. Reply to this email directly or view it on GitHub: https://github.com/MatrixAI/js-polykey/issues/264#issuecomment-952379158 -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Note for myself, when splitting the nodes tests and making these more efficient (particularly the ping
test):
ForwardProxy
timeouts. I looked into this before, but the injected timeouts didn't seem to work as expected (i.e. setting the timer didn't seem to adjust the timeout time for a connection attempt):static async createForwardProxy({
authToken,
connConnectTime = 20000,
connTimeoutTime = 20000,
connPingIntervalTime = 1000,
logger,
}: {
authToken: string;
connConnectTime?: number;
connTimeoutTime?: number;
connPingIntervalTime?: number;
logger?: Logger;
}): Promise<ForwardProxy> {
When refactoring the tests, ideally our domain-level tests should always aim to be unit tests (i.e. minimal tests on the functionality, and that mock things that are required to be injected/used). Timer mocks should also be looked into (see above https://github.com/MatrixAI/js-polykey/issues/264#issuecomment-952419944) such that we don't have to wait for the full timeout expected in production (e.g. connection timeouts especially).
Any longer/integration tests should be separately executed, such that they aren't executed in the same test suite as these smaller unit tests. We should look into jest's tagging solution (https://www.npmjs.com/package/jest-runner-groups) that would allow us to tag via comments and run tests across all domains, whilst separately executing the longer integration tests. This way we wouldn't have to keep them in separate directories either.
This would most likely be a good issue for @emmacasolin to act in a supporting role too once she's back from leave.
- further investigate the existing
ForwardProxy
timeouts. I looked into this before, but the injected timeouts didn't seem to work as expected (i.e. setting the timer didn't seem to adjust the timeout time for a connection attempt):
I think I've worked out why this wasn't working - it looks like the timeout needs to be set on both the forward proxy of the node doing the pinging AND the reverse proxy of the node being pinged. Setting the timeout on both ends to one second reduces the time the nodes ping test takes to just under a minute (a reduction of 20-30 seconds), however the need to start, stop, and restart the remote node during the test is keeping the test time high. Moving the initial setup and destruction of the remote node to before/after blocks cuts off another 10 seconds, but if we move to a different method of mocking keynodes then this additional setup time shouldn't be a problem anyway.
The field I'm setting is connTimeoutTime
on both the fwd proxy of the pinging node and the rev proxy of the pinged node. connConnectTime
and connPingIntervalTime
don't seem to make any difference to the time the test takes to complete.
Mocking the reverse proxy so that we don't even use a remote keynode brings the test time down to 10ms! The question is does doing it this way test everything we need it to? If we assume all the connection-side stuff is being tested in the network tests then maybe this is fine? It may also be possible to mock something a little further along in the call so that more of the actual functionality is tested.
Mocking the reverse proxy so that we don't even use a remote keynode brings the test time down to 10ms! The question is does doing it this way test everything we need it to? If we assume all the connection-side stuff is being tested in the network tests then maybe this is fine? It may also be possible to mock something a little further along in the call so that more of the actual functionality is tested.
Will take a look at this after seed nodes.
Tests involving a global agent has been done in tests/bin/utils.ts
. This is applicable only to tests/bin
atm. This means bin tests will share the same global agent where possible. Due to test worker pools, we have also ensured that all test modules serialises their access to the global agent by way of a lock. This avoids having to create lots of polykey agents.
The limiting factor is that creating a polykey agent is expensive. Mostly due to key generation process among many other factors like creating network servers... etc. If the key generation process wasn't that expensive, then we could just create new polykey agents each time we wanted to test something. Here are the most expensive things that happen when creating a polykey agent (probably in order of cost):
WorkerManager
- multiprocessing poolKey generation process might be improved with #168.
For now though, other tests domains also make use of potentially multiple polykey agents. Rather than having all tests synchronise on one global agent. Each test domain can make their own decision here and create their "scoped" global PK agent. This could take place in these forms:
beforeAll
- Sharing a PK agent in beforeAll
for a test module across all test functions. - Better but still some what SLOW if there are lots of test modules. Note that all test functions are serialised.For 3. the trick is to ensure that the same PK agent is shared across the jest worker pool which requires that all jest workers agree on the same "directory" for the node path.
Right now I've done this through tests/globalSetup.ts
and tests/globalTeardown.ts
and tests/setup.ts
, however it is now creating a problem when running the jest
on multiple terminals. https://github.com/MatrixAI/js-polykey/pull/292#issuecomment-992474418
There's a solution to using globals
in the jest.config.ts
described here: https://github.com/facebook/jest/issues/9037#issuecomment-696322001. And that would give ensure that the node path is indeed shared as well as unique to each execution of npm test
or jest
because jest.config.ts
is itself a script.
But this also means that all relevant test domains need to be aware of which global we are using.
So the priorities are:
jest.config.ts
to create truly shared global state between all worker pools. For global agents that can be shared with method 3. And also solves the multiple-terminal problem.Each domain needs to indicate their strategy here.
Doing the above will also potentially refactor the test/utils.ts
functions of:
setupRemoteKeynode
cleanupRemoteKeynode
addRemoteDetails
As these are primarily used to deal with remote keynodes. We would want to standardise on the logic we have setup for pkAgent
on test/bin/utils.ts.
Things like using the test provider and ensuring that works too would be relevant as per #278.
Note that by mocking the keypair generation creating polykey agents are alot faster. So much so that in many cases the global agent is not necessary. I've just done this in tests/PolykeyClient.test.ts
.
For sharing a global agent across a test directory, refer to the new setupGlobalAgent
utility described here: https://github.com/MatrixAI/js-polykey/pull/292#issuecomment-994415690
My tests show that with mocked global keypair starting and stopping the PK agent takes about 3 seconds.
[nix-shell:~/Projects/js-polykey]$ npm test -- tests/PolykeyAgent.test.ts
> @matrixai/polykey@1.0.0 test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/PolykeyAgent.test.ts"
Determining test suites to run...
GLOBAL SETUP
Global Data Dir: /tmp/polykey-test-global-xltUTl
PASS tests/PolykeyAgent.test.ts (13.662 s)
PolykeyAgent
✓ PolykeyAgent readiness (2993 ms)
✓ creates node path (2833 ms)
Test Suites: 1 passed, 1 total
Tests: 2 passed, 2 total
Snapshots: 0 total
Time: 13.696 s, estimated 14 s
Ran all test suites matching /tests\/PolykeyAgent.test.ts/i.
GLOBAL TEARDOWN
Destroying Global Data Dir: /tmp/polykey-test-global-xltUTl
So indeed for alot of tests it would be sufficient to create their own pk agent at least for sharing in a test module without having to share the full global agent.
This has been achieved in a number of ways.
pkSpawn
should take about 4 seconds on local machine. Simply because our local computers are better.KeyManager
.pkSpawn
. If you're able to mock keypairs and start your own agents, it's much faster to do so.jest
tests.tests/bin
.globalThis.maxTimeout
and globalThis.defaultTimeout
. The other 2 timeouts should be deprecated and removed and replaced with these 2. Do not have intermediate timeouts inside the test functions, this is not a good idea, and we should expect good performance in the entire test function.
Specification
Right now running all the tests multi-core results in tests timing out.
This is simply due to sheer size of the tests. Asynchronous tests have timeouts applied to each test.
A quick solution would be remove timeouts for some of our asynchronous tests, but then we don't know if the tests are making progress.
There are subsection of tests that involve OS side-effects and these seem the source of concurrent test timeouts (probably because they are blocked on a OS-kernel/syscall and the OS is overloaded and thus cannot return in time):
tests/bootstrap/bootstrap.test.ts
tests/agent/utils.test.ts
tests/bin/agent.test.ts
These could involve filesystem side effects, process lifecycle side effects and locking side-effects.
If the OS gets overloaded, these things can slow down as we rely on external system, the OS is essentially slowed down and therefore the tests get timed out.
Right now we are forced to do
npm test -- --runInBand
which slows down testing considerably. Ideally we can use all the cores.Additional context
238 - refactor agent spawning tests, speed of process execution is going to be important here
250 - parallelising for CI/CD
Tasks
beforeAll
to share resources like objects, but these can be conflicting withbeforeAll
used in other test files, so these need to be managed properlydescribe
, alsotest.concurrent
can be used within a single test file to do concurrent testing, but it should be compared with justPromise.all
toojest
.