Fix Concurrent Testing Scalability - Find Blocking Code, Make them Non-Blocking

CMCDragonkai commented 3 years ago

Specification

Right now running all the tests multi-core results in tests timing out.

This is simply due to sheer size of the tests. Asynchronous tests have timeouts applied to each test.

A quick solution would be remove timeouts for some of our asynchronous tests, but then we don't know if the tests are making progress.

There are subsection of tests that involve OS side-effects and these seem the source of concurrent test timeouts (probably because they are blocked on a OS-kernel/syscall and the OS is overloaded and thus cannot return in time):

tests/bootstrap/bootstrap.test.ts
tests/agent/utils.test.ts
tests/bin/agent.test.ts

These could involve filesystem side effects, process lifecycle side effects and locking side-effects.

If the OS gets overloaded, these things can slow down as we rely on external system, the OS is essentially slowed down and therefore the tests get timed out.

Right now we are forced to do npm test -- --runInBand which slows down testing considerably. Ideally we can use all the cores.

Additional context

https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/205/#note_712971975 - where these were discovered
238 - refactor agent spawning tests, speed of process execution is going to be important here
250 - parallelising for CI/CD

Tasks

Identify all blocking side-effects in these tests, and rank them by the most expensive to the least expensive
Find ways of optimising these side-effects, perhaps by mocking, or using a faster alternative
Consider if these particular kinds of tests should be isolated from the rest of the system when the OS is overloaded, in that case, should these tests have a much longer, or no timeout what so-ever?
Find out if these tests are interfering with each other in terms of locks, and eliminate these.
What is the load exactly? Is it CPU-load or is it IO-load that's slowing things down? If it is IO-load, is it IO ops or IO bandwidth? If IO ops, is it kernel-related functionality or otherwise?
Benchmarking might need to take place.
Consider "splitting" tests and nesting tests, to create structure where these tests won't timeout
Only primitives and POJOs can be shared between test files, within a testfile, use beforeAll to share resources like objects, but these can be conflicting with beforeAll used in other test files, so these need to be managed properly
We could use 1 testfile to synchronise relevant tests, and then test describe, also test.concurrent can be used within a single test file to do concurrent testing, but it should be compared with just Promise.all too
One thing I used to do when there used to be interference is to separate a family of tests to only be done after all the other tests were done, this can be done simply as 2 separate calls to jest.

CMCDragonkai commented 3 years ago

It's also interesting that we are hitting testing-scalability problems. We may need to simply invest in bigger computers too to be able to do the tests well.

CMCDragonkai commented 3 years ago

@joshuakarp @tegefaulkes

CMCDragonkai commented 3 years ago

We should get all tests that use the WorkerManager and possibly separate them from the main tests. So right now test/keys/KeyManager.test.ts tests with WorkerManager as well. We might want to have a separate file tests/keys/KeyManagerWorkers.test.ts instead, this can allow us to use a pattern to separately test them.

CMCDragonkai commented 3 years ago

It is no longer feasible to expect devs to always be doing full npm test -- --runInBand on every MR. It takes too long, parallel testing would help, but time out errors prevent that.

To ensure development tempo, we will need to solve this problem. The number of tests are only going to grow as the application gets larger.

We may need to do a sort of hierarchy. Invest in bigger CI/CD machines that can do full integration testing that uses many cores and many machines, while devs continue to focus on domain-specific testing. That way devs can work on weaker machines and centralise the computational effort.

CMCDragonkai commented 3 years ago

We should also identify slow tests and a proper threshold. All tests should complete sooner than 5 seconds, any tests larger than that should be looked into broken down. When a test cannot be broken down, it needs to be isolated and separated into a separate section where they are longer tests.

Certain domains are inherently "integrative". The nodes domain in particular integrates many parts of the system and this involves quite heavy tests. It would ideal to separate the code, so that utility functions can be checked independently of full nodes related testing. See how keys domain separates KeyManager and the keys/utils. @joshuakarp

CMCDragonkai commented 3 years ago

The nodes/NodeManager.test.ts is incredibly slow. One test takes 81 seconds. Why does this test take so long?

 PASS  tests/nodes/NodeManager.test.ts (157.301 s)
  NodeManager
    ✓ pings node (81434 ms)
    ✓ finds node (local) (3653 ms)
    ✓ finds node (contacts remote node) (6441 ms)
    ✓ cannot find node (contacts remote node) (28109 ms)
    ✓ knows node (true and false case) (3027 ms)
    getConnectionToNode
      ✓ creates new connection to node (5612 ms)
      ✓ gets existing connection to node (5966 ms)
      ✓ concurrent connection creation to same target results in 1 connection (5820 ms)
    Cross signing claims
      ✓ can successfully cross sign a claim (4895 ms)

joshuakarp commented 3 years ago

pings node has that crazy time mostly because it's checking the failure case too (where a node is offline). There's a few cases of this:

node is offline: this takes approximately 20 seconds for the connection attempt to timeout (and therefore, the ping to fail)
existing connection goes offline: I believe it requires approximately 30 seconds for an existing connection to be "dropped" in the networking domain (related to the keep-alive packets?)

In general though, NodeManager.test.ts is also slow because it hasn't been migrated to using only a single PolykeyAgent across the larger integration tests. This will need to be done at some stage.

CMCDragonkai commented 3 years ago

Can we separate these tests, that would enable concurrent execution.

In addition to this, jest has the ability to mock timers. https://jestjs.io/docs/timer-mocks and we should be setting the timeouts to be much shorter in our test cases compared to production.

Right now running domain level tests take too long at least for nodes.

On 27 October 2021 9:33:33 am AEDT, Josh @.***> wrote:

pings node has that crazy time mostly because it's checking the failure case too (where a node is offline). There's a few cases of this:

node is offline: this takes approximately 20 seconds for the connection attempt to timeout (and therefore, the ping to fail)

existing connection goes offline: I believe it requires approximately 30 seconds for an existing connection to be "dropped" in the networking domain (related to the keep-alive packets?)

In general though, NodeManager.test.ts is also slow because it hasn't been migrated to using only a single PolykeyAgent across the larger integration tests. This will need to be done at some stage.

-- You are receiving this because you were assigned. Reply to this email directly or view it on GitHub: https://github.com/MatrixAI/js-polykey/issues/264#issuecomment-952379158 -- Sent from my Android device with K-9 Mail. Please excuse my brevity.

joshuakarp commented 3 years ago

Note for myself, when splitting the nodes tests and making these more efficient (particularly the ping test):

further investigate the existing ForwardProxy timeouts. I looked into this before, but the injected timeouts didn't seem to work as expected (i.e. setting the timer didn't seem to adjust the timeout time for a connection attempt):

static async createForwardProxy({
    authToken,
    connConnectTime = 20000,
    connTimeoutTime = 20000,
    connPingIntervalTime = 1000,
    logger,
  }: {
    authToken: string;
    connConnectTime?: number;
    connTimeoutTime?: number;
    connPingIntervalTime?: number;
    logger?: Logger;
  }): Promise<ForwardProxy> {

investigate mocking timers with jest https://jestjs.io/blog/2020/05/05/jest-26#new-fake-timers

joshuakarp commented 3 years ago

When refactoring the tests, ideally our domain-level tests should always aim to be unit tests (i.e. minimal tests on the functionality, and that mock things that are required to be injected/used). Timer mocks should also be looked into (see above https://github.com/MatrixAI/js-polykey/issues/264#issuecomment-952419944) such that we don't have to wait for the full timeout expected in production (e.g. connection timeouts especially).

Any longer/integration tests should be separately executed, such that they aren't executed in the same test suite as these smaller unit tests. We should look into jest's tagging solution (https://www.npmjs.com/package/jest-runner-groups) that would allow us to tag via comments and run tests across all domains, whilst separately executing the longer integration tests. This way we wouldn't have to keep them in separate directories either.

joshuakarp commented 2 years ago

This would most likely be a good issue for @emmacasolin to act in a supporting role too once she's back from leave.

emmacasolin commented 2 years ago

further investigate the existing ForwardProxy timeouts. I looked into this before, but the injected timeouts didn't seem to work as expected (i.e. setting the timer didn't seem to adjust the timeout time for a connection attempt):

I think I've worked out why this wasn't working - it looks like the timeout needs to be set on both the forward proxy of the node doing the pinging AND the reverse proxy of the node being pinged. Setting the timeout on both ends to one second reduces the time the nodes ping test takes to just under a minute (a reduction of 20-30 seconds), however the need to start, stop, and restart the remote node during the test is keeping the test time high. Moving the initial setup and destruction of the remote node to before/after blocks cuts off another 10 seconds, but if we move to a different method of mocking keynodes then this additional setup time shouldn't be a problem anyway.

The field I'm setting is connTimeoutTime on both the fwd proxy of the pinging node and the rev proxy of the pinged node. connConnectTime and connPingIntervalTime don't seem to make any difference to the time the test takes to complete.

emmacasolin commented 2 years ago

Mocking the reverse proxy so that we don't even use a remote keynode brings the test time down to 10ms! The question is does doing it this way test everything we need it to? If we assume all the connection-side stuff is being tested in the network tests then maybe this is fine? It may also be possible to mock something a little further along in the call so that more of the actual functionality is tested.

```ts import type { CertificatePem, KeyPairPem } from '../src/keys/types'; import type { Host, Port } from '../src/network/types'; import type { NodeAddress } from '../src/nodes/types'; import os from 'os'; import path from 'path'; import fs from 'fs'; import Logger, { LogLevel, StreamHandler } from '@matrixai/logger'; import { DB } from '@matrixai/db'; import { KeyManager, utils as keysUtils } from '../src/keys'; import { NodeManager } from '../src/nodes'; import { ForwardProxy, ReverseProxy } from '../src/network'; import { Sigchain } from '../src/sigchain'; import { makeNodeId } from '../src/nodes/utils'; import { makeCrypto } from './utils'; import { ErrorConnectionStart } from '@/errors'; const offline = new ErrorConnectionStart(); const mockValue = jest.fn().mockRejectedValueOnce(offline).mockResolvedValue(null); jest.mock('../src/network/ForwardProxy', () => { return jest.fn().mockImplementationOnce(() => { return {openConnection: mockValue} }); }); describe('NodeManager', () => { const password = 'password'; const logger = new Logger('NodeManagerTest', LogLevel.WARN, [ new StreamHandler(), ]); let dataDir: string; let nodeManager: NodeManager; let fwdProxy: ForwardProxy; let revProxy: ReverseProxy; let keyManager: KeyManager; let keyPairPem: KeyPairPem; let certPem: CertificatePem; let db: DB; let sigchain: Sigchain; const serverHost = '::1' as Host; const serverPort = 1 as Port; const nodeId1 = makeNodeId( 'vrsc24a1er424epq77dtoveo93meij0pc8ig4uvs9jbeld78n9nl0', ); beforeAll(async () => { dataDir = await fs.promises.mkdtemp( path.join(os.tmpdir(), 'polykey-test-'), ); const keysPath = `${dataDir}/keys`; keyManager = await KeyManager.createKeyManager({ password, keysPath, logger, }); const cert = keyManager.getRootCert(); keyPairPem = keyManager.getRootKeyPairPem(); certPem = keysUtils.certToPem(cert); fwdProxy = new ForwardProxy({ authToken: 'abc', connTimeoutTime: 1000, logger: logger, }); revProxy = new ReverseProxy({ logger: logger, }); await revProxy.start({ serverHost, serverPort, tlsConfig: { keyPrivatePem: keyPairPem.privateKey, certChainPem: certPem, }, }); const dbPath = `${dataDir}/db`; db = await DB.createDB({ dbPath, logger, crypto: makeCrypto(keyManager) }); sigchain = await Sigchain.createSigchain({ keyManager, db, logger }); nodeManager = await NodeManager.createNodeManager({ db, sigchain, keyManager, fwdProxy, revProxy, logger, }); await nodeManager.start(); }); afterAll(async () => { await nodeManager.stop(); await nodeManager.destroy(); await sigchain.stop(); await sigchain.destroy(); await db.stop(); await db.destroy(); await keyManager.stop(); await keyManager.destroy(); await revProxy.stop(); await fs.promises.rm(dataDir, { force: true, recursive: true, }); }); test( 'pings node', async () => { const serverNodeId = nodeId1; let serverNodeAddress: NodeAddress = { ip: serverHost, port: serverPort, }; await nodeManager.setNode(serverNodeId, serverNodeAddress); // Check if active // Case 1: cannot establish new connection, so offline const active1 = await nodeManager.pingNode(serverNodeId); expect(active1).toBe(false); // // Case 2: can establish new connection, so online const active2 = await nodeManager.pingNode(serverNodeId); expect(active2).toBe(true); } ); }); ``` ``` GLOBAL SETUP PASS tests/timer.test.ts (5.282 s) NodeManager ✓ pings node (10 ms) Test Suites: 1 passed, 1 total Tests: 1 passed, 1 total Snapshots: 0 total Time: 5.309 s Ran all test suites matching /timer/i. GLOBAL TEARDOWN ```

joshuakarp commented 2 years ago

Mocking the reverse proxy so that we don't even use a remote keynode brings the test time down to 10ms! The question is does doing it this way test everything we need it to? If we assume all the connection-side stuff is being tested in the network tests then maybe this is fine? It may also be possible to mock something a little further along in the call so that more of the actual functionality is tested.

Will take a look at this after seed nodes.

CMCDragonkai commented 2 years ago

Tests involving a global agent has been done in tests/bin/utils.ts. This is applicable only to tests/bin atm. This means bin tests will share the same global agent where possible. Due to test worker pools, we have also ensured that all test modules serialises their access to the global agent by way of a lock. This avoids having to create lots of polykey agents.

The limiting factor is that creating a polykey agent is expensive. Mostly due to key generation process among many other factors like creating network servers... etc. If the key generation process wasn't that expensive, then we could just create new polykey agents each time we wanted to test something. Here are the most expensive things that happen when creating a polykey agent (probably in order of cost):

Generating the root keypair
Constructing the WorkerManager - multiprocessing pool
Constructing the network and grpc servers
Anything else?

Key generation process might be improved with #168.

For now though, other tests domains also make use of potentially multiple polykey agents. Rather than having all tests synchronise on one global agent. Each test domain can make their own decision here and create their "scoped" global PK agent. This could take place in these forms:

In Test Function - Just create a PK agent under the test function itself that's isolated from all other test functions and other test modules - This is SLOW!
In beforeAll - Sharing a PK agent in beforeAll for a test module across all test functions. - Better but still some what SLOW if there are lots of test modules. Note that all test functions are serialised.
Global between test modules in a test directory - Sharing a PK agent for a directory of test modules, this needs to solve the worker-pool problem. - This is much faster. But remember you must make all test modules serialised in this case, otherwise worker pool will cause problems!

For 3. the trick is to ensure that the same PK agent is shared across the jest worker pool which requires that all jest workers agree on the same "directory" for the node path.

Right now I've done this through tests/globalSetup.ts and tests/globalTeardown.ts and tests/setup.ts, however it is now creating a problem when running the jest on multiple terminals. https://github.com/MatrixAI/js-polykey/pull/292#issuecomment-992474418

There's a solution to using globals in the jest.config.ts described here: https://github.com/facebook/jest/issues/9037#issuecomment-696322001. And that would give ensure that the node path is indeed shared as well as unique to each execution of npm test or jest because jest.config.ts is itself a script.

But this also means that all relevant test domains need to be aware of which global we are using.

So the priorities are:

Make use of the jest.config.ts to create truly shared global state between all worker pools. For global agents that can be shared with method 3. And also solves the multiple-terminal problem.
Do #168, so that our key generation can be faster, and perhaps we won't need to share the pk agent so much. Thus allowing us to write tests that are more isolated.

Each domain needs to indicate their strategy here.

CMCDragonkai commented 2 years ago

Doing the above will also potentially refactor the test/utils.ts functions of:

setupRemoteKeynode
cleanupRemoteKeynode
addRemoteDetails

As these are primarily used to deal with remote keynodes. We would want to standardise on the logic we have setup for pkAgent on test/bin/utils.ts.

Things like using the test provider and ensuring that works too would be relevant as per #278.

CMCDragonkai commented 2 years ago

Note that by mocking the keypair generation creating polykey agents are alot faster. So much so that in many cases the global agent is not necessary. I've just done this in tests/PolykeyClient.test.ts.

CMCDragonkai commented 2 years ago

For sharing a global agent across a test directory, refer to the new setupGlobalAgent utility described here: https://github.com/MatrixAI/js-polykey/pull/292#issuecomment-994415690

CMCDragonkai commented 2 years ago

My tests show that with mocked global keypair starting and stopping the PK agent takes about 3 seconds.

[nix-shell:~/Projects/js-polykey]$ npm test -- tests/PolykeyAgent.test.ts 

> @matrixai/polykey@1.0.0 test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/PolykeyAgent.test.ts"

Determining test suites to run...
GLOBAL SETUP
Global Data Dir: /tmp/polykey-test-global-xltUTl
 PASS  tests/PolykeyAgent.test.ts (13.662 s)
  PolykeyAgent
    ✓ PolykeyAgent readiness (2993 ms)
    ✓ creates node path (2833 ms)

Test Suites: 1 passed, 1 total
Tests:       2 passed, 2 total
Snapshots:   0 total
Time:        13.696 s, estimated 14 s
Ran all test suites matching /tests\/PolykeyAgent.test.ts/i.
GLOBAL TEARDOWN
Destroying Global Data Dir: /tmp/polykey-test-global-xltUTl

So indeed for alot of tests it would be sufficient to create their own pk agent at least for sharing in a test module without having to share the full global agent.

CMCDragonkai commented 2 years ago

This has been achieved in a number of ways.

Faster execution of ts-node by using ts-node caching and transpile only. This is applied locally and in CI/CD caching. Note that a test that takes 9 seconds on CI/CD using pkSpawn should take about 4 seconds on local machine. Simply because our local computers are better.
Mocking the usage of global keypair, this saves a significant amount of time because deterministic keypair generation is expensive. This is applied to alot of domains that use the KeyManager.
Mocked usage of the global agent, this works but has some downsides as discussed in #308, but it's one layer up from sharing a mocked global keypair. Use this sparingly only when needed, in alot of cases, a mocked global keypair is much faster and you just start PK agents as is. This should only really be used when the alternative is to use pkSpawn. If you're able to mock keypairs and start your own agents, it's much faster to do so.
Caching is applied to CI/CD everywhere, from jest caching to npm caching to ts-node caching. But doesn't affect our own jest tests.
Tests are auto-split and run as dynamic child pipeline jobs. This is much faster in CI/CD.
Networking timeouts have been worked out and it is all working, with much more extensive tests on how the networking termination works. Nodes related behaviour still needs updates in #225.
Further test splitting is to occur with @emmacasolin in #308, since these tests are all in one directory, they won't be split up in CI/CD for now, but we can do when we need to, and this will also speed up local testing as well.
All sorts of intermittent behaviour is being resolved, sometimes it's nice to use a for loop to run a test 100 times just to guarantee that intermittent behaviour doesn't occur. But it could happen in the future.
Worker manager has core count control, but right now workers cannot be tested via the CLI in jest. So worker manager tests stick to just local domain tests, and not in tests/bin.
Lots of timeouts are being replaced with just 2 timeouts, globalThis.maxTimeout and globalThis.defaultTimeout. The other 2 timeouts should be deprecated and removed and replaced with these 2. Do not have intermediate timeouts inside the test functions, this is not a good idea, and we should expect good performance in the entire test function.
Many other fixes that was brought in from #292.

MatrixAI / Polykey