Authenticate the sender of a hole-punching signalling message

joshuakarp commented 3 years ago

Upon receiving a hole-punching message (agentService:sendHolePunchMessage()), there is no reasonable means of ensuring that the message has come from the node ID stated in the message. Verification of a provided signature requires the public key of the source node.

In the future, we plan to move to Ed25519 keys (once this is supported in node-forge). In this event, the node ID will be the public key (instead of the public key fingerprint, as we require from RSA keys being 4096 bits). As such, authentication will be trivial, as we can simply verify the node ID (as the public key) against the signature.

joshuakarp commented 2 years ago

Currently, the only way I can securely envisage solving this is:

A wants to connect to B. It sends a signed hole-punch message with the target node of B
B receives the hole-punching message, seeing it's originally from A
B attempts to form its own connection to A, such that it can retrieve the public key of A (sends its own hole-punching message)

However, this creates a chicken or egg situation. A would also presumably want to verify that B is the one wanting to connect.

CMCDragonkai commented 2 years ago

@tegefaulkes @emmacasolin is this issue still necessary? I'm thinking that #322 was solved simply because of the TLS used for connections. Now if we receive a hole punching message, all we have to do is check that the message's from is equal to the connection's node ID. In fact one might suggest that the message itself doesn't even need a from node id, simply because we would always use the node id where the message's connection came from.

tegefaulkes commented 2 years ago

I'm not sure that we provide and check the certs of the sending side. Having a quick skim of the code I can see that we verify the cert chain within src/network/ConnectionForward.ts:192 but not in src/network/ConnectionReverse.ts.

So for this to be considered done we should be verifying the cert chain of the connecting agent within ConnectionReverse.ts

CMCDragonkai commented 1 year ago

When we introduce js-quic into the PK codebase, and rework our hole punching system in the network domain to use QUIC, we must review this issue to see if it is still relevant.

This will work as long as when the signalling node only signals between the target node and the source node, and it does not rely on what the internal message details about the source node. The message should only contain the target node.

CMCDragonkai commented 1 year ago

Signalling is done via RPC. Which means the request to signal is always authenticated point to point. However between 2 nodes and a central node doing the signalling transitively. The central node may send a message to the receiving node. The receiving node does need to check that the message is in fact signed by the original node.

@tegefaulkes please confirm that there's a transitive authenticity check during the signalling request. If that's done, we can close this.

tegefaulkes commented 1 year ago

The node is verified when connecting to another node yes, so the relaying node can verify the details. But we don't have any kind of signature in the message AFAIK, so the 3rd party node can't verify the details currently.

All we need to do is add a signature to the message and have the end node verify the contents. Keep in mind that the requesting node and relaying node both add details to the message. So both need to add a signature to the relay message depending how secure we want to be about it.

At a minimum we want to prevent a node making requests for other nodes to prevent a ad-hoc botnet DDOS attack.

CMCDragonkai commented 1 year ago

Yea ok nest the original message with the relayed message and sign it. This should also prevent repeat relaying too. Sounds like we should constrain relaying attempts.

CMCDragonkai commented 1 year ago

I added this into the 6th testnet deployment as during NAT testing, we should be able to add this in.

CMCDragonkai commented 1 year ago

Requirements:

A's message to S is { target: SIGNED<{ nodeId: B, ip: IP, port: Port }>}.
A's call to S is fire and forget.
S takes A's external IP and external port and combines it with the signed message. It sends to B { source: { nodeId: A, ip: IP, port: PORT }, target }.
S's call to B is fire and forget.
B checks the authenticity of the target. It takes source.nodeId and checks if it signed the target. It checks if source.nodeId is equal to target.nodeId.

Now there are some problems here.

If A and S are collaborating. A could create a message with an IP and port that is invalid. S could use the same IP and port that is invalid. S could then send this message to B. What could happen here? This can cause B to send packets to another IP and port, which wouldn't be very useful due to MTLS, but it could be used for DOS attacks.

To avoid DOS attacks we could do 2 additional things:

A, S and B don't put ip and port in their messages at all. Instead the only thing used is nodeId and A's signature. This ensures that B would use its own address book to find A's IP address and port. This relies on the idea that the NodeGraph of B would be up to date, or can easily ask the kademlia structure what A's address is.
Rate limiting: S and B can both rate limit requests using a token bucket approach correlated to each pair of NodeIds of source and target.

Perhaps both approaches can be used?

CMCDragonkai commented 1 year ago

I think the 2 additional things will be required to reduce the impact of DOS attacks.

Rate limiting has to be applied at B so it has 4 levels of protection:

Max dialing timeout
Exponential backoff
Node Connection coalescing
Token Bucket rate limiting

On top of this, IP and port are not part of the signalling messages. B will look up A's address the normal way in order to hole punch to A.

We can test this on the testnet to see if this works well, and what tradeoffs there are.

@amydevs this whole system can be visualised at the node explorer level, at a lower level simcity-style.

CMCDragonkai commented 1 year ago

Notes on how to build a token bucket limiting system: https://chat.openai.com/share/31d16b49-d95b-4f20-8f34-feeeb68741fb.

It's actually quite similar to the Semaphore and subsequently LockBox. You start with a token bucket, then you make a collection of token buckets keyed by a string.

CMCDragonkai commented 1 year ago

I also asked how to do this with TTLs that expire each token bucket. Final design here:

A priority structure like a priority queue could be used, but as you've noted, re-ordering keys when their TTL gets updated could make operations slower.

If we want a highly efficient approach that avoids iterating over all buckets every time and re-ordering keys, we should indeed consider storing individual setTimeout timers for each token bucket. Though there will be a setTimeout for each bucket, modern JavaScript engines like V8 are quite efficient at handling a large number of timers, especially if those timers are not too frequent.

Here's how this approach can be implemented:
class TokenBucket {
    // ... (rest of the code)

    creationTimestamp: number;

    constructor(protected capacity: number, protected refillRatePerMillisecond: number, public ttl: number) {
        this.tokens = capacity;
        this.lastRefillTimestamp = performanceNow();
        this.creationTimestamp = performanceNow();
    }
}

class RateLimiter {
    private tokenBuckets: Map<string, TokenBucket>;
    private expirationTimers: Map<string, ReturnType<typeof setTimeout>>;

    constructor(private defaultTTL: number = 60000) {  // default TTL 1 minute
        this.tokenBuckets = new Map();
        this.expirationTimers = new Map();
    }

    private scheduleExpiration(key: string, ttl: number) {
        const timer = setTimeout(() => {
            this.tokenBuckets.delete(key);
            this.expirationTimers.delete(key);
        }, ttl);

        this.expirationTimers.set(key, timer);
    }

    getBucket(key: string, ttl?: number): TokenBucket {
        let bucket = this.tokenBuckets.get(key);
        if (!bucket) {
            bucket = new TokenBucket(10000, 2, ttl || this.defaultTTL);  // Scaled values for example
            this.tokenBuckets.set(key, bucket);
            this.scheduleExpiration(key, bucket.ttl);
        }
        return bucket;
    }

    consume(key: string, tokensToConsume: number = 10000): boolean { // Scaled default value for example
        return this.getBucket(key).consume(tokensToConsume);
    }
}
In this version:

Each TokenBucket gets its expiration timer via scheduleExpiration.

There's a separate map, expirationTimers, that keeps track of each bucket's expiration timer.

When a timer fires, it will remove the bucket and its associated timer from the respective maps.

This approach should be quite efficient, even with a large number of buckets. However, if you expect to have a very large number of buckets, it might be worth testing this against the other approach to ensure it meets your specific performance and memory requirements.

CMCDragonkai commented 1 year ago

Could be done as a js-rate-limiter.

tegefaulkes commented 1 year ago

I've been thinking about this some more. If the relaying node is a bad actor then it can make anything up. So there's no point doing any authentication since the relay can just lie about it. Any signed message can be fabricated and challenge-response can be lied about as well. Basically, a relay node can pretend to be a relay for any number of made up requesting nodes. So any kind of authentication or message signing is pointless here.

Right now the request-ee node only does a single punch attempt at a time to a node. Right now it's keyed on ip:port tuple but that still allows multiple attempts to an address if you just change the port number. So I think we need to rate limit on the IP address as well.

Any rate limiting beyond that needs to work in a distributed way, and I'm not sure there's an easy solution to that? I don't think the token bucket addresses that.

The only thing I can think of that could works is make a single request mildly expensive to make with proof of work. So long as it's expensive to make a lot of requests to hole punch a single target then it should be fine. So a request can include a mildy expensive to generate nonce with it. This may have to be generated as a challenge so there needs to be some back a forth between the two nodes through the relay. Or, the punching node can track recently used Nonces and reject any replays.

In any case, we can think on this a little more. I don't want to implement anything we're not sure on and the authentication is not critical to testing and #551.

CMCDragonkai commented 1 year ago

The signature is from the original node. The bad actor relay node cannot make up the signature. The original request must always be from the original node and signed by the original node... Of course additional information is provided by the relay node and that is signed by the relay node.

CMCDragonkai commented 1 year ago

So to prevent amplification we are going to just use a single global token bucket for all signalling requests for nodes. This avoids the complexity of deciding to rate limit per connecting node, or per receiving node. So it just prevents overall amplification of any given signalling node.

General denial of service can only be done with proof of work. We can have a separate issue for PoW later in the future.

We still need a single TokenBucket, but prepare for potentially more than 1, by creating both the TokenBucket and RateLimiter classes, put it into the src/utils. It should be easily moved out to js-rate-limiter.

tegefaulkes commented 1 year ago

Ok, I've implemented a rate limiting system for signalling requests. Currently its a limit per requesting node to avoid denying signalling for other nodes. I'm all the config for rate limiting is hard coded right now. It's limited to 10 relay requests per second. This my need tuning.

tegefaulkes commented 1 year ago

Later we'll need to apply simple proof of work to really limit requests in a decentralised way. I'll make a new issue for this.

tegefaulkes commented 1 year ago

New issue at #556 for proof of work limiting.

CMCDragonkai commented 11 months ago

The signature is from the original node. The bad actor relay node cannot make up the signature. The original request must always be from the original node and signed by the original node... Of course additional information is provided by the relay node and that is signed by the relay node.

I think this should still be done as a matter of principle. It's also good for auditing.

CMCDragonkai commented 11 months ago

Further discussion about robust rate limiting will be in #556.

MatrixAI / Polykey

Authenticate the sender of a hole-punching signalling message #148