MatrixAI / js-quic

QUIC Networking for TypeScript & JavaScript
https://matrixai.github.io/js-quic/
Apache License 2.0

QUIC Connection local TLS error - Peer closed with transport code 306 #98

Closed gherkins closed 3 months ago

gherkins commented 6 months ago

Describe the bug

I'm trying to connect to Solana TPU leaders via the QUIC address string from the ContactInfo,

tpuQuic: '18.132.XXX.XX:8009',

but consistently get:

QUIC Connection local TLS error - Peer closed with transport code 306

This happens with every one of the addresses from the cluster. I just found error code 306 here: https://github.com/MatrixAI/js-quic/blob/94f38390a3f667a829460c330157fbcc5a27a0c1/src/native/types.ts#L239

which made me think I did the randomBytes part wrong, but I don't really see how... I also tried using peculiarWebcrypto as in the benchmarks...

To Reproduce


import {QUICClient} from "@matrixai/quic";
import {getRandomValues} from "node:crypto";

const client = await QUICClient.createQUICClient({
        host: HOST_IP,
        port: parseInt(PORT),
        config: {
            verifyPeer: false,
        },
        crypto: {
            ops: {
                randomBytes: async (data: ArrayBuffer) => {
                    getRandomValues(new Uint8Array(data));
                },
            },
        },
    }
)

// -> ERROR BEING THROWN HERE

const clientStream = client.connection.newStream();
const writer = clientStream.writable.getWriter()
await writer.write(BUFFER);
await writer.close();

Expected behavior

established connection w/o errors

Platform (please complete the following information)

macOS 14.4.1 node v20.11.1 ts-node v10.9.2

linear[bot] commented 6 months ago

ENG-287 QUIC Connection local TLS error - Peer closed with transport code 306

CMCDragonkai commented 6 months ago

What QUIC stack is Solana TPU using? This library is based on Cloudflare's quiche and it follows a particular bootstrapping process. The best way to debug this is to also build and run the Solana TPU side, so you can see why it is closing with a 306 error.

On the other hand, we have tests/utils.ts. You can see the randomBytes utility function there; we use that for our testing, as we prefer fewer Node-isms.

gherkins commented 6 months ago

Thank you, I tried the randomBytes implementation as in test/utils.ts, but that seems not to change much.


import {QUICClient} from "@matrixai/quic";
import * as peculiarWebcrypto from '@peculiar/webcrypto';

const webcrypto = new peculiarWebcrypto.Crypto();

async function randomBytes(data: ArrayBuffer) {
    webcrypto.getRandomValues(new Uint8Array(data));
}

const client = await QUICClient.createQUICClient({
  host: HOST_IP,
  port: parseInt(PORT),
  config: {
      verifyPeer: false,
  },
  crypto: {
      ops: {
          randomBytes
      },
  },
})

const clientStream = client.connection.newStream();
const writer = clientStream.writable.getWriter()
await writer.write(BUFFER);
await writer.close();

What QUIC stack is Solana TPU using? This library is based on Cloudflare's quiche and it follows a particular bootstrapping process. The best way to debug this is to also build and run the Solana TPU side, so you can see why it is closing with a 306 error.

I don't have enough in-depth understanding of the TPU server yet. So I just assumed that QUIC communication would be rather universal?

Running the TPU server is unfortunately beyond my means for the moment, but I had that call wrapped in a try/catch block which truncated the full error message, which is:

ErrorQUICConnectionPeerTLS: Peer closed with transport code 306
    at constructor_.send ([...]/node_modules/@matrixai/quic/src/QUICConnection.ts:947:23)
    at constructor_.send ([...]/node_modules/@matrixai/async-init/src/StartStop.ts:174:20)
    at [...]/node_modules/@matrixai/quic/src/QUICConnection.ts:833:18
    at [...]/node_modules/@matrixai/async-locks/src/Lock.ts:57:63
    at withF ([...]/node_modules/@matrixai/resources/src/utils.ts:24:18)
    at async constructor_.recv ([...]/node_modules/@matrixai/quic/src/QUICConnection.ts:749:5)
    at async Socket.handleSocketMessage ([...]/node_modules/@matrixai/quic/src/QUICSocket.ts:119:7) {
  data: {
    isApp: false,
    errorCode: 306,
    reason: Uint8Array(50) [
      114, 101,  99, 101, 105, 118, 101, 100,  32,
       99, 111, 114, 114, 117, 112, 116,  32, 109,
      101, 115, 115,  97, 103, 101,  32, 111, 102,
       32, 116, 121, 112, 101,  32,  73, 110, 118,
       97, 108, 105, 100,  83, 101, 114, 118, 101,
      114,  78,  97, 109, 101
    ]
  },
  cause: undefined,
  timestamp: 2024-04-09T07:03:42.048Z
}

Cheers

tegefaulkes commented 6 months ago

The reason message is received corrupt message of type InvalidServerName. Maybe the server is expecting a client certificate?
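For anyone wanting to reproduce that reading, the reason bytes from the error above can be decoded as text. This is a minimal sketch, assuming the reason phrase is plain UTF-8; decodeReason is a hypothetical helper name and not part of js-quic.

```typescript
// The `reason` bytes copied verbatim from the error's `data` field above.
const reason = new Uint8Array([
  114, 101, 99, 101, 105, 118, 101, 100, 32,
  99, 111, 114, 114, 117, 112, 116, 32, 109,
  101, 115, 115, 97, 103, 101, 32, 111, 102,
  32, 116, 121, 112, 101, 32, 73, 110, 118,
  97, 108, 105, 100, 83, 101, 114, 118, 101,
  114, 78, 97, 109, 101,
]);

// Hypothetical helper: decode a connection-close reason phrase as UTF-8 text.
function decodeReason(bytes: Uint8Array): string {
  return new TextDecoder('utf-8').decode(bytes);
}

const reasonText = decodeReason(reason);
console.log(reasonText);
// → received corrupt message of type InvalidServerName
```

Logging the decoded string next to errorCode makes these peer-closed errors much easier to read than the raw Uint8Array.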

You can provide a key and certificate as part of the QUICConfig when starting the client.

/**
   * Private key as a PEM string or Uint8Array buffer containing PEM formatted
   * key. You can pass multiple keys. The number of keys must match the number
   * of certs. Each key must be associated to the corresponding cert chain.
   *
   * Currently multiple key and certificate chains is not supported.
   */
  key?: string | Array<string> | Uint8Array | Array<Uint8Array>;

  /**
   * X.509 certificate chain in PEM format or Uint8Array buffer containing
   * PEM formatted certificate chain. Each string or Uint8Array is a
   * certificate chain in subject to issuer order. Multiple certificate chains
   * can be passed. The number of certificate chains must match the number of
   * keys. Each certificate chain must be associated to the corresponding key.
   *
   * Currently multiple key and certificate chains is not supported.
   */
  cert?: string | Array<string> | Uint8Array | Array<Uint8Array>;

Look at a QUICServer example for how to do this.

gherkins commented 6 months ago

You can provide a key and certificate as part of the QUICConfig when starting the client.

Unfortunately that does not change anything, connection seems to be established, then fails.

const tlsConfig = await generateTLSConfig('RSA');
const client = await QUICClient.createQUICClient({
        host: tpu_address.split(':')[0],
        port: parseInt(tpu_address.split(':')[1]),
        config: {
            key: tlsConfig.leafKeyPairPEM.privateKey,
            cert: tlsConfig.leafCertPEM,
        },
        crypto: {
            ops: {
                randomBytes
            },
        },
    }
)
INFO:QUICClient:Create QUICClient to 141.98.216.83:8009
INFO:QUICSocket:Start QUICSocket on [::]:0
INFO:QUICSocket:Started QUICSocket on [::]:64177
INFO:QUICConnection d13269a7a249aac1a6efc3f2a44e6e433491c161:Connect QUICConnection
INFO:QUICConnection d13269a7a249aac1a6efc3f2a44e6e433491c161:Start QUICConnection
INFO:QUICClient:ErrorQUICConnectionPeerTLS: QUIC Connection local TLS error - Peer closed with transport code 306
INFO:QUICConnection d13269a7a249aac1a6efc3f2a44e6e433491c161:ErrorQUICConnectionPeerTLS: QUIC Connection local TLS error - Peer closed with transport code 306
INFO:QUICConnection d13269a7a249aac1a6efc3f2a44e6e433491c161:ErrorQUICConnectionPeerTLS: QUIC Connection local TLS error - Peer closed with transport code 306
INFO:QUICSocket:Stop QUICSocket on [::]:64177
INFO:QUICSocket:Stopped QUICSocket on [::]:64177
INFO:QUICClient:Destroy QUICClient
INFO:QUICClient:Destroyed QUICClient
ErrorQUICConnectionPeerTLS: Peer closed with transport code 306
    at constructor_.send ([...]/node_modules/@matrixai/quic/src/QUICConnection.ts:947:23)
    at constructor_.send ([...]/node_modules/@matrixai/async-init/src/StartStop.ts:174:20)
    at [...]/node_modules/@matrixai/quic/src/QUICConnection.ts:833:18
    at [...]/node_modules/@matrixai/async-locks/src/Lock.ts:57:63
    at withF ([...]/node_modules/@matrixai/resources/src/utils.ts:24:18)
    at async constructor_.recv ([...]/node_modules/@matrixai/quic/src/QUICConnection.ts:749:5)
    at async Socket.handleSocketMessage ([...]/node_modules/@matrixai/quic/src/QUICSocket.ts:119:7) {
  data: {
    isApp: false,
    errorCode: 306,
    reason: Uint8Array(50) [
      114, 101,  99, 101, 105, 118, 101, 100,  32,
       99, 111, 114, 114, 117, 112, 116,  32, 109,
      101, 115, 115,  97, 103, 101,  32, 111, 102,
       32, 116, 121, 112, 101,  32,  73, 110, 118,
       97, 108, 105, 100,  83, 101, 114, 118, 101,
      114,  78,  97, 109, 101
    ]
  },
  cause: undefined,
  timestamp: 2024-04-09T07:54:08.325Z
}

CMCDragonkai commented 6 months ago

I mean you have to provide the appropriate client certificate - not just any certificate.

tegefaulkes commented 6 months ago

Based on what I can tell, this isn't a problem with the protocol. With QUIC the connection is established before TLS handshaking completes. So if the server rejects the TLS for whatever reason the client will 'establish' and then close with a code and message like you demonstrated. In this case it's a 306 error indicating DecodeError with the message received corrupt message of type InvalidServerName. This means that the server is rejecting the connection because of some requirement it has about the server name.
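As a rough aid for reading these codes: QUIC reserves the transport error range 0x0100 to 0x01ff for TLS alerts (RFC 9001, section 4.8), so the TLS alert number is the code minus 0x0100. The sketch below is only an illustration of that arithmetic; describeCryptoError and the partial alert table are hypothetical and not part of js-quic.

```typescript
// QUIC carries TLS alerts as transport error codes 0x0100 + alert (RFC 9001, section 4.8).
const CRYPTO_ERROR_BASE = 0x0100;

// A few TLS alert descriptions from RFC 8446; deliberately not exhaustive.
const TLS_ALERTS: Record<number, string> = {
  40: 'handshake_failure',
  42: 'bad_certificate',
  48: 'unknown_ca',
  50: 'decode_error',
  112: 'unrecognized_name',
  116: 'certificate_required',
  120: 'no_application_protocol',
};

// Hypothetical helper: name the TLS alert behind a QUIC transport error code,
// or return undefined when the code is outside the crypto error range.
function describeCryptoError(code: number): string | undefined {
  if (code < CRYPTO_ERROR_BASE || code > CRYPTO_ERROR_BASE + 0xff) {
    return undefined;
  }
  const alert = code - CRYPTO_ERROR_BASE;
  return TLS_ALERTS[alert] ?? `tls_alert_${alert}`;
}

console.log(describeCryptoError(306)); // → decode_error
```

So 306 is 0x0100 + 50, which matches the DecodeError reading above; other codes in the same range can be looked up the same way.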

So it's important to note, whatever the problem is, it's the server taking issue with the connection, likely due to the certificate. I can't really find any information about how it uses QUIC while doing some quick research so I can't really comment on what exactly it could be.

gherkins commented 6 months ago

Thank you very much for your help! I'm also struggling to find some information about what would be an appropriate client certificate in that context. I thought providing any certificate might change the error message, but that didn't do much.

I will post an update here, if I find some solution. Closing for now - thanks again!

tegefaulkes commented 6 months ago

Happy to help.

CMCDragonkai commented 6 months ago

@gherkins the appropriate client certificate would depend on what your target Solana TPU node expects. TLS certificates are supposed to be signed by an authority. Perhaps there is an authority that you need to get your certificate signed by in order for the target server to accept your credentials? This is a core part of MTLS connections and represents a sort of end-to-end identity check. Client checks server identity, server checks client identity. You'd have to ask about this wherever Solana TPUs are discussed.

gherkins commented 6 months ago

hm, seems like in the rust implementation it's done via x509 certificates 🤔

https://github.com/solana-labs/solana/blob/27eff8408b7223bb3c4ab70523f8a8dca3ca6645/quic-client/tests/quic_client.rs#L285

The TPU node is just one of those retrieved via connection.getClusterNodes() ContactInfo has an undocumented (in the ts library) property tpuQuic, which is just HOST-IP:PORT

https://github.com/solana-labs/solana-web3.js/blob/a0aa8a6f9fcc237c8b014cea7bd9d9616da608f9/packages/library-legacy/src/connection.ts#L705
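Since tpuQuic is just a HOST-IP:PORT string, splitting it into the host and port that createQUICClient expects could look roughly like this. It is only a sketch: parseTpuQuic is a hypothetical helper, and it assumes an IPv4 address or hostname (bracketed IPv6 literals are not handled).

```typescript
// Hypothetical helper: split a "HOST:PORT" string such as ContactInfo.tpuQuic.
// Assumes an IPv4 address or hostname; bracketed IPv6 literals are not handled.
function parseTpuQuic(tpuQuic: string): { host: string; port: number } {
  const idx = tpuQuic.lastIndexOf(':');
  if (idx <= 0) {
    throw new Error(`Expected HOST:PORT, got "${tpuQuic}"`);
  }
  const host = tpuQuic.slice(0, idx);
  const portText = tpuQuic.slice(idx + 1);
  // Reject anything that is not a plain decimal number before parsing.
  if (!/^\d+$/.test(portText)) {
    throw new Error(`Invalid port in "${tpuQuic}"`);
  }
  const port = Number.parseInt(portText, 10);
  if (port < 1 || port > 65535) {
    throw new Error(`Port out of range in "${tpuQuic}"`);
  }
  return { host, port };
}

const target = parseTpuQuic('141.98.216.83:8009');
console.log(target.host, target.port);
```

Validating the port up front avoids passing NaN to the client when a node reports no tpuQuic address.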

I guess I'll try a self signed x509 certificate next and see if this changes the error message 🤷‍♂️

CMCDragonkai commented 6 months ago

There may actually be a signed authority required for all client certificates, best to ask someone on the solana team if that's required. These TPU nodes are not "public" nodes are they? (As in intended for just anybody to connect to).

lmvdz commented 6 months ago

Hi all,

Did some digging and found something that might be useful.

In the solana quic-client when the connection is made, there is a server_name string "connect"

    /// Connect to a remote endpoint
    ///
    /// `server_name` must be covered by the certificate presented by the server. This prevents a
    /// connection from being intercepted by an attacker with a valid certificate for some other
    /// server.
    ///
    /// May fail immediately due to configuration errors, or in the future if the connection could
    /// not be established.

I think this is related to the InvalidServerName

The reason message is received corrupt message of type InvalidServerName. Maybe the server is expecting a client certificate?

https://docs.rs/quinn/latest/src/quinn/endpoint.rs.html#161
https://docs.rs/solana-quic-client/1.18.11/src/solana_quic_client/nonblocking/quic_client.rs.html#182

The server_name is being used here https://docs.rs/quinn-proto/0.10.6/src/quinn_proto/endpoint.rs.html#419

The EndpointConfig::default() config which is being used here as config.client.start_session()

has a reset_key length of 64 HMAC_SHA256

Hope this helps @gherkins excited to see this work in the tpu-client 👍

gherkins commented 6 months ago

Cheers @lmvdz, that looks interesting indeed.

Although I'm more sold on the idea that it's about a valid client certificate atm. Mainly because I played around with the rust implementation, where you can basically just do something like this:


    // endpoint being the interesting part here
    let endpoint = Arc::new(QuicLazyInitializedEndpoint::default());

    // as server_addr is really just the address, without any certificate handling,
    // i guess that's rather done internally...
    let server_addr = SocketAddr::new(
        IpAddr::V4(
            Ipv4Addr::new(
                127,
                127,
                127,
                0
            )
        ), 
        8888
    );

    //seems to be needed, really just a structure to hold some values about failed/succeeded 
    let connection_stats = Arc::new(ConnectionCacheStats::default());
    let client_connection = QuicClientConnection::new(endpoint, server_addr, connection_stats);

    // as this does send data to the server...
    match client_connection.send_data_async(buffer) {
            Ok(res) => dbg!(res)  ,
            Err(error) => panic!("error sending data: {:?}", error.unwrap()),
    };

Now QuicLazyInitializedEndpoint::default() seems to just yield some kind of anonymous self-signed client cert, doesn't it?

impl Default for QuicLazyInitializedEndpoint {
    fn default() -> Self {
        let (cert, priv_key) =
            new_self_signed_tls_certificate(&Keypair::new(), IpAddr::V4(Ipv4Addr::UNSPECIFIED))
                .expect("Failed to create QUIC client certificate");
        Self::new(
            Arc::new(QuicClientCertificate {
                certificate: cert,
                key: priv_key,
            }),
            None,
        )
    }
}

My best guess would be, that we need to emulate just that in the JS implementation as js-quic does take those arguments for the client side, too (cert & key).

So maybe that attempt here https://github.com/MatrixAI/js-quic/issues/98#issuecomment-2044402927 wasn't that far off.. I'll try some more certificate variations, when I have some time on my hands.

CMCDragonkai commented 6 months ago

The QUICConnection does support passing in serverName parameter - it's propagated to quiche's connect https://docs.quic.tech/quiche/fn.connect.html.

Do note that we have a special verifyCallback option that overrides the native TLS check - this is used for our purposes in Polykey, as we require a custom TLS verification procedure - compared to standard MTLS or HTTPS based connections. If Solana is not using a custom TLS procedure, you should not be using the verifyCallback option.

But my question still stands. These nodes you are connecting to: are they meant to be "public" nodes? The need for client authentication usually indicates that they are not "public" nodes, as that implies the need for an authority to sign client certificates in some way. If they are public nodes, they would not bother to verify client certificates. If this is true, then you need to ask permission from whoever "owns" those nodes you are connecting to, so they can sign your certificate. This is basically what MTLS is intended to do.

gherkins commented 6 months ago

Hey, yes - so from my understanding those nodes are totally meant to be publicly available. They're exposed by the RPC connection via getClusterNodes and it's encouraged in the official docs to send transactions there directly: https://solana.com/de/docs/core/transactions/retry#the-journey-of-a-transaction

The problem seemed to be only that node/js doesn't have an out-of-the-box QUIC implementation as Rust does.

I would therefore also assume that you would not need client authentication, but the problem seems to be somewhat certificate related anyway 🤷‍♂️

CMCDragonkai commented 6 months ago

Can you do a sanity check on connecting to those nodes - by attempting an establishment with quiche directly then? Please note the bootstrapping protocol - but they have rust samples.

If they are intended to be public nodes I do not understand why they would require client certificates. Unless there's a special reason. Which should be explicit in their documentation.

CMCDragonkai commented 6 months ago

Actually... What is the valid server name? Is it the hostname or something else? When you connect sometimes this is important.

lmvdz commented 6 months ago

It’s hardcoded as “server”

On Wed, Apr 24, 2024 at 6:46 PM Roger Qiu @.***> wrote:

Actually... What is the valid server name? Is it the hostname or something else? When you connect sometimes this is important.


thealexcons commented 6 months ago

Taken from this useful post

TLS connections on the web would typically also use this X.509 certificate to associate an external identity, like a domain name (e.g. forum.solana.com), as well as a signature chain vouching for the certificate’s validity. Solana validators, however, are inherently identified by their identity public key. There is no need to associate this key with external information. Consequently, there is no need for these X.509 certificates any signature chain nor any other pieces of data other than the public key itself. Notably, validators also have the ability to treat peers as “anonymous” and ignore their identity. This works because the message content is often authenticated by itself, regardless who is the sender. (Such as a gossip message)

It seems like current validator nodes just use self-signed dummy x509 certificates, so I am not sure why this code doesn't work...

You can provide a key and certificate as part of the QUICConfig when starting the client.

Unfortunately that does not change anything, connection seems to be established, then fails.

const tlsConfig = await generateTLSConfig('RSA');
const client = await QUICClient.createQUICClient({
        host: tpu_address.split(':')[0],
        port: parseInt(tpu_address.split(':')[1]),
        config: {
            key: tlsConfig.leafKeyPairPEM.privateKey,
            cert: tlsConfig.leafCertPEM,
        },
        crypto: {
            ops: {
                randomBytes
            },
        },
    }
)

thealexcons commented 6 months ago

Also, see this from a non-official project which connects to the solana validators:

/// takes a validator identity and creates a new QUIC client; appears as staked peer to TPU
// note: ATM the provided identity might or might not be a valid validator keypair
async fn new_endpoint_with_validator_identity(validator_identity: ValidatorIdentity) -> Endpoint {
    info!(
        "Setup TPU Quic stable connection with validator identity {} ...",
        validator_identity
    );
    // the counterpart of this function is get_remote_pubkey+get_pubkey_from_tls_certificate
    let (certificate, key) = new_self_signed_tls_certificate(
        &validator_identity.get_keypair_for_tls(),
        IpAddr::V4(Ipv4Addr::new(0, 0, 0, 0)),
    )
    .expect("Failed to initialize QUIC connection certificates");

    create_tpu_client_endpoint(certificate, key)
}

It generates a self-signed TLS certificate using the new_self_signed_tls_certificate() function (source code) with the validator_identity, which is just a keypair (i.e. it doesn't matter if it's a staked or unstaked node's keypair, so anyone should be able to connect).

tegefaulkes commented 6 months ago

The impression I'm getting is that the serverName parameter that quiche connect takes needs to be set to something different.
It was mentioned it was hard coded to server? Well internally the QUICClient sets it to the provided host parameter.

As a sanity check you can try monkey-patching the QUICClient code to set the serverName to server to see if that works?

gherkins commented 6 months ago

hm, just setting serverName in https://github.com/MatrixAI/js-quic/blob/staging/src/QUICClient.ts#L189 to "server" as a hardcoded string, produces another error code 376 ("peer doesn't support any known protocol")

Changing it to any random string, on the other hand, keeps producing the original error code (306 etc)

tegefaulkes commented 6 months ago

AH!, that's progress. It means we've moved on to a new problem. So it seems that setting serverName to server works. peer doesn't support any known protocol should mean that you didn't include the expected protocol in the config. https://github.com/MatrixAI/js-quic/blob/94f38390a3f667a829460c330157fbcc5a27a0c1/src/types.ts#L295-L305 You'll need to include at least 1 common protocol that the server supports. Otherwise the connection will be rejected.

Looking at the source code, the protocol is set to pub const ALPN_TPU_PROTOCOL_ID: &[u8] = b"solana-tpu"; https://docs.rs/solana-streamer/latest/src/solana_streamer/nonblocking/quic.rs.html#63. You'll need to add this to the applicationProtos array in the config for the client.
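A rough sketch of what that config fragment could look like is below. The TpuClientConfig interface and buildTpuClientConfig are hypothetical names that model only the one field discussed here; the real QUICConfig type in js-quic has many more options.

```typescript
// ALPN identifier quoted above from solana-streamer (ALPN_TPU_PROTOCOL_ID).
const ALPN_TPU_PROTOCOL_ID = 'solana-tpu';

// Hypothetical shape: only the config field discussed in this thread.
interface TpuClientConfig {
  applicationProtos: string[];
}

// Hypothetical helper: build the partial config that advertises the TPU ALPN.
function buildTpuClientConfig(): TpuClientConfig {
  return { applicationProtos: [ALPN_TPU_PROTOCOL_ID] };
}

const tpuConfig = buildTpuClientConfig();
// Intended to be passed as the `config` option of QUICClient.createQUICClient(...).
console.log(tpuConfig);
```

The client and server need at least one ALPN string in common, so the value has to match the server's constant byte for byte.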

thealexcons commented 6 months ago

@tegefaulkes I tried adding "solana-tpu" in the applicationProtos array and I am now getting a new error (304): Failed connection due to native TLS verification

Note that I am generating my own key-certificate pair as so:

import { generateKeyPairSync } from "crypto";

const keypair = generateKeyPairSync('ed25519',  {
    privateKeyEncoding: { format: 'pem', type: 'pkcs8' }, 
    publicKeyEncoding: { format: 'pem', type: 'spki' }
});

And passing that into the config of the quic client:

config: {
    key: keypair.privateKey,
    cert: keypair.publicKey,
    applicationProtos: ["solana-tpu"]
}

The serverName is also hardcoded to be server, as discussed previously.

tegefaulkes commented 6 months ago

More progress, this one is harder to say though. Can you give the full error as it's printed out?

thealexcons commented 6 months ago

Seems like a problem with the certificate/key pair

Failed to send transaction to TPU ErrorQUICConnectionLocalTLS: Failed connection due to native TLS verification
    at /home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICConnection.ts:791:18
    at /home/alex/projects/tpu-client/node_modules/@matrixai/async-locks/src/Lock.ts:57:63
    ... 2 lines matching cause stack trace ...
    at async Socket.handleSocketMessage (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICSocket.ts:119:7) {
  data: { isApp: false, errorCode: 304, reason: Uint8Array(0) [] },
  cause: Error: TlsFail
      at /home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICConnection.ts:765:19
      at /home/alex/projects/tpu-client/node_modules/@matrixai/async-locks/src/Lock.ts:57:63
      at withF (/home/alex/projects/tpu-client/node_modules/@matrixai/resources/src/utils.ts:24:18)
      at async constructor_.recv (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICConnection.ts:749:5)
      at async Socket.handleSocketMessage (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICSocket.ts:119:7) {
    code: 'GenericFailure'
  },
  timestamp: 2024-04-29T21:01:34.150Z
}

tegefaulkes commented 6 months ago

Oh I see, it's the client failing the server's certificate now. That will happen after the connection has been established. QUIC is just annoying like that.

Good news, there are two options here.

  1. It's likely that the server's certificate is self signed or signed outside of the usual authority chain. I can't really confirm that from this error. But if you know the server's certificate or what it is signed by then you can use that certificate as the CA in the quic config. https://github.com/MatrixAI/js-quic/blob/94f38390a3f667a829460c330157fbcc5a27a0c1/src/types.ts#L104-L116
  2. Override how the client verifies the server's certificates with the following options. https://github.com/MatrixAI/js-quic/blob/94f38390a3f667a829460c330157fbcc5a27a0c1/src/types.ts#L155-L170. We have a very similar use case in Polykey where certificates are self signed and we verify them based on their NodeId. You can see an example of it here https://github.com/MatrixAI/Polykey/blob/79ee0888a82fbbd6898a8fcda50aa80d33c54c2a/src/nodes/NodeConnection.ts#L242-L257. I think the verifyCallback will return a Promise<undefined> if verification should succeed.

gherkins commented 6 months ago

running with

config: {
    applicationProtos: ['solana-tpu'],
    verifyPeer: false,
},

now gives me

INFO:QUICClient:ErrorQUICConnectionPeer: QUIC Connection peer error - Peer closed with application code 1 ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message

not sure if that's progress... 🤔

thealexcons commented 6 months ago

Yeah, I am seeing the same error as @gherkins. Here's the full error log:

Failed to send transaction to TPU ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message
    at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICStream.ts:298:17)
    at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/events/src/Evented.ts:55:9)
    ... 4 lines matching cause stack trace ...
    at /home/alex/projects/tpu-client/src/index.ts:324:64 {
  data: {},
  cause: Error: StreamLimit
      at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICStream.ts:291:25)
      at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/events/src/Evented.ts:55:9)
      at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/async-init/src/CreateDestroy.ts:49:26)
      at Function.createQUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICStream.ts:76:20)
      at constructor_.newStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICConnection.ts:1160:35)
      at constructor_.newStream (/home/alex/projects/tpu-client/node_modules/@matrixai/async-init/src/StartStop.ts:244:18)
      at /home/alex/projects/tpu-client/src/index.ts:324:64 {
    code: 'GenericFailure'
  },
  timestamp: 2024-04-30T18:36:31.390Z
}

tegefaulkes commented 6 months ago

We seem to be running into the stream limit now, this defaults to 100. Try modifying some of the stream parameters in the config to allow for more streams.

https://github.com/MatrixAI/js-quic/blob/94f38390a3f667a829460c330157fbcc5a27a0c1/src/types.ts#L223-L274

Otherwise it seems the connection is being made just fine now and streams are being created. So things are mostly working now.

thealexcons commented 6 months ago

No luck changing these values, I still get the same error. But yes, you are right, I can see logs stating that the connection has started.

lmvdz commented 6 months ago

Shot in the dark here...

"Solana uses QUIC’s option to send a “challenge packet” to verify IP addresses. The whole point of this challenge is to avoid the certificate verification on the first step of the handshake, instead of doing it on the second part of the handshake after IP validation."

thealexcons commented 6 months ago

Interesting... Based on @lmvdz's comment, I set the enableEarlyData option in the config to true, and I now get a new error:

INFO:QUICConnection 8dc351f0e387b93834e48bc8b64f8e33839f8f9a:ErrorQUICConnectionPeer: QUIC Connection peer error - Peer closed with transport code 2
INFO:QUICConnection 1fec30934c178b361ac23f04660d2f5a139a579e:Started QUICConnection
INFO:QUICClient:Created QUICClient to [::ffff:162.19.43.7]:25009
INFO:QUICStream 0:Create QUICStream
Failed to send transaction to TPU ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message
    at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICStream.ts:298:17)
    at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/events/src/Evented.ts:55:9)
    ... 4 lines matching cause stack trace ...
    at /home/alex/projects/tpu-client/src/index.ts:331:64 {
  data: {},
  cause: Error: StreamLimit
      at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICStream.ts:291:25)
      at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/events/src/Evented.ts:55:9)
      at new QUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/async-init/src/CreateDestroy.ts:49:26)
      at Function.createQUICStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICStream.ts:76:20)
      at constructor_.newStream (/home/alex/projects/tpu-client/node_modules/@matrixai/quic/src/QUICConnection.ts:1160:35)
      at constructor_.newStream (/home/alex/projects/tpu-client/node_modules/@matrixai/async-init/src/StartStop.ts:244:18)
      at /home/alex/projects/tpu-client/src/index.ts:331:64 {
    code: 'GenericFailure'
  },
  timestamp: 2024-05-01T17:59:04.121Z
}

Still a stream limit error though, despite the values being quite big. Is this enableEarlyData referring to this initial "challenge packet"?

tegefaulkes commented 6 months ago

have you set initialMaxStreamsBidi and initialMaxStreamsUni to a higher value?

INFO:QUICConnection 8dc351f0e387b93834e48bc8b64f8e33839f8f9a:ErrorQUICConnectionPeer: QUIC Connection peer error - Peer closed with transport code 2 seems to suggest that there's an issue with the peer. And the fact we're getting that just before the connection has started is very weird.

As for the https://github.com/MatrixAI/js-quic/issues/98#issuecomment-2088793895 comment, I'm not really sure what that is about. It may refer to the usual initial handshake procedure, where the server receives a connection and replies with a challenge that the client uses moving forward. We do something like that. I'm a bit fuzzy on the details so I can't go into it right now.

Actually, looking deeper at the error log @thealexcons posted, that stream limit error is being thrown on the first stream being created, which is a far cry from the default limit of 100 streams. Something very weird is happening here.

tegefaulkes commented 6 months ago

Yeah, sorry. All I can really say is that the StreamLimit error is specifically an error coming out of quiche when trying to start a new stream. It's happening on the first stream being created. I can only assume it's a problem with the config in some way, but none of the defaults should cause this. Frustratingly, the Rust docs for quiche are a little vague about how some things work and when errors are thrown. So the StreamLimit error might mean a few things I'm not aware of.

Keep in mind, at this stage there is no data sent on any stream. We're only initialising state for a stream in quiche. If it's not a config problem then maybe it's some other interaction. Maybe waiting for a few seconds before attempting the first stream would make a difference?

Try creating the connection, sleeping for a few seconds and then attempting the first stream. Alternatively, play around with some of the other config options and see what happens.
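A minimal sketch of the "sleep, then open the first stream" experiment. The `StreamOpener` interface, `sleep` and `newStreamAfterDelay` are hypothetical names made up for illustration; the only thing assumed about js-quic is that the connection exposes a `newStream` method, as used elsewhere in this thread.

```typescript
// Hypothetical structural type: just enough of a connection for this sketch.
// The real QUICConnection in @matrixai/quic has a much larger surface.
interface StreamOpener<S> {
  newStream(type?: 'bidi' | 'uni'): S;
}

// Resolve after roughly `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Wait before opening the first stream, to check whether the StreamLimit
// error depends on timing rather than on configuration.
async function newStreamAfterDelay<S>(
  connection: StreamOpener<S>,
  delayMs: number,
  type: 'bidi' | 'uni' = 'bidi',
): Promise<S> {
  await sleep(delayMs);
  return connection.newStream(type);
}
```

Usage would look something like `const stream = await newStreamAfterDelay(client.connection, 2000);`. If the error is unchanged after the delay, timing is probably not the cause.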

lmvdz commented 5 months ago

Anyone working on this still? Feels like we're really close.

lmvdz commented 5 months ago

quick update..

i got it to work :)

gherkins commented 5 months ago

whoa, how? I've played around with the various limits and settings but with no success whatsoever...

lmvdz commented 5 months ago

It's not 100% tx hit rate, but this is what I did:

first thing: solana-quic doesn't support bidirectional communication, so change the stream type to unidirectional. https://github.com/solana-labs/solana/blob/master/streamer/src/quic.rs#L96

const clientStream = client.connection.newStream('uni');

next, the cert thing wasn't working: CertificateRequired = 372

so I checked what Solana does and replicated it. (Solana creates a self-signed certificate; sending the private key doesn't count.) https://docs.rs/solana-streamer/latest/src/solana_streamer/tls_certificates.rs.html#9-56

then I got DecodeError = 306, checked the cert on a website and saw that the key size was unsupported, so I changed it from the default 1024 to 2048

import selfsigned from 'selfsigned';

const pems = selfsigned.generate(
  [
    { name: 'commonName', value: 'Solana node' },
    { name: 'subjectAltName', value: [{ type: 7, value: '0.0.0.0' }] },
  ],
  { days: 365, algorithm: 'ed25519', keySize: 2048 },
);

Added pems.private and pems.cert, added applicationProtos, and set verifyPeer to false because atm we don't care about verifying the TPU's certificate.

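import { QUICClient } from '@matrixai/quic';
import { webcrypto } from 'node:crypto';

// tpu_address is the 'host:port' string taken from the cluster's contact info (tpuQuic)
const client = await QUICClient.createQUICClient({
  config: {
    // Self-signed key and certificate generated above
    key: pems.private,
    cert: pems.cert,
    // We don't verify the TPU's certificate for now
    verifyPeer: false,
    applicationProtos: ['solana-tpu'],
  },
  host: tpu_address.split(':')[0],
  port: parseInt(tpu_address.split(':')[1]),
  crypto: {
    ops: {
      // Fill the provided buffer with random bytes in place
      randomBytes: async (data: ArrayBuffer): Promise<void> => {
        webcrypto.getRandomValues(new Uint8Array(data));
      },
    },
  },
});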

I was able to get a couple of txs sent and confirmed on solscan. (don't want to share the tx as it contains my wallet address)

But the number of failed tries I am getting, with error codes such as "Peer closed with application code 1", "Peer closed with transport code 11" and "Peer closed with transport code 2", is worrisome.
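For reference, here is a rough sketch of how the transport codes seen in this thread map onto the names in RFC 9000. Codes in the range 0x0100 to 0x01ff are CRYPTO_ERROR values that carry a TLS alert number in the low byte, which is why 306 (0x0100 + 50) reads as decode_error and 372 (0x0100 + 116) as certificate_required. The helper name `describeTransportCode` is made up for illustration and the alert table is deliberately partial. Application codes (such as the "application code 1" above) are defined by the peer's application protocol and are not covered here.

```typescript
// QUIC transport error codes as listed in RFC 9000, section 20.1.
const TRANSPORT_ERRORS: Record<number, string> = {
  0x00: 'NO_ERROR',
  0x01: 'INTERNAL_ERROR',
  0x02: 'CONNECTION_REFUSED',
  0x03: 'FLOW_CONTROL_ERROR',
  0x04: 'STREAM_LIMIT_ERROR',
  0x05: 'STREAM_STATE_ERROR',
  0x06: 'FINAL_SIZE_ERROR',
  0x07: 'FRAME_ENCODING_ERROR',
  0x08: 'TRANSPORT_PARAMETER_ERROR',
  0x09: 'CONNECTION_ID_LIMIT_ERROR',
  0x0a: 'PROTOCOL_VIOLATION',
  0x0b: 'INVALID_TOKEN',
  0x0c: 'APPLICATION_ERROR',
  0x0d: 'CRYPTO_BUFFER_EXCEEDED',
  0x0e: 'KEY_UPDATE_ERROR',
  0x0f: 'AEAD_LIMIT_REACHED',
  0x10: 'NO_VIABLE_PATH',
};

// A partial table of TLS alert descriptions (RFC 8446); extend as needed.
const TLS_ALERTS: Record<number, string> = {
  40: 'handshake_failure',
  42: 'bad_certificate',
  43: 'unsupported_certificate',
  48: 'unknown_ca',
  50: 'decode_error',
  116: 'certificate_required',
  120: 'no_application_protocol',
};

// Hypothetical helper: turn a numeric transport code into a readable label.
function describeTransportCode(code: number): string {
  if (code >= 0x0100 && code <= 0x01ff) {
    const alert = code - 0x0100;
    const name = TLS_ALERTS[alert] ?? 'unlisted alert';
    return `CRYPTO_ERROR (TLS alert ${alert}: ${name})`;
  }
  return TRANSPORT_ERRORS[code] ?? `unknown transport error ${code}`;
}
```

Under this mapping, transport code 2 is CONNECTION_REFUSED and transport code 11 is INVALID_TOKEN. Why the Solana side sends them in these cases would still need to be checked against its implementation.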

We will need to dig into what exactly Solana is doing for the connection/transaction, because opening all these QUIC connections just to send one transaction doesn't seem very smart and will get our IPs blacklisted or something...

as far as sending the transaction and getting it confirmed in an efficient way, we can discuss that on the tpu-client repo...

tegefaulkes commented 4 months ago

Oh cool. Yeah, the stream limit error makes sense now. It would've been hitting the stream limit for one of the directions since it was unidirectional only.
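To make the direction point concrete: in QUIC (RFC 9000, section 2.1) the two lowest bits of a stream ID encode who opened the stream and whether it is bidirectional or unidirectional, and each kind has its own limit. The `QUICStream 0` in the earlier log is therefore a client-initiated bidirectional stream. A small sketch, with `classifyStreamId` being a name made up for illustration:

```typescript
type StreamKind = {
  initiator: 'client' | 'server';
  direction: 'bidi' | 'uni';
};

// Classify a QUIC stream ID by its two least significant bits.
// Uses arithmetic rather than bitwise operators so that IDs above 2^31
// (but within Number.MAX_SAFE_INTEGER) are still handled correctly.
function classifyStreamId(id: number): StreamKind {
  if (!Number.isSafeInteger(id) || id < 0) {
    throw new RangeError(`invalid stream id: ${id}`);
  }
  const low = id % 4;
  return {
    // Bit 0: 0 = client-initiated, 1 = server-initiated
    initiator: low % 2 === 0 ? 'client' : 'server',
    // Bit 1: 0 = bidirectional, 1 = unidirectional
    direction: low < 2 ? 'bidi' : 'uni',
  };
}
```

With this scheme the first client-initiated unidirectional stream has ID 2, so a peer that only allows unidirectional streams could plausibly reject stream 0 with a stream limit error while accepting stream 2.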

The number of failures there is odd. But they're all peer errors with the code provided by the peer. Hopefully solana-quic documents what these codes are.

gherkins commented 4 months ago

so, the only change really needed on js-quic side was the servername option to be configurable, right?

gherkins commented 4 months ago

@tegefaulkes maybe just add this via config option, since it won't break any method signatures...

https://github.com/MatrixAI/js-quic/pull/106

CMCDragonkai commented 3 months ago

Does #122 close this?

lmvdz commented 3 months ago

Yes

tegefaulkes commented 3 months ago

I'm marking this as closed then. Fixed by #122