MatrixAI / js-quic

QUIC Networking for TypeScript & JavaScript
https://matrixai.github.io/js-quic/
Apache License 2.0
13 stars 1 forks source link

Gracefully handle certain `QUICSocket.send` errors #86

Closed tegefaulkes closed 7 months ago

tegefaulkes commented 7 months ago

Description

This PR attempts to fix the problem with socket send errors not being handled properly when they're specific to the connection triggering the send.

Issues Fixed

Tasks

ghost commented 7 months ago
👇 Click on the image for a new way to code review #### [![Review these changes using an interactive CodeSee Map](https://s3.us-east-2.amazonaws.com/maps.codesee.io/images/github/MatrixAI/js-quic/86/d1bf84fe/70df59c1a4ca985dea831fd7d0412f78d70dc280.svg)](https://app.codesee.io/r/reviews?pr=86&src=https%3A%2F%2Fgithub.com%2FMatrixAI%2Fjs-quic) #### Legend CodeSee Map legend
tegefaulkes commented 7 months ago

I think this stems from a design decision where data flowing downwards are normal function calls where data flowing upwards would be events. In this case it's making it inconvenient to bubble the send error where we need it.

Consider the following.

image

The QUICConnection emits a send event, the QUICClient OR QUICServer takes this and passes it along to the QUICSocket. This is already a little round-about but it breaks any reasonable flow back the the QUICConnection if there was an error.

The recv path takes a much cleaner flow through normal function calls.

For our needs, it would be much simpler if the QUICConnection makes the QUICSocket calls directly, then it can catch the errors it needs. Something like the following.

image

CMCDragonkai commented 7 months ago

I want to maintain the design decision for now.

To solve the #78 I think the proper solution is about distinguishing these kinds of errors.

This is still doable with event/push flow.

It's a matter of the the thing catching the exception and then deciding exactly what kind of event to dispatch.

So instead of changing our control flow and data flow design, keep it the same and figure out how to distinguish 2 different event types.

CMCDragonkai commented 7 months ago

Remember push flow is inherently multi-consumer possible. So QUIC connections are all registered on the the socket events, but one can also register a specific kind of event type - if they all register to the same event type, they can further check the contents of the event to see if it is relevant to them.

Generally it's more efficient to register unique event IDs - but this may not always be possible.

CMCDragonkai commented 7 months ago

Any design decisions on how events is going to work needs to take into the global consideration of https://github.com/MatrixAI/Polykey/issues/444, and to add commentary there.

CMCDragonkai commented 7 months ago

We need to make sure our event abstractions and push flow abstraction makes sense in all our work.

tegefaulkes commented 7 months ago

If we want to go the event route, then there are two things to do.

  1. QUICConnection needs to have a reference to QUICSocket to be aware of any send errors.
  2. QUICConnection needs to add an event handler to QUICSocket send errors to handle the errors.

Since there are many connections to 1 socket, each connection will have to filter for the events only relevant to them. This means every active connection will have to do some processing for every error event filtering for the only one relevant to itself. I'm not sure I like that.

Say we have 100 connections and the network cuts out, Now we have 100 sends failing. Each failed send will have an event being handled by all 100 connections trying to filter for what it wants. So we have 10000 if-checks happening. It's not going to scale very well.

CMCDragonkai commented 7 months ago

Firstly QUICConnection already has reference to the QUICSocket. And there's only 1 socket for all of them.

As for scaling the dispatch, there are ways to do this, lots of ways to achieve it, but the original dataflow/controlflow design should be maintained.

CMCDragonkai commented 7 months ago

Like I said, it's possible for lots of connections to register specific event IDs, thus enabling O(1) dispatch.

CMCDragonkai commented 7 months ago

The event names right now correspond to the to the name of the type. But I believe if you go to js-events you can imagine how it can be architected to support parameterised names.

And this should lead you to understand how that might work seamlessly with observables.

tegefaulkes commented 7 months ago

Okay, I'll look into it.

CMCDragonkai commented 7 months ago

And every importantly, the node specific warnings on event listener counts might need to be entirely eliminated, and instead what we do is manage the total counter of events in a global event bus system that keeps track of it, and can do a total warning if it detects resource leaks at a global level.

tegefaulkes commented 7 months ago

EventTarget is kind of not great to use, I've been thinking we'd just make our own pure TS event class at some point.

CMCDragonkai commented 7 months ago

That's kind of what AbstractEvent is atm. I think that js-events is what needs to do its overrides/abstractions there, but the underlying system should still be EventTarget to keep it compatible with the rest of the JS ecosystem.

tegefaulkes commented 7 months ago

I've got the fix down but just need to review and polish.

I'm thinking I can do a fix for https://github.com/MatrixAI/Polykey-CLI/issues/115 in this PR as well. At minimum I have the groundwork down for it. I just need to work out what errors happen when the internet drops out.

tegefaulkes commented 7 months ago

Ultimately the solution was to create a new event for a send error that is coded with the connection ID of the connection so the connection can listen for the error event for it specifically. Right now it only this logic applies mostly to QUICClient since the EINVAL error can only happen when specifying bad parameters for new connections. server connections shouldn't run into this.

tegefaulkes commented 7 months ago

I think if we just ignore or do warnings about the following error then we can prevent crashes when the network cuts out.

Error: send ENETUNREACH ::ffff:13.54.214.222:1314
    at doSend (node:dgram:716:16)
    at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18)
    at afterDns (node:dgram:662:5)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  errno: -101,
  code: 'ENETUNREACH',
  syscall: 'send',
  address: '::ffff:13.54.214.222',
  port: 1314
}
tegefaulkes commented 7 months ago

I'm almost done with this. I'm just adding some tests to test network outages. Mainly mocking the udpsocket send to throw the same kind of error I posted in the above comment.

The first test is working fine. But the 2nd where the connection times out when the sends is failing is working mostly. The only problem is the ErrorQUICConnectionIdleTimeout is being thrown somewhere even after the test ends successfully. Trying to track down the error through all the event handling is tricky since we're loosing basically all useful stack information in the process.

CMCDragonkai commented 7 months ago

Error handling for push flows is inherently different from pull flows. Errors have to explicitly handled at the handler and possibly flow to a different global error handler. Don't rely on the error stack for push flows.

tegefaulkes commented 7 months ago

I'm considering this one done now if you want to look it over.