Totodore / socketioxide

A socket.io server implementation in Rust that integrates with the Tower ecosystem and the Tokio stack.
https://docs.rs/socketioxide
MIT License
1.26k stars 56 forks source link

Relationship between `Bin` and the structure of `Data` can be ambiguous for binary packets #276

Closed kelnos closed 3 weeks ago

kelnos commented 8 months ago

Describe the bug

Related to #275, if socketioxide is fixed to support complex binary data formats, the handler API as-is becomes ambiguous and difficult to work with.

The handler presents the JSON structure as Data, and the binary payloads as Bin, essentially an array of payloads. Right now the API works fine in the supported cases, where the payload is a single-level array with some elements as binary, or a simple object where the root object itself is a single binary payload. That is, it's relatively unambiguous what parts of the data payload the binary bits correspond to. However, consider a more complex payload like this (currently unsupported by socketioxide, but I would like to fix that):

{
  "foo": 42,
  "bar": {
      "baz": "quux",
      "biff": { "_placeholder": "num": 0 }
  },
  "zip": { "_placeholder": "num": 1 }
}

From this structure, it might be obvious that the first payload in Bin is "biff", and the second is "zip", but that's only clear to someone who knows the protocol: a user of, say, the JavaScript socketio client is just passing around JSON objects with binary blobs in them. They don't know that the client is going to assign "biff" to the first payload and "zip" to the second. In fact, I'm not sure there's any requirement it do so; it might decide to do a "breadth-first" search for binary bits, and assign "zip" as the first payload and "biff" as the second.

To Reproduce

(There's no reproduction, as socketio doesn't currently handle data structures like this at all.)

Expected behavior

socketioxide should support some sort of "combined" view of the data by default. For example, I should be able to define some Rust structs for the above payload like so:

struct Bar {
    baz: String,
    biff: Vec<u8>,
}

struct Payload {
    foo: i32,
    bar: Bar,
    zip: Vec<u8>,
}

And then I should be able to create a socketioxide message handler that can convert a binary packet into a Payload struct. I shouldn't need to deal with separate Data and Bin arguments, and try to figure out how the elements of Bin match up with the structure of `Data.

Versions (please complete the following information):

Additional context

I don't believe it's possible to implement this myself today, using FromMessage or FromMessageParts, as Packet.incoming() intentionally discards the information needed to do this.

Totodore commented 8 months ago

After digging this issue a bit, the best solution would be to implement something based on serde directly. The following are possible ideas that need to be explored:

For the moment the first idea seems the best (if there are no blocking points) because with the second one, it would require to have a socketioxide-proc-macros lib created.

kelnos commented 8 months ago

That makes sense to me. I feel like the first idea does seem possible, and I agree it would be better. My first thought of how to handle this might be:

  1. Deserialize the incoming payload into serde_json::Value, and leave it as is. The packet type will let us know if it's binary, and how many attachments there are (currently it looks like the attachments count is ignored).
  2. As binary attachments are received, decode them, and then walk the payload, finding the _placeholder with the correct num. Replace that _placeholder object with a Value::Array(), filled with the binary data as a Vec<Value> (where each element of the vec is a Value::Number.
  3. When all attachments have been received, pass the final Value object to serde, to deserialize it into whatever the user-provided type in the handler is.

But there's one potentially-large downside of this: converting a Vec<u8> coming from the wire into a Vec<Value> seems pretty wasteful, and for larger binary sizes it may be pretty slow to convert between the two. An alternative that I kinda like is for for socketioxide to have its own (private) wrapper enum:

enum PayloadValue {
    Null,
    Bool(bool),
    Number(serde_json::value::Number),
    String(String),
    Array(Vec<PayloadValue>),
    Object(serde_json::Map<String, PayloadValue>),
    PendingBinary(u32),  // u32 is the 'num' field
    Binary(Vec<u8>),  // once the binary payload is received, PendingBinary gets replaced with Binary
}

It's of course annoying to duplicate serde_json::Value like that, but maybe it's worth it to do so. I think #[derive(Serialize, Deserialize)] will even almost work on that struct, if we provide a custom #[serde(with = "...")] for the PendingBinary and Object variants.

How does this sound to you? I'm happy to start work on it if you think it's a good approach.

Also curious as to what you think about backward compatibility. I'm not sure how tricky it will be to preserve the existing behavior for the cases that are supported (single-level array that can contain one or more binary blobs, single-level object that just holds a binary blob).

kelnos commented 8 months ago

Some more regarding back-compat: I didn't realize that BinaryPacket::incoming() was public, as are its .data and .bin members. With the new way of doing things, .bin is no longer needed (or possible, really, without making data copies), and .data is now the private PayloadValue type.

A few options:

  1. Maybe PayloadValue should actually be public? Then the types of ::incoming() and .data change. I'm not feeling too great about this, though.
  2. PayloadValue stays private, and:
    • BinaryPacket::incoming() keeps its type signature, and converts the passed Value to a PayloadValue (and internally, we can use a new, private, BinaryPacket::incoming_payload_value() to avoid an extra data type conversion).
    • BinaryPacket::data becomes private, but there's a new BinaryPacket::data() method that converts the internal PayloadValue into a serde_json::Value.
    • BinaryPacket::bin becomes private, but there's a new BinaryPacket::bin() method that returns a Vec<&Vec<u8>> by walking the PayloadValue and pulling out the PayloadValue::Binary bits.
  3. Back-compat is maintained: .data and .bin are retained, and are populated as best as possible, copying the data if needed. The PayloadValue version of .data is a new private member.

I think I like the second option the best: it retains the previous functionality if crate users need it, even if they will have to update their code. The first option would require users to make bigger changes (and would expose what, to me, is an internal implementation detail). The third option would allow users to upgrade without making any changes, but would mean more work and higher memory usage per binary packet, for usage of the library that would hopefully turn into the minority usage.

Also, as an aside, I guess this means we have to figure out the outgoing binary packet story too.

Totodore commented 8 months ago

There is no problem to break backward compatibility. The library is in its beta state and it is explicitely stated in the readme that API will change over the time. Appart from that, in fact incoming and outgoing should be private.

Your approach with this might be really good:

enum PayloadValue {
    Null,
    Bool(bool),
    Number(serde_json::value::Number),
    String(String),
    Array(Vec<PayloadValue>),
    Object(serde_json::Map<String, PayloadValue>),
    PendingBinary(u32),  // u32 is the 'num' field
    Binary(Vec<u8>),  // once the binary payload is received, PendingBinary gets replaced with Binary
}

If I understand want you want to do. After getting this PayloadValue, a custom Deserializer implementation will convert the PayloadValue to the user provided struct?

kelnos commented 8 months ago

There is no problem to break backward compatibility. The library is in its beta state and it is explicitely stated in the readme that API will change over the time.

That makes things a bit easier :)

If I understand want you want to do. After getting this PayloadValue, a custom Deserializer implementation will convert the PayloadValue to the user provided struct?

Yes, that's basically it. I played around with this a bit, and it doesn't even require much of a custom Deserializer. It just needs a little "help" with the Object variant.

For outgoing packets, I'm thinking of doing something similar, but in reverse. If the user provides some payload struct that looks like this:

struct SomePayload {
    id: u32,
    data: Vec<u8>,
}

... then I can probably "deserialize" that into:

PayloadValue::Object(HashMap{
    "id" -> PayloadValue::Number(32),
    "data" -> PayloadValue::PendingBinary(0),
})

And then when serializing it out into a packet to send, that'll get converted to a placeholder.

Anyhow, I'll push a PR when I have something for you to look at... sounds good?

Totodore commented 8 months ago

Sure ! Don't hesitate to add benchmarks and unit tests to check how it behaves compared to the current packet deserialization.

kelnos commented 8 months ago

Argh, I had a whole long message typed up here yesterday, but looks like it didn't post, and I lost it. I guess this is an opportunity to be less verbose :)

Got the PayloadValue concept working fairly well in the context of packet.rs. Moving outside it, I'm running into some trouble. FromMessage/FromMessageParts and MessageHandler are problems. Right now they take a serde_json::Value and a Vec<Vec<u8>> for the non-binary and binary parts respectively. In the new way of doing things, there's no separation. So what do we pass here? serde_json::Value no longer works: you can't embed binary data in there. (Well, you could, as a Value::Array of Value::Number, but then you make it essentially impossible to distinguish between binary data and something that's just an array of numbers.)

My next thought was to just get rid of the idea of an intermediate representation, and have it pass a T: Deserialize into those trait methods. The problem there is that I'm not sure how to actually store those message handlers in Client. Right now you're using a type-erasure trick, which works now because the Client doesn't really need to know the type the Data extractor is deserializing to in any given MessageHandler. But if the handler exposes the T directly, Client needs to do that deserialization now, so of course it would to know what T is, which means I don't think you can erase the type... which means I'm not really sure how you'd store the handler in that map. It's possible my imagination about how to (ab)use Rust's type system is failing me, and there is actually a way to do this, but I just don't know how. If you think this is doable and have any pointers (and think it's the right path to take), please let me know.

Another option would be to actually just go ahead and expose PayloadValue publicly. Initially I didn't really like that idea, but I'm starting to warm to it. One thing I'd probably do is remove the BinaryPlaceholder variant (what I used to call PendingBinary), and just have PayloadValue::Binary(usize, Vec<u8>). When the full binary event hasn't yet been fully received, the vec will just be empty. That would also makes incorporating received binary attachments a little easier and less expensive, so I'm going to do that even if we don't want to make it public.

What are your thoughts here?

kelnos commented 8 months ago

Now looking at ack.rs, I see instances of serde_json::Value there exposed publicly, like in AckInnerStream, which according to docs needs to be public for people who implement Adapter.

So I'm feeling more and more like making PayloadValue public is the simplest way forward.

Totodore commented 8 months ago

My next thought was to just get rid of the idea of an intermediate representation, and have it pass a T: Deserialize into those trait methods. The problem there is that I'm not sure how to actually store those message handlers in Client.

It is impossible to store a per-handler generic T, or to erase it (because of the necessity to have it to Deserialize it).

However I don't understand why you can't simply replace the serde_json::Value that is passed to the type erased handler by your custom Value enum? And then call from_socketio::<T>(Value) in the data extractor (the same way than currently).

Totodore commented 8 months ago

Now looking at ack.rs, I see instances of serde_json::Value there exposed publicly, like in AckInnerStream, which according to docs needs to be public for people who implement Adapter.

Indeed ack.rs is published but it is only because it is required to implement an adapter. And this use case is really rare. For me directly passing your custom Value type to the handler would break the current design with extractors (because it would be more complex for the user to manage it's incoming data).

Btw, don't hesitate to publish a draft PR so that I can check the code and that we can discuss implementation on it rather than on the issue :).

kelnos commented 8 months ago

However I don't understand why you can't simply replace the serde_json::Value that is passed to the type erased handler by your custom Value enum?

Yes, can do that, but then that would make my custom Value enum public, since FromMessageParts/FromMessage/MessageHandler are public. If we think that's ok after all (as I'm starting to think), then there's no problem.

Btw, don't hesitate to publish a draft PR so that I can check the code and that we can discuss implementation on it rather than on the issue :).

Will do! Gonna see if I can at least get things compiling, but either way I'll push something later today.

Totodore commented 8 months ago

Yes, can do that, but then that would make my custom Value enum public, since FromMessageParts/FromMessage/MessageHandler are public. If we think that's ok after all (as I'm starting to think), then there's no problem.

Yes it is not a problem, because even if it is public it is not meant to be used by the user. We can even make it #[doc(hidden)].

kelnos commented 8 months ago

Hm one thing -- the public methods on Packet currently accept a serde_json::Value. When I was thinking the new PayloadValue would be private, I changed those to take a T: Serialize instead.

Now that we're making PayloadValue public (though "semi-hidden"), I'm realizing T: Serialize doesn't work so well in some situations, and things would be simpler if I could just accept a PayloadValue, in place of the current serde_json::Value they currently take.

But that would make PayloadValue more prominent in the API. Is that ok?

bossman1999 commented 7 months ago

When will this issue be resolved?