Data format adapter to expose control over the representation of bytes

dtolnay commented 4 years ago

Human readable formats tend not to include a universally agreed way to represent arbitrary binary data, which means those serde libraries can end up using a representation for serde's "bytes" type which isn't ideal for all uses.

Three examples:

// [dependencies]
// serde = { version = "1.0", features = ["derive"] }
// serde_bytes = "0.11"
// serde_json = "1.0"
// serde_yaml = "0.8"
// toml = "0.5"

use serde::Serialize;

#[derive(Serialize)]
struct Demo {
    #[serde(with = "serde_bytes")]
    bytes: Vec<u8>,
}

fn main() {
    let bytes = b"testing".to_vec();
    let s = Demo { bytes };

    println!("JSON: {}", serde_json::to_string(&s).unwrap());
    println!("YAML: {}", serde_yaml::to_string(&s).unwrap());
    println!("TOML: {}", toml::to_string(&s).unwrap());
}

JSON: {"bytes":[116,101,115,116,105,110,103]}
YAML: ---
bytes:
  - 116
  - 101
  - 115
  - 116
  - 105
  - 110
  - 103
TOML: bytes = [116, 101, 115, 116, 105, 110, 103]

It would be good to have an adapter which one can wrap around a Serde serializer or deserializer to make it use a different underlying representation for binary bytes, such as base64 (which itself may need to be customizable with base64 character set and padding).

Usage may look like:

// serialization

let value = /* object to be serialized */;

let mut ser = serde_json::Serializer::new(out);
let base64_config = base64::Config::new(base64::CharacterSet::UrlSafe, true);
let ser = BytesRepr::base64(&mut ser, base64_config);
value.serialize(ser)?;

// deserialization

let json = /* json string */;

let mut de = serde_json::Deserializer::from_str(json);
let de = BytesRepr::base64(&mut de, base64_config);
let t = T::deserialize(de)?;
println!("{:?}", t);

The implementation would look like a wrapper which wraps a Serializer or Deserializer and forwards through all the methods of those traits except for serialize_bytes and deserialize_bytes, converting those instead to serialize_str / deserialize_str with the appropriate base64 (or other) encoding applied.

jethrogb commented 4 years ago

I think we considered this approach before doing https://github.com/serde-rs/json/pull/656 an ran into some issues. However, it does seem like a wrapper should work. @jseyfried?

jethrogb commented 4 years ago

I believe the proposal is not possible. Implementations of Deserializer will have implementations of SeqAccess (and equivalent traits) in which the implementation of next_element_seed calls seed.deserialize(some_deserializer). See e.g. https://github.com/jethrogb/serde_json/blob/d133748/src/de.rs#L1735. The adapter would need to be able to wrap some_deserializer, but it can't.

dtolnay commented 4 years ago

Not clear why it couldn't wrap that. Check out these existing deserializer adapter libraries:

For example next_element_seed is wrapped here in one of them: https://github.com/dtolnay/serde-stacker/blob/9f19950f4490fc125f2a808301c58fa168def34d/src/de.rs#L678.

cfsamson commented 4 years ago

Are you thinking something along the lines of the example in this gist: https://gist.github.com/cfsamson/18009ca988ab1405b6ea6aaac819aed6 (I only created an example for serialization to see if I got the idea right)

This would support a fixed set of encoders/decoders for example base64, hex, Uuencoding? etc.

One thing I couldn't get working was serializing a struct using with = "serde_bytes". Somehow it seems to still call Serializer::serialize_seq which is unexpected and creates some problems.

dtolnay commented 4 years ago

That is in the right direction but you need an implementation of serialize_struct and many of the other methods. Currently you've serialized structs exactly the same as the underlying serializer.

Refer to https://github.com/dtolnay/serde-stacker/blob/0.1.4/src/ser.rs for an example of propagating custom behavior to struct fields correctly.

cfsamson commented 4 years ago

Yeah, that's a good reference implementation. I've always wanted to dive a bit more into both generics and Serde. This looks like a good opportunity to do that, so I'd be happy to take a stab at this and try to get something working.

jethrogb commented 4 years ago

@cfsamson We're intending to port over https://github.com/serde-rs/json/pull/656 to this design.

cfsamson commented 4 years ago

@jethrogb Ah, ok. Well, I've implemented a working serializer here https://github.com/cfsamson/rfi-serde-byte-repr, and I thought I'd check in before I went any further. I'm not sure if this still has any value for other serializers/deserializers (toml, yaml) though.

cfsamson commented 4 years ago

@dtolnay I might have jumped the guns a bit by finishing an implementation without waiting for any discussion. If so, that's on me, but sometimes it's hard to stop.

I recon this might still be useful for toml/yaml (de)serialization even though there are ongoing discussions on how to handle this in serde_json.

It could do with more tests though and ideas for good test cases is very welcome.

I haven't published anything yet but those interested can take a look here:

Documentation
link to the repository: https://github.com/cfsamson/rfi-serde-byte-repr

I decided to change the naming a bit. I'm still not sure about it though, it might be better to follow the conventions used in serde_stacker (a really cool library btw).

jethrogb commented 4 years ago

@cfsamson can you create an empty branch in your repository and then file a pull request from master into that branch? That would make it easy to leave review comments.

cfsamson commented 4 years ago

@jethrogb Sure. You can find the PR here: https://github.com/cfsamson/rfi-serde-byte-repr/pull/1

Edit Sorry, I thought the first commit was empty but I see now that quite a bit was already written cluttering up the PR. I'll see if I can create a totally empty branch with a related history.

cfsamson commented 4 years ago

@jethrogb So, I found a way to make this easy to review, but it included creating a new repository. You can take a look here: https://github.com/cfsamson/serde-byte-repr/pull/1

ssokolow commented 4 years ago

I'd prefer to have a default bytestring representation for PathBufs that leaves them human-readable in the 99%+ of cases where they don't contain invalid UTF-8.

In an attempt to follow the principle of least surprise, I cooked up an encoding scheme for my own use which works by using the null codepoint (which the JSON spec requires support for but which is only allowed in paths in the NT kernel object manager, where counted strings are used) as an escape character to indicate "the following codepoint should be re-interpreted as a byte".

Thus, with my scheme, any OsString not containing nulls or invalid UTF-8 passes through un-altered and the only strings which would have serialized successfully before but now have a different representation are those containing nulls.

I haven't published it to crates.io, but my own implementation is at https://gist.github.com/ssokolow/0d9f5c5e4a8a37a962875af205bcc723 if anyone wants to take a look, MIT/Apache licensed, complete with a small round-trip testing suite and, though the harness isn't included in the gist, I even did some performance tuning using Criterion. (Though I have no experience with low-level performance tuning and have no idea what performance to expect, so I was just flailing around, trying things blindly.)

If there's sufficient interest, I could try to make it the first crate I publish to crates.io, but it may take some time, since things are a bit of a mess right now.

cfsamson commented 4 years ago

@dtolnay Any idea on how/if we can move this forward?

I've created a working implementation in line with the initial proposal (AFAIK).

@ssokolow If I understand correctly you want to special case OsString so it's serialized to a readable format. I'm not sure if that's something that should go in this project since it's specifically about handling the representation of bytes. Converting bytes to text is a bit of a strange edge-case, but I'd rather this be done very explicitly than special casing it here.

I would think that this check could be done by another crate that took care of these kinds of text/bytes special cases and chose the right (de)serialization strategy based on some checks for validity.

dtolnay commented 4 years ago

I will close out this issue with a link to that repo. We can follow up on any further feature work in the issue tracker over there!

If you find @ssokolow's custom encoding scheme compelling then that would be implementable as one of the ByteFormat options in serde-byte-repr.

cfsamson commented 4 years ago

Sounds good. Please link to https://github.com/cfsamson/rfi-serde-byte-repr, the other repo was one I crated to make the codebase easier to comment on and will get deleted.

dtolnay commented 4 years ago

Looks like that is the one I linked to.

ssokolow commented 4 years ago

@cfsamson I want to ensure that PathBuf gets serialized in human-readable form while resolving the current issue that paths which can only be represented by OsString cause serialization to fail. I think that's a reasonable goal since 99%+ of paths are valid as Strings.

It's also in line with the current behaviour, which treats it as a String and fails if it contains non-UTF-8 bytes. I like to think of my scheme as having backwards compatibility goals akin to how all valid 7-bit ASCII is already valid UTF-8.

Because I used null as my escape character and neither POSIX nor Win32 allow nulls in paths, the encoding remains inter-compatible with what Serde currently does.

(Null is the only character which the JSON spec requires support for but which is disallowed in POSIX paths. The alternative would probably be something in the Private Use Area, in which case I'd have to research which codepoint is least likely to collide with things that live there, like ConScript and MUFI... it'd still work, but inter-compatibility with non-escaping serializers/deserializers would be lost for that codepoint.)

It just so happens that my encoding scheme works for any OsString as a side-effect of what a PathBuf is.

cfsamson commented 4 years ago

@ssokolow I see, well, as suggested that can be implemented as an option on ByteFormat which now only supports Hex and Base64, maybe there can be an option called ReadablePaths or something like that (I have no good name for this).

If you're willing to publish a crate with this encoding scheme with an API like the Base64 crate for example it would be only a few lines of code to add that option in.

I think I'm going to publish this crate on crates.io as it is now and we'll add the support you want as soon as it's finished. You can send an issue to that repo with a link to the crate or a PR. Does that sound like a plan?

ssokolow commented 4 years ago

Sounds good. For the last few days, I've had to pour all my effort into preparing for some hard drive replacements and overdue upgrades, but I'll leave this on my TODO list.

cfsamson commented 4 years ago

Great. The crate is now published: https://crates.io/crates/serde-bytes-repr

dtolnay / request-for-implementation

Data format adapter to expose control over the representation of bytes #48