3Hren / msgpack-rust

MessagePack implementation for Rust / msgpack.org[Rust]
MIT License
1.17k stars 130 forks source link

`PathBuf` serialises as string, should serialise as bytes #321

Closed WhyNotHugo closed 1 year ago

WhyNotHugo commented 1 year ago

I'm working on a tool that serialises its command line arguments and sends them over a socket. These are paths to files std::path::PathBuf.

std::path::PathBuf is not strictly speaking a string: it's actually backed by std::ffi::OsString. The biggest difference is that the String type in Rust is UTF-8, but filenames don't need to be valid UTF-8 (at least on Unix-like OSs)

Command line arguments aren't strings either (std::ffi::OsString). They also have no requirement to be valid UTF-8. Although most filenames are utf-8 (honestly, most of my file's filenames are ascii), there are exceptions and those make serialisation/de-serialisation fail. (for example, I've a bunch of photos and files from window from 2 decades ago and lots of their filenames are borked, but somehow they're still accessible).

In particular, interop also becomes tricky, because other languages will also treat the input expecting a utf8 string instead of a bytearray (e.g.: python) and crash due to a data-type mismatch.

I think PathBuf should be serialised into bin format, to avoid potentially invalid strings, and make sure the data type represents the underlying Rust type well. Do you think this is a reasonable change?

kornelski commented 1 year ago

I'm not sure about this, because it seems like a choice between two bad ways, only with different failure modes.

Native representation of Windows paths as UCS-2 is incompatible with the bag-of-ASCII-superset representation of paths on non-Windows. There exists WTF-8, but it's explicitly not meant for public use like this, and still can't be portable, because such broken names fundamentally can't be portable across systems.

I suppose the closest to reasonable solution would be to keep using UTF-8 on Windows, and not support broken paths there. And support a bag of bytes on Unix, and just hope users know what they're doing when moving paths between machines (without encoding information the paths are unusable unless both ends are set up to use the same codepage).

But it seems like a slim improvement? I'm not sure if it's worth changing the serialization format for this.

WhyNotHugo commented 1 year ago

I'm not familiar with how paths work on Windows. Can all PathBufs be represented as UTF-8? I get the impression that they cannot either, so the same issue would occur on Windows.

I guess that serialising a path in windows and deserialising it on Unix won't work, but that's actually the real nature of the underlying data (and a file with such name cannot exist on the other platform either). At least serialising on Unix to deserialise on Unix (and serialising on Windows to deserialise on Windows) could be made to work.

But it seems like a slim improvement?

Essentially, this makes RMP Serde unusable with PathBufs. And honestly, RMP Serde save a lot of work in manually writing serialisers/deserialisers for structs.

To clarify the current behaviour: serialising simply crashes: Syntax("path contains invalid UTF-8 characters") any time a PathBuf is non-UTF8 (which is perfectly valid input).

kornelski commented 1 year ago

Okay, PathBuf could use a string whenever possible and fall back to bytes only for non-UTF-8 Unix paths. That would be backwards compatible and still support other encodings. Would you like to implement that?

WhyNotHugo commented 1 year ago

Okay, PathBuf could use a string whenever possible and fall back to bytes only for non-UTF-8 Unix paths.

I think this is very tricky, because suddenly my schema has a field with a type that can be either String or Bytes. The reason I'm using msgpack is because I want to define a clear schema where other clients can interoperate; having a field where the type suddenly changes depending on content really makes things complicated for no good reason.

That would be backwards compatible

I appreciate wanting to keep backwards compatibility, but the current behaviour tries to serialise (and deserialise) potentially non-utf8 data as utf8 and simply crashes. Rust doesn't use String to back PathBuf data for a reason.

Fixing bugs will always break someone's some mis-use of it.

kornelski commented 1 year ago

I can't fix it. This is a hardcoded serde behavior:

https://github.com/serde-rs/serde/blob/0e947e6c3b2ff240e15c47fcf76e3bed5c099d0b/serde/src/ser/impls.rs#L863-L870

Please raise the issue with serde.