Jezza / mutf8

Basic support for mutf8 strings in Rust.
MIT License
4 stars 3 forks source link

Implementing `Deserialize` and `Serialize` for MString and mstr #1

Closed mainrs closed 4 years ago

mainrs commented 4 years ago

As both strings can (with some overhead) be converted back into String and str, it should be possible to implement both serde traits. The implementations would just convert between Rust native types and the mutf8 types.

On a side note, I happen to use this crate and am not quite sure what your FIXME comments mean in certain positions, specificalle https://github.com/Jezza/mutf8/blob/0c29171701a5bb4c275eac3298e9c3449074b88e/src/lib.rs#L44-L46 😅

mainrs commented 4 years ago

I use a cloned version for now that does it the following way:

#[cfg(feature = "serde")]
impl Serialize for MString {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error> where
        S: Serializer {
        serializer.serialize_str(self.to_str().as_ref())
    }
}

#[cfg(feature = "serde")]
struct MStringVisitor;

#[cfg(feature = "serde")]
impl<'de> serde::de::Visitor<'de> for MStringVisitor {
    type Value = MString;

    fn expecting(&self, formatter: &mut Formatter) -> std::fmt::Result {
        formatter.write_str("an utf8 string")
    }

    fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
        where
            E: serde::de::Error {
        Ok(MString::from_utf8(v))
    }

    fn visit_string<E>(self, v: String) -> Result<Self::Value, E>
        where
            E: serde::de::Error {
        Ok(MString::from_utf8(v))
    }
}

#[cfg(feature = "serde")]
impl<'de> Deserialize<'de> for MString {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error> where
        D: Deserializer<'de> {
        deserializer.deserialize_string(MStringVisitor)
    }
}

With these changes to Cargo.toml:

[dependencies]
serde = { version = "1.0.111", features = ["derive"], optional = true }

Just noticed, derive shouldn't be even needed as I implemented them myself.

Jezza commented 4 years ago

Hiya, thanks for the request!

I'll see about implementing it. Shouldn't take too long.

About the comments... Yeah, generally they're notes for myself, and I end up removing them quickly, as I tend to do the thing they're suggesting. In this case, I guess I didn't. Once I got the functionality up and running, I didn't really need to go much further with it.

Hell, I actually made the main functions in my version public, as that's how I use them.

I remember designing a new system to make it easier to deal with data, but I think I wrote that in a different crate. I'll have to search for it.

But point of the matter, yeah, I'll get right on that, and see about updating it.

Thanks!

Jezza commented 4 years ago

I'm fine just doing a simple implementation, whereas I just dump the UTF-8 string, but it raises the question that if you do a round trip, you won't have the same data anymore.

It'll represent the same thing, so I don't have a massive issue... But still.... I'm still deciding if serialising it straight to bytes... No idea how that would work with stuff like json. So UTF-8 is fine for now.

Jezza commented 4 years ago

Hm... Yeah, didn't think it was going to be this easy...

So, if the string actually contains invalid UTF-8, such as the NUL byte, then I can't serialise it...

I think I'll end up having to dump the bytes.

mainrs commented 4 years ago

Thanks for taking your time! I use this crate for getting better at using nom and I do actually do the same, parse java bytecode files. I tried to understand your class_file crate but honestly the whole bincode macros through me off hehe :D

Did you ever come around to implementing a bytecode manipualtion/analysis library? That is my final goal as I want to read out bytecode files for certain values.

The serde implementation needs to be reproducible. So dumping wouldn't probably work correctly. I've read through the mut8 reference and I thought that you could reproduce from utf8 to mutf8 without losing something IIRC. But maybe I've overlooked something.

I dont want to ask too much, but could you also implement Display if you're already at it? Maybe just returning a utf8 representation would be fine here. Not sure how it would work if it's not valid utf8 though.

Jezza commented 4 years ago

Haha, yeah, it's a bit complex, but that's only because I wanted to reduce a lot of the boilerplate I needed to write. It should be relatively straightforward to follow. I can answer any questions if you can any in particular.

Yeah, and no. So I kind of actually left the idea of class_file behind. My issue was that it was too.... invasive. You needed to construct all of the types. And modifying classfiles was even weirder.

When I last touched that topic, I decided to work on a better solution with classfile manipulation. As it would solve the whole reading, writing, and modifying issue. It was taking on a similar shape to ObjectASM, and I've gotten pretty far along with it.
I'm using it as a backend for one of my projects, and while not finished, definitely useable. I should probably release it at some point.

Yeah, dumping it as MUTF8 would work. Didn't think of that.

A display impl should also be easy enough. I'll take a look into that as well when I get a chance.

mainrs commented 4 years ago

Cool to hear that you came around and did follow the idea through to have some kind of byte-code analisys/manipulation library in Rust! IIRC objectasm has more of a visitor pattern to it compared to the approach we took. Does it feel better to use? I can only imagine that it still is "low-level" as you'd have to adjust stack sizes accordingly if you introduce new variables inside methods and stuff...

Not sure if you heard about https://github.com/SpongePowered/Mixin, it's like a ASM manipulation library on top of objectasm. But stuff like this is only possible in Java anyways...

Is the repository on GitHub (even private)? If you do not mind adding me as a maintainer I would love to take a look at it! Don't worry, I won't change stuff :D I do think you can even prevent me from pushing with a simple checkbox :)

Jezza commented 4 years ago

Yeah, objectasm does follow a visitor pattern, and works well enough to consider it.

The problem I had with a declarative structure was that to get to a specific point, you had to dig through all of the structures.

Granted, those structures weren't a "high-level" view.

But that's kind of my point. The visitor allowed me to hide a lot of the internals of the classfile immediately, instead of building a very low level representation, and then trying to hide it.

Whereas the visitor pattern objectasm implements allows me to jump straight to a high level,
and the only cost is dynamic dispatch.

It does feel better to use, but if I'm honest writing a new visitor is a bit annoying due to some of the constraints I've decided.

It's very similar to objectasm in that the framework itself can calculate the stackframes if requested. But you can also turn it off and do it yourself.

So, it's kind of a weird balance of being high enough that using it is pretty easy, but low enough, that I can generate very specific bytecode if I need.

I haven't heard of it. I haven't touched Minecraft related stuff for a long time.

Yeah, it's on github. It's currently private as it's still got a rough edge or two. There's definitely some things I want to implement before I consider it testable/viewable. But I'll definitely keep you in mind when I get to that stage.

Jezza commented 4 years ago

The code itself is a lot simpler. There's no magic proc macros. Actually, I haven't used a single macro thus far.

Writer: image

Reader: idea64_2020-06-08_12-26-57

mainrs commented 4 years ago

Nice, looks really good! I did play around with the idea to actually use serde, however I am not sure how ergonomic that would be for the end user. I'd have to settle down to fixed names for all the fields (class interface_count or access_flags) to ensure that the numbers are set correctly. But in theory the mapping should work and you'd even could change Opcodes as those would just be a serde struct as well, like a list of them inside of the Code attribute or something similar.

Changing Opcodes might be a little bit harder, as you'd have to find the right field and Opcodes yourself instead of relying on a visitor that just calls you when it hits the right spot. On the other hand, you can just build up a visitor or something similar on top of the deserialized structs.

I might try this approach just to see if it feels nice to use.

Jezza commented 4 years ago

Yeah, that was the idea behind class_file.

Build up a very low level view of the structures, and then rely on abstractions on top of that.

But that was the issue, the low level view wasn't nice to play around with.

There were a lot of "You need to handle if it's this case" or "To access this thing, you need to dig through a ton of fields".

The problem with the low level stuff was that it was also very low level, so a lot of the code was very similar. I found that serde didn't really offer any of the things I was trying to do. I also wanted to serialise it, and any of the crates that work with serde didn't work nicely. bincode did its own thing, etc.

I also found it a bit annoying that when I wrote code to read a structure, I would have to write code to also write the structure.

That's why I started work on binform.

Jezza commented 4 years ago

Sorry it took so long, but I finally got around to finishing it. All you need to do is update to 0.4.1.

I tested it with serde_json, and it seems to work without issue.

Give me a poke if it's not working correctly for some serialisers.