dtolnay opened this issue 5 years ago
assert_eq!(j, "{\"msg\":\"abc\nxyzz\"}"); // not a commitment in the API, but in practice this is what it will be
Is the extra z a mistake?
Is this code similar, just without the deserializing? (playground)
#[derive(Debug, PartialEq)]
struct S<'a> {
msg: &'a str,
}
fn main() {
let j = "{\"msg\":\"hello\\nworld\"}".to_owned();
let p = j.as_ptr();
let actual = S {
// or j.get_unchecked(8..20)
msg: unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(p.add(8), 12)) },
};
let expected = S {
msg: "hello\\nworld",
};
assert_eq!(actual, expected);
assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
}
It is not a mistake.
No, that code is wrong because the correct deserialized value would contain "hello\nworld", not "hello\\nworld".
Ah, so there is a reduction of \ in place. I did see that but was not sure if it's correct.
But the mistake I mean is the additional z. Why would xyz become xyzz?
xyz doesn't become xyzz. Instead nxyz becomes xyzz. The original z stays in place because anything else would require more movement of data in memory (4 bytes instead of 3 bytes).
The caller provides memory that has this as a subslice:
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
... | : | " | a | b | c | \ | n | x | y | z | " | } | ...
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
and JSON deserialization must shuffle bytes in some way to produce:
+-----+-----+-----+-----+-----+-----+-----+
... | a | b | c | \n | x | y | z | ...
+-----+-----+-----+-----+-----+-----+-----+
so as you can see, we don't need to modify any bytes outside of that range.
Ah, I thought we were going to shift all the bytes from the right, because the last z would be unused. Would that be useful? Imagine having a lot of newlines in a string; I know memory is cheap, but is there any use case for that?
I believe maybe something like this? (Playground)
#[derive(Debug, PartialEq)]
struct S<'a> {
msg: &'a str,
}
fn main() {
let mut j = "{\"msg\":\"abc\\nxyz\"}".to_owned();
let actual = S {
msg: unsafe {
let p = j.as_mut_ptr().add(12);
*p = b'\n';
std::ptr::copy(p, p.offset(-1), 4);
j.get_unchecked(8..15)
},
};
let expected = S { msg: "abc\nxyz" };
assert_eq!(actual, expected);
assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
assert_eq!(j, "{\"msg\":\"abc\nxyzz\"}");
}
shift all the bytes from the right, because the last z would be unused. Would that be useful? Imagine having a lot of newlines in a string; I know memory is cheap
No, we shouldn't move around more bytes than necessary. That's more work. It doesn't save memory. The whole input buffer needs to outlive the deserialized object anyway.
I think the right amount of copying in the example would be the equivalent of:
let p = j.as_mut_ptr();
*p.offset(11) = b'\n';
ptr::copy(p.offset(13), p.offset(12), 3);
j.get_unchecked(8..15)
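A self-contained, safe-Rust version of that copying, as an illustration rather than anything taken from serde_json, could look like:
// Standalone sketch of the in-place unescape described above, using safe
// slice operations instead of raw pointers.
fn main() {
    let mut buf = *b"{\"msg\":\"abc\\nxyz\"}";
    // The `\` is at index 11 and the `n` at index 12; the string contents end at index 15.
    buf[11] = b'\n';             // overwrite the `\` with a real newline
    buf.copy_within(13..16, 12); // shift `xyz` left by one byte
    // The unescaped value now occupies buf[8..15]; nothing outside the original
    // string contents was touched, so buf[15] keeps its stale `z`.
    assert_eq!(&buf[8..15], b"abc\nxyz");
    assert_eq!(&buf, b"{\"msg\":\"abc\nxyzz\"}");
}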
@dtolnay I don't think that matters; I believe saving one offset operation makes it faster? Maybe a benchmark will tell. (Of course, we could do it in two passes like simd-json to make it faster.)
Either way, how should one get started on this?
https://serde.rs/impl-deserializer.html would be the best place to start: modify that implementation to take &mut str and process escaped strings in place. Once the basics work, you could pull in some of the number parsing from serde_json that isn't in the example code and then put together some benchmarks.
@dtolnay I tried building on top of SliceRead from serde-json, in which the slice now becomes slice: &'a mut [u8], but I am stuck on a lifetime problem. It took me quite some time to figure out the logic through trial and error; it is quite confusing. When I uncomment https://github.com/pickfire/json/blob/544b50e6c6270abbdea6d1d93deffb0afd02987a/src/read.rs#L671-L672, it shows:
error[E0495]: cannot infer an appropriate lifetime for lifetime parameter in function call due to conflicting requirements
--> src/read.rs:671:37
|
671 | let borrowed = &self.slice[start..end];
| ^^^^^^^^^^^^^^^^^^^^^^
|
note: first, the lifetime cannot outlive the lifetime 's as defined on the method body at 634:24...
--> src/read.rs:634:24
|
634 | fn parse_str_bytes<'s, T: ?Sized, F>(
| ^^
note: ...so that reference does not outlive borrowed content
--> src/read.rs:671:37
|
671 | let borrowed = &self.slice[start..end];
| ^^^^^^^^^^
note: but, the lifetime must be valid for the lifetime 'a as defined on the impl at 595:6...
--> src/read.rs:595:6
|
595 | impl<'a> SliceMutRead<'a> {
| ^^
= note: ...so that the types are compatible:
expected std::result::Result<read::Reference<'a, 's, _>, _>
found std::result::Result<read::Reference<'_, '_, _>, _>
error: aborting due to previous error
Any idea to solve the lifetime problem?
Thanks to alice in https://users.rust-lang.org/t/help-about-multiple-mutable-borrows/33494/8?u=pickfire, I found that using transmute on the lifetime which cannot be inferred solves this issue. Regarding safety, I believe the reference is valid for as long as 'a, but I am not 100% sure.
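For reference, a stripped-down sketch of that pattern (my own illustration, not the code in the linked repo): the transmute widens the reborrow's lifetime back to the 'a of the underlying buffer, which is only sound if those bytes are not written again while the returned reference is alive.
struct SliceMutRead<'a> {
    slice: &'a mut [u8],
}

impl<'a> SliceMutRead<'a> {
    fn borrowed(&mut self, start: usize, end: usize) -> &'a [u8] {
        // The reborrow only gets the lifetime of `&mut self`...
        let short: &[u8] = &self.slice[start..end];
        // ...so transmute it back to 'a. SAFETY (assumption): the buffer lives
        // for 'a and this range is never mutated again through `self.slice`
        // while the returned reference exists.
        unsafe { std::mem::transmute::<&[u8], &'a [u8]>(short) }
    }
}

fn main() {
    let mut buf = *b"hello world";
    let mut read = SliceMutRead { slice: &mut buf };
    let word = read.borrowed(0, 5);
    assert_eq!(word, b"hello");
}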
@dtolnay I believe you can take a look at the prototype now at https://github.com/pickfire/json/commit/6bd0238645f6b880dc6770a5c127c5d340335734; it only implements a basic \n replacement for now.
I've only skimmed the code, but it looks like a good start.
This looks problematic though:
let mut s = "\"\\n\u{2603}\"".to_owned();
assert_eq!("\n\u{2603}", from_str::<String>(&s).unwrap());
let undefined_behavior = &[10, 10, 226, 152];
assert_eq!(undefined_behavior, from_mut_str::<&str>(&mut s).unwrap().as_bytes());
assert!(std::str::from_utf8(undefined_behavior).is_err());
For a valid JSON input containing no escape sequences other than \n, it gives back a &str containing bytes that are not valid utf8, which is undefined behavior.
Is that because I have yet to add the other escapes? https://github.com/pickfire/json/commit/6bd0238645f6b880dc6770a5c127c5d340335734#diff-f400f58d5e78694b93db5d4fe5f2a246R681
#[test]
fn test_from_mut_str_newline() {
#[derive(Debug, PartialEq, Deserialize)]
struct S<'a> {
msg: &'a str,
}
let mut j = "{\"msg\":\"\\n\"}".to_owned();
let actual = from_mut_str::<S>(&mut j).unwrap();
let expected = S { msg: "\n" };
assert_eq!(actual, expected);
assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
assert_eq!(j, "{\"msg\":\"\n\n\"}"); // not a commitment in the API, but in practice this is what it will be
}
Not for a single \n (that fails in mine too), but the above works; maybe just because I have not implemented the rest yet.
No I don't think that is the reason. The input string in my repro is "\n☃" which only contains one escape.
Yes, even just a single \n (1 char) fails, because I don't handle those yet, only \n (2 chars).
To clarify, my repro contains \ followed by n followed by ☃. There are no escape sequences other than \n, and there is no newline character (which isn't legal in a JSON string anyway). I believe this is exactly the case you are saying you already handle, but it is producing undefined behavior.
Heads up that I am going to step away from this issue for a while because I am swamped with other things at the moment, but hopefully Discord will be able to provide help if you link them to your repo.
@dtolnay I fixed the snowman bug; I forgot to skip the memory before the first \ and forgot to move the memory before the closing ". Can you please check again? I tested both ☃\n and \n☃, and also reduced it down to 3 unsafe lines (2 lines are duplicated).
@pickfire I don't think the snowman bug is fixed in revision 2e4b6b08650b41582430ca004b190b57e0a69c2e. Let me try to explain the issue a bit differently, maybe this will help you.
The issue at hand is that your code modifies the byte array of the String in a way that leaves the remaining bytes after the deserialization step no longer valid UTF-8. However, this is undefined behavior (as @dtolnay already mentioned), because the String type assumes that the stored bytes are always valid UTF-8.
For a valid JSON input containing no escape sequences other than \n, it gives back a &str containing bytes that are not valid utf8, which is undefined behavior.
The basic functionality of your code works. It can correctly deserialize a JSON string:
let mut s = r#""\n☃""#.to_owned();
assert_eq!("\n☃", from_str::<String>(&s).unwrap());
Let's see what the string looks like at the byte level:
Hex Dec Binary
// "
0x22 34 00100010
// \
0x5c 92 01011100
// n
0x6e 110 01101110
// ☃
0xe2 226 11100010
0x98 152 10011000
0x83 131 10000011
// "
0x22 34 00100010
Now, if we use the new method from_mut_str, we also see that the code parses the JSON correctly:
assert_eq!("\n☃", from_mut_str::<String>(&mut s).unwrap());
If we look at the bytes after the in situ deserialization step, they look like this:
Hex Dec Binary
// "
0x22 34 00100010
// \n
0x0a 10 00001010
// ☃
0xe2 226 11100010
0x98 152 10011000
0x83 131 10000011
// Invalid start of UTF-8 codepoint
0x83 131 10000011
// (Garbage) "
0x22 34 00100010
They start out valid and as expected, with the literal newline character and the remaining bytes shifted forwards. But after the snowman symbol (bytes 0xe2, 0x98, 0x83) we see another 0x83. This is not a valid start of a UTF-8 encoded codepoint. It starts with the two bits 10, which only ever occur as continuation bytes, but never as the start of an encoded codepoint. This now violates the String assumption that bytes are always valid UTF-8.
Here are two ways to show how this is problematic:
Normally, taking the bytes of a string and creating a string slice from it should always work. After all, the bytes must be valid UTF-8. If we do this here, we receive an error:
std::str::from_utf8(s.as_bytes())
// Results in:
// Err(Utf8Error { valid_up_to: 5, error_len: Some(1) })
Another way is iterating over the chars in s, like:
for c in s.chars() {
println!("{:?}", c);
}
// Output:
// '\"'
// '\n'
// '☃'
// 'â'
The last character could be anything or even some runtime error, since this is undefined behavior, and we do not see the last ", which should still be seen in the s string after the replacement.
Now how to fix this? After shifting the bytes forward due to the in situ replacement of an escape sequence, you need to make sure that the end bytes are still valid UTF-8. Overwriting the trailing bytes with 0-bytes should work for Rust:
std::str::from_utf8(&[0x22, 0x0a, 0xe2, 0x98, 0x83, 0, 0x22]) // <- the second 0x83 is replaced with 0
// Ok("\"\n☃\u{0}\"")
One difficulty which might arise is that the UTF-8 invariant also needs to be upheld if the deserialization code panics somewhere, such as in a user-written deserialization function.
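One possible way to uphold the invariant even across a panic, sketched here as an assumption rather than taken from the patch, is a small drop guard that zeroes the stale bytes when it goes out of scope, including during unwinding:
// A drop guard that zeroes a known stale range of the buffer, so the
// String's UTF-8 invariant is restored even if parsing panics.
struct ZeroStale<'a> {
    bytes: &'a mut [u8],
    stale_start: usize,
    stale_end: usize,
}

impl Drop for ZeroStale<'_> {
    fn drop(&mut self) {
        for b in &mut self.bytes[self.stale_start..self.stale_end] {
            *b = 0;
        }
    }
}

fn main() {
    let mut s = String::from("{\"msg\":\"a\\nb\"}");
    {
        // SAFETY: only ASCII bytes (or zeroes) are written below.
        let bytes = unsafe { s.as_bytes_mut() };
        // A real parser would track the stale range while processing escapes;
        // here we know up front that index 11 becomes stale.
        let guard = ZeroStale { bytes, stale_start: 11, stale_end: 12 };
        // ... in-place unescaping happens here and might panic ...
        guard.bytes[9] = b'\n';              // overwrite `\` with a real newline
        guard.bytes.copy_within(11..12, 10); // shift `b` left by one byte
    } // the guard drops here (even on unwind) and zeroes the stale byte
    assert!(std::str::from_utf8(s.as_bytes()).is_ok());
}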
As a guideline, you should never be forced to compare the bytes of the string in one of your tests, like you do for test_from_mut_str_newline_snowman. Instead, since the string needs to remain a valid string, you must always be able to compare it to a string literal, for example, like you do with test_from_mut_str_snowman.
I hope this explanation helps you to understand the issue at hand better. I am looking forward to seeing an in situ parser :)
@jonasbb Wow, thanks a lot for reviewing this. Nice.
The issue at hand is that your code modifies the byte array of the String in a way that leaves the remaining bytes after the deserialization step no longer valid UTF-8. However, this is undefined behavior (as @dtolnay already mentioned), because the String type assumes that the stored bytes are always valid UTF-8.
Now I know; I did not notice the "always". I thought keeping the remaining bytes was still fine based on the initial design.
Hex Dec Binary
How do you generate that? It is troublesome for me to compare byte by byte while remembering the numbers.
The last character could be anything or even some runtime error, since this is undefined behavior, and we do not see the last ", which should still be seen in the s string after the replacement.
Yup, I noticed the last 'â'.
As a guideline, you should never be forced to compare the bytes of the string in one of your tests, like you do for test_from_mut_str_newline_snowman. Instead, since the string needs to remain a valid string, you must always be able to compare it to a string literal, for example, like you do with test_from_mut_str_snowman.
The reason I did that is that I did not know how to represent it in valid UTF-8 (and it took me quite a while to figure out how to represent it in bytes); now I need to figure out how to zero the slice.
After shifting the bytes forward due to the in situ replacement of an escape sequence, you need to make sure that the end bytes are still valid UTF-8.
I was just thinking about that when I was halfway through reading; I had not thought about it, since I did not know that a String always needs to be valid UTF-8.
The issue at hand is that your code modifies the byte array of the String in a way that leaves the remaining bytes after the deserialization step no longer valid UTF-8. However, this is undefined behavior (as @dtolnay already mentioned), because the String type assumes that the stored bytes are always valid UTF-8.
Now I know; I did not notice the "always". I thought keeping the remaining bytes was still fine based on the initial design.
The original design in the first post only shows the ASCII subset of Unicode. Here, each codepoint is encoded in a single byte, and any shifting modifications will still leave it as valid ASCII. The problem occurs with characters outside of the ASCII subset, which require 2-4 bytes in UTF-8, since here chopping off the beginning or end of the encoded codepoint leads to invalid UTF-8.
Hex Dec Binary
How do you generate that? It is troublesome for me to compare byte by byte while remembering the numbers.
I generated the numbers with this snippet; the annotations with // are by hand.
for b in s.as_bytes() {
println!("0x{:0>2x} {:>3} {:0>8b}", b, b, b);
}
I updated the patch to zero-fill the extra bytes: https://github.com/pickfire/json/commit/e4f7b208ea22ff69169c3cbeed06032f0f9ca532
Feel free to check it. After that, I will get on with the other escapes ('\').
@jonasbb https://github.com/pickfire/json/commit/21379e3b39e8194dd1f16b3e08c8feea6862a3ab I added the rest, interested to check it out? I think only the tests, docs, and maybe some code cleanup are left.
I compared the benchmark performance recently. It looks like from_mut_str is somewhat faster, but not significantly so (~2%); maybe it needs some more tweaking. The _value benchmarks use serde_json::Value instead of Twitter.
running 2 tests (results from cfarm)
test bench_deserialize_from_mut_str ... bench: 5,414,787 ns/iter (+/- 40,631) = 116 MB/s
test bench_deserialize_from_mut_str_value ... bench: 16,141,467 ns/iter (+/- 70,416) = 39 MB/s
test bench_deserialize_from_str ... bench: 5,820,357 ns/iter (+/- 497,791) = 108 MB/s
test bench_deserialize_from_str_value ... bench: 15,994,963 ns/iter (+/- 405,180) = 39 MB/s
#[bench]
fn bench_deserialize_from_str(b: &mut Bencher) {
let j = input_json();
b.bytes = j.len() as u64;
b.iter(|| {
serde_json::from_str::<Twitter>(&j).unwrap();
});
}
#[bench]
fn bench_deserialize_from_mut_str(b: &mut Bencher) {
let j = input_json();
b.bytes = j.len() as u64;
b.iter(|| {
let mut j = j.clone();
serde_json::from_mut_str::<Twitter>(&mut j).unwrap();
});
}
Edit: Running locally for the first time got me different results. Weird. (I took the best case for mut over 3 runs.)
test bench_deserialize_from_mut_str ... bench: 2,006,747 ns/iter (+/- 89,923) = 314 MB/s
test bench_deserialize_from_mut_str_value ... bench: 4,358,211 ns/iter (+/- 1,145,037) = 144 MB/s
test bench_deserialize_from_str ... bench: 2,357,205 ns/iter (+/- 64,868) = 267 MB/s
test bench_deserialize_from_str_value ... bench: 5,354,086 ns/iter (+/- 233,092) = 117 MB/s
@dtolnay The current implementation is unsound in that it has a footgun: it will have undefined behavior if people try to deserialize the same string twice (there is probably no reason to do that, but it is troublesome once they do). Should we make the function unsafe because of this? Is there any way to take ownership of the string and prevent people from calling the function twice?
Can you share some compilable sample code that demonstrates the undefined behavior?
Just do the same thing twice. The code below is not tested but shows the rough idea.
use serde::Deserialize;
#[derive(Deserialize, Debug)]
struct User {
fingerprint: String,
location: String,
}
fn main() {
// `j` is a `String`; `&mut j` coerces to `&mut str`
let mut j = "
{
\"fingerprint\": \"0xF9BA143B95FF6D82\",
\"location\": \"Menlo Park, CA\"
}".to_owned();
let u: User = serde_json::from_mut_str(&mut j).unwrap();
let u: User = serde_json::from_mut_str(&mut j).unwrap(); // <-- this
println!("{:#?}", u);
}
But you can see the difference between from_str and from_mut_str in https://github.com/serde-rs/json/pull/627/files#diff-afb2ab3fe17077dcacb9d8b9f1546945R14-R50, where I had issues when writing the benchmarks with reusing the same data for deserialization. Just remove the clone in https://github.com/serde-rs/json/pull/627/files#diff-afb2ab3fe17077dcacb9d8b9f1546945R37 and it will error out.
I don't understand why that would lead to undefined behavior. The second call would probably return an error because the string is no longer valid JSON, but that's fine.
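For example (just an illustration, not code from the PR), the leftover buffer from the earlier example contains a raw newline inside the string, which serde_json rejects as an unescaped control character:
fn main() {
    // What the buffer looks like after the first in-place pass.
    let mutated = "{\"msg\":\"abc\nxyzz\"}";
    let err = serde_json::from_str::<serde_json::Value>(mutated).unwrap_err();
    // Prints a "control character ... while parsing a string" style error.
    println!("{}", err);
}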
Wait, I meant an error (now I am wondering if that issue happened before I padded the extra data with 0 to prevent invalid Unicode characters), but yes, it may still be unexpected.
Potentially returning errors wouldn't be a reason to make a function unsafe.
@dtolnay Thanks for the review. But how should I start with a separate crate? Do I just implement the same things I did for the serde-json crate?
Originally filed as https://github.com/serde-rs/json/issues/318, but since this will involve quite a bit more unsafe code than currently exists in serde_json, I think it could be better to put it in a separate crate. Much of the implementation can be drawn from serde_json.
The idea is to implement a function taking &mut str (the from_mut_str discussed above), backed by a Serde Deserializer that knows how to unescape JSON strings in place. In the example, the incoming bytes in memory would contain a subslice like the one in the diagrams earlier in this thread, and JSON deserialization shuffles them to lay out the correct output string bytes in place while modifying the minimum amount of memory.