dtolnay opened this issue 5 years ago
assert_eq!(j, "{\"msg\":\"abc\nxyzz\"}"); // not a commitment in the API, but in practice this is what it will be
Is the extra z a mistake?
Is this code similar, just without the deserializing? (playground)
#[derive(Debug, PartialEq)]
struct S<'a> {
msg: &'a str,
}
fn main() {
let j = "{\"msg\":\"hello\\nworld\"}".to_owned();
let p = j.as_ptr();
let actual = S {
// or j.get_unchecked(8..20)
msg: unsafe { std::str::from_utf8_unchecked(std::slice::from_raw_parts(p.add(8), 12)) },
};
let expected = S {
msg: "hello\\nworld",
};
assert_eq!(actual, expected);
assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
}
It is not a mistake.
No, that code is wrong because the correct deserialized value would contain "hello\nworld", not "hello\\nworld".
Ah, so there is a reduction of \ in place. I did see that but was not sure if it's correct.
But the mistake I mean is the additional z. Why would xyz become xyzz?
xyz doesn't become xyzz. Instead nxyz becomes xyzz. The original z stays in place because anything else would require more movement of data in memory (4 bytes instead of 3 bytes).
The caller provides memory that has this as a subslice:
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
... | : | " | a | b | c | \ | n | x | y | z | " | } | ...
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
and JSON deserialization must shuffle bytes in some way to produce:
+-----+-----+-----+-----+-----+-----+-----+
... | a | b | c | \n | x | y | z | ...
+-----+-----+-----+-----+-----+-----+-----+
so as you can see, we don't need to modify any bytes outside of that range.
Ah, I thought we were going to shift all the bytes from the right, because the last z would be unused. Would that be useful? Imagine having a lot of newlines in a string; I know memory is cheap, but is there any use case for that?
I believe maybe something like this? (Playground)
#[derive(Debug, PartialEq)]
struct S<'a> {
msg: &'a str,
}
fn main() {
let mut j = "{\"msg\":\"abc\\nxyz\"}".to_owned();
let actual = S {
msg: unsafe {
let p = j.as_mut_ptr().add(12);
*p = b'\n';
std::ptr::copy(p, p.offset(-1), 4);
j.get_unchecked(8..15)
},
};
let expected = S { msg: "abc\nxyz" };
assert_eq!(actual, expected);
assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
assert_eq!(j, "{\"msg\":\"abc\nxyzz\"}");
}
shift all the bytes from the right, because the last z would be unused. Would that be useful? Imagine having a lot of newlines in a string; I know memory is cheap
No, we shouldn't move around more bytes than necessary. That's more work. It doesn't save memory. The whole input buffer needs to outlive the deserialized object anyway.
I think the right amount of copying in the example would be the equivalent of:
let p = j.as_mut_ptr();
*p.offset(11) = b'\n';
ptr::copy(p.offset(13), p.offset(12), 3);
j.get_unchecked(8..15)
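A self-contained, safe-Rust version of that copying, as an illustration rather than anything taken from serde_json, could look like:
// Standalone sketch of the in-place unescape described above, using safe
// slice operations instead of raw pointers.
fn main() {
    let mut buf = *b"{\"msg\":\"abc\\nxyz\"}";
    // The `\` is at index 11 and the `n` at index 12; the string contents end at index 15.
    buf[11] = b'\n';             // overwrite the `\` with a real newline
    buf.copy_within(13..16, 12); // shift `xyz` left by one byte
    // The unescaped value now occupies buf[8..15]; nothing outside the original
    // string contents was touched, so buf[15] keeps its stale `z`.
    assert_eq!(&buf[8..15], b"abc\nxyz");
    assert_eq!(&buf, b"{\"msg\":\"abc\nxyzz\"}");
}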
@dtolnay I don't think that matters; I believe saving one offset operation makes it faster? Maybe a benchmark will tell. (Of course, we could do it in two passes like simd-json to make it faster.)
Either way, how should one get started on this?
https://serde.rs/impl-deserializer.html would be the best place to start: modify that implementation to take &mut str and process escaped strings in place. Once the basics work, you could pull in some of the number parsing from serde_json that isn't in the example code and then put together some benchmarks.
@dtolnay I tried building on top of SliceRead from serde-json, in which the slice now becomes slice: &'a mut [u8], but I am stuck on a lifetime problem. It took me quite some time to figure out the logic through trial and error; it is quite confusing. When I uncomment https://github.com/pickfire/json/blob/544b50e6c6270abbdea6d1d93deffb0afd02987a/src/read.rs#L671-L672, it shows:
error[E0495]: cannot infer an appropriate lifetime for lifetime parameter in function call due to conflicting requirements
--> src/read.rs:671:37
|
671 | let borrowed = &self.slice[start..end];
| ^^^^^^^^^^^^^^^^^^^^^^
|
note: first, the lifetime cannot outlive the lifetime 's as defined on the method body at 634:24...
--> src/read.rs:634:24
|
634 | fn parse_str_bytes<'s, T: ?Sized, F>(
| ^^
note: ...so that reference does not outlive borrowed content
--> src/read.rs:671:37
|
671 | let borrowed = &self.slice[start..end];
| ^^^^^^^^^^
note: but, the lifetime must be valid for the lifetime 'a as defined on the impl at 595:6...
--> src/read.rs:595:6
|
595 | impl<'a> SliceMutRead<'a> {
| ^^
= note: ...so that the types are compatible:
expected std::result::Result<read::Reference<'a, 's, _>, _>
found std::result::Result<read::Reference<'_, '_, _>, _>
error: aborting due to previous error
Any idea to solve the lifetime problem?
Thanks to alice in https://users.rust-lang.org/t/help-about-multiple-mutable-borrows/33494/8?u=pickfire, I found that using transmute on the lifetime which cannot be inferred solves this issue. Regarding safety, I believe the reference is valid for as long as 'a, but I am not 100% sure.
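For reference, a stripped-down sketch of that pattern (my own illustration, not the code in the linked repo): the transmute widens the reborrow's lifetime back to the 'a of the underlying buffer, which is only sound if those bytes are not written again while the returned reference is alive.
struct SliceMutRead<'a> {
    slice: &'a mut [u8],
}

impl<'a> SliceMutRead<'a> {
    fn borrowed(&mut self, start: usize, end: usize) -> &'a [u8] {
        // The reborrow only gets the lifetime of `&mut self`...
        let short: &[u8] = &self.slice[start..end];
        // ...so transmute it back to 'a. SAFETY (assumption): the buffer lives
        // for 'a and this range is never mutated again through `self.slice`
        // while the returned reference exists.
        unsafe { std::mem::transmute::<&[u8], &'a [u8]>(short) }
    }
}

fn main() {
    let mut buf = *b"hello world";
    let mut read = SliceMutRead { slice: &mut buf };
    let word = read.borrowed(0, 5);
    assert_eq!(word, b"hello");
}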
@dtolnay I believe you can take a look at the prototype now at https://github.com/pickfire/json/commit/6bd0238645f6b880dc6770a5c127c5d340335734; it only implements a basic \n replacement for now.
I've only skimmed the code, but it looks like a good start.
This looks problematic though:
let mut s = "\"\\n\u{2603}\"".to_owned();
assert_eq!("\n\u{2603}", from_str::<String>(&s).unwrap());
let undefined_behavior = &[10, 10, 226, 152];
assert_eq!(undefined_behavior, from_mut_str::<&str>(&mut s).unwrap().as_bytes());
assert!(std::str::from_utf8(undefined_behavior).is_err());
For a valid JSON input containing no escape sequences other than \n, it gives back a &str containing bytes that are not valid utf8, which is undefined behavior.
Is that because I have yet to add the other escapes? https://github.com/pickfire/json/commit/6bd0238645f6b880dc6770a5c127c5d340335734#diff-f400f58d5e78694b93db5d4fe5f2a246R681
#[test]
fn test_from_mut_str_newline() {
#[derive(Debug, PartialEq, Deserialize)]
struct S<'a> {
msg: &'a str,
}
let mut j = "{\"msg\":\"\\n\"}".to_owned();
let actual = from_mut_str::<S>(&mut j).unwrap();
let expected = S { msg: "\n" };
assert_eq!(actual, expected);
assert_eq!(actual.msg.as_ptr(), j[8..].as_ptr());
assert_eq!(j, "{\"msg\":\"\n\n\"}"); // not a commitment in the API, but in practice this is what it will be
}
Not for a single \n (that fails in mine too), but the above works; maybe just because I have not implemented the rest yet.
No I don't think that is the reason. The input string in my repro is "\n☃" which only contains one escape.
Yes, even just a single \n (1 char) fails, because I don't handle those yet, only \n (2 chars).
To clarify, my repro contains \ followed by n followed by ☃. There are no escape sequences other than \n, and there is no newline character (which isn't legal in a JSON string anyway). I believe this is exactly the case you are saying you already handle, but it is producing undefined behavior.
Heads up that I am going to step away from this issue for a while because I am swamped with other things at the moment, but hopefully Discord will be able to provide help if you link them to your repo.
@dtolnay I fixed the snowman bug; I forgot to skip the memory before the first \ and forgot to move the memory before the closing ". Can you please check again? I tested both ☃\n and \n☃, and also reduced it down to 3 unsafe lines (2 lines are duplicated).
@pickfire I don't think the snowman bug is fixed in revision 2e4b6b08650b41582430ca004b190b57e0a69c2e. Let me try to explain the issue a bit differently, maybe this will help you.
The issue at hand is that your code modifies the byte array of the String in a way that leaves the remaining bytes after the deserialization step no longer valid UTF-8. However, this is undefined behavior (as @dtolnay already mentioned), because the String type assumes that the stored bytes are always valid UTF-8.
For a valid JSON input containing no escape sequences other than \n, it gives back a &str containing bytes that are not valid utf8, which is undefined behavior.
The basic functionality of your code works. It can correctly deserialize a JSON string:
let mut s = r#""\n☃""#.to_owned();
assert_eq!("\n☃", from_str::<String>(&s).unwrap());
Let's see what the string looks like at the byte level:
Hex Dec Binary
// "
0x22 34 00100010
// \
0x5c 92 01011100
// n
0x6e 110 01101110
// ☃
0xe2 226 11100010
0x98 152 10011000
0x83 131 10000011
// "
0x22 34 00100010
Now, if we use the new method from_mut_str, we also see that the code parses the JSON correctly:
assert_eq!("\n☃", from_mut_str::<String>(&mut s).unwrap());
If we look at the bytes after the in situ deserialization step, they look like this:
Hex Dec Binary
// "
0x22 34 00100010
// \n
0x0a 10 00001010
// ☃
0xe2 226 11100010
0x98 152 10011000
0x83 131 10000011
// Invalid start of UTF-8 codepoint
0x83 131 10000011
// (Garbage) "
0x22 34 00100010
They start out valid and as expected, with the literal newline character and the remaining bytes shifted forwards. But after the snowman symbol (bytes 0xe2, 0x98, 0x83) we see another 0x83. This is not a valid start of a UTF-8 encoded codepoint. It starts with the two bits 10, which only ever occur as continuation bytes, but never as the start of an encoded codepoint. This now violates the String assumption that bytes are always valid UTF-8.
Here are two ways to show how this is problematic:
Normally, taking the bytes of a string and creating a string slice from it should always work. After all, the bytes must be valid UTF-8. If we do this here, we receive an error:
std::str::from_utf8(s.as_bytes())
// Results in:
// Err(Utf8Error { valid_up_to: 5, error_len: Some(1) })
Another way is iterating over the chars in s, like:
for c in s.chars() {
println!("{:?}", c);
}
// Output:
// '\"'
// '\n'
// '☃'
// 'â'
The last character could be anything or even some runtime error, since this is undefined behavior, and we do not see the last ", which should still be seen in the s string after the replacement.
Now how to fix this? After shifting the bytes forward due to the in situ replacement of an escape sequence, you need to make sure that the end bytes are still valid UTF-8. Overwriting the trailing bytes with 0-bytes should work for Rust:
std::str::from_utf8(&[0x22, 0x0a, 0xe2, 0x98, 0x83, 0, 0x22]) // <- the second 0x83 is replaced with 0
// Ok("\"\n☃\u{0}\"")
One difficulty which might arise is that the UTF-8 invariant also needs to be upheld if the deserialization code panics somewhere, such as in a user-written deserialization function.
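One possible way to uphold the invariant even across a panic, sketched here as an assumption rather than taken from the patch, is a small drop guard that zeroes the stale bytes when it goes out of scope, including during unwinding:
// A drop guard that zeroes a known stale range of the buffer, so the
// String's UTF-8 invariant is restored even if parsing panics.
struct ZeroStale<'a> {
    bytes: &'a mut [u8],
    stale_start: usize,
    stale_end: usize,
}

impl Drop for ZeroStale<'_> {
    fn drop(&mut self) {
        for b in &mut self.bytes[self.stale_start..self.stale_end] {
            *b = 0;
        }
    }
}

fn main() {
    let mut s = String::from("{\"msg\":\"a\\nb\"}");
    {
        // SAFETY: only ASCII bytes (or zeroes) are written below.
        let bytes = unsafe { s.as_bytes_mut() };
        // A real parser would track the stale range while processing escapes;
        // here we know up front that index 11 becomes stale.
        let guard = ZeroStale { bytes, stale_start: 11, stale_end: 12 };
        // ... in-place unescaping happens here and might panic ...
        guard.bytes[9] = b'\n';              // overwrite `\` with a real newline
        guard.bytes.copy_within(11..12, 10); // shift `b` left by one byte
    } // the guard drops here (even on unwind) and zeroes the stale byte
    assert!(std::str::from_utf8(s.as_bytes()).is_ok());
}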
As a guideline, you should never be forced to compare the bytes of the string in one of your tests, like you do for test_from_mut_str_newline_snowman. Instead, since the string needs to remain a valid string, you must always be able to compare it to a string literal, for example, like you do with test_from_mut_str_snowman.
I hope this explanation helps you to understand the issue at hand better. I am looking forward to seeing an in situ parser :)
@jonasbb Wow, thanks a lot for reviewing this. Nice.
The issue at hand is that your code modifies the byte array of the String in a way that leaves the remaining bytes after the deserialization step no longer valid UTF-8. However, this is undefined behavior (as @dtolnay already mentioned), because the String type assumes that the stored bytes are always valid UTF-8.
Now I know; I did not notice the "always". I thought keeping the remaining bytes was still fine based on the initial design.
Hex Dec Binary
How do you generate that? It is troublesome for me to compare byte by byte while remembering the numbers.
The last character could be anything or even some runtime error, since this is undefined behavior, and we do not see the last ", which should still be seen in the s string after the replacement.
Yup, I noticed the last 'â'.
As a guideline, you should never be forced to compare the bytes of the string in one of your tests, like you do for test_from_mut_str_newline_snowman. Instead, since the string needs to remain a valid string, you must always be able to compare it to a string literal, for example, like you do with test_from_mut_str_snowman.
The reason I did that is that I did not know how to represent it in valid UTF-8 (and it took me quite a while to figure out how to represent it in bytes); now I need to figure out how to zero the slice.
After shifting the bytes forward due to the in situ replacement of an escape sequence, you need to make sure that the end bytes are still valid UTF-8.
I was just thinking about that when I was halfway through reading; I had not thought about it, since I did not know that a String always needs to be valid UTF-8.
The issue at hand is that your code modifies the byte array of the String in a way that leaves the remaining bytes after the deserialization step no longer valid UTF-8. However, this is undefined behavior (as @dtolnay already mentioned), because the String type assumes that the stored bytes are always valid UTF-8.
Now I know; I did not notice the "always". I thought keeping the remaining bytes was still fine based on the initial design.
The original design in the first post only shows the ASCII subset of Unicode. Here, each codepoint is encoded in a single byte, and any shifting modifications will still leave it as valid ASCII. The problem occurs with characters outside of the ASCII subset, which require 2-4 bytes in UTF-8, since here chopping off the beginning or end of the encoded codepoint leads to invalid UTF-8.
Hex Dec Binary
How do you generate that? It is troublesome for me to compare byte by byte while remembering the numbers.
I generated the numbers with this snippet; the annotations with // are by hand.
for b in s.as_bytes() {
println!("0x{:0>2x} {:>3} {:0>8b}", b, b, b);
}
I updated the patch to zero-fill the extra bytes: https://github.com/pickfire/json/commit/e4f7b208ea22ff69169c3cbeed06032f0f9ca532
Feel free to check it. After that, I will get on with the other escapes ('\').
@jonasbb https://github.com/pickfire/json/commit/21379e3b39e8194dd1f16b3e08c8feea6862a3ab I added the rest, interested to check it out? I think only the tests, docs, and maybe some code cleanup are left.
I compared the benchmark performance recently. It looks like from_mut_str is somewhat faster, but not significantly so (~2%); maybe it needs some more tweaking. The _value benchmarks use serde_json::Value instead of Twitter.
running 2 tests (results from cfarm)
test bench_deserialize_from_mut_str ... bench: 5,414,787 ns/iter (+/- 40,631) = 116 MB/s
test bench_deserialize_from_mut_str_value ... bench: 16,141,467 ns/iter (+/- 70,416) = 39 MB/s
test bench_deserialize_from_str ... bench: 5,820,357 ns/iter (+/- 497,791) = 108 MB/s
test bench_deserialize_from_str_value ... bench: 15,994,963 ns/iter (+/- 405,180) = 39 MB/s
#[bench]
fn bench_deserialize_from_str(b: &mut Bencher) {
let j = input_json();
b.bytes = j.len() as u64;
b.iter(|| {
serde_json::from_str::<Twitter>(&j).unwrap();
});
}
#[bench]
fn bench_deserialize_from_mut_str(b: &mut Bencher) {
let j = input_json();
b.bytes = j.len() as u64;
b.iter(|| {
let mut j = j.clone();
serde_json::from_mut_str::<Twitter>(&mut j).unwrap();
});
}
Edit: Running locally for the first time got me different results. Weird. (I took the best case for mut over 3 runs.)
test bench_deserialize_from_mut_str ... bench: 2,006,747 ns/iter (+/- 89,923) = 314 MB/s
test bench_deserialize_from_mut_str_value ... bench: 4,358,211 ns/iter (+/- 1,145,037) = 144 MB/s
test bench_deserialize_from_str ... bench: 2,357,205 ns/iter (+/- 64,868) = 267 MB/s
test bench_deserialize_from_str_value ... bench: 5,354,086 ns/iter (+/- 233,092) = 117 MB/s
@dtolnay The current implementation is unsound in that it has a footgun: it will have undefined behavior if people try to deserialize the same string twice (there is probably no reason to do that, but it is troublesome once they do). Should we make the function unsafe because of this? Is there any way to take ownership of the string and prevent people from calling the function twice?
Can you share some compilable sample code that demonstrates the undefined behavior?
Just do the same thing twice. The code below is not tested but shows the rough idea.
use serde::Deserialize;
#[derive(Deserialize, Debug)]
struct User {
fingerprint: String,
location: String,
}
fn main() {
// `j` is a `String`; `&mut j` coerces to `&mut str`
let mut j = "
{
\"fingerprint\": \"0xF9BA143B95FF6D82\",
\"location\": \"Menlo Park, CA\"
}".to_owned();
let u: User = serde_json::from_mut_str(&mut j).unwrap();
let u: User = serde_json::from_mut_str(&mut j).unwrap(); // <-- this
println!("{:#?}", u);
}
But you can see the difference between from_str and from_mut_str in https://github.com/serde-rs/json/pull/627/files#diff-afb2ab3fe17077dcacb9d8b9f1546945R14-R50, where I had issues when writing the benchmarks with reusing the same data for deserialization. Just remove the clone in https://github.com/serde-rs/json/pull/627/files#diff-afb2ab3fe17077dcacb9d8b9f1546945R37 and it will error out.
I don't understand why that would lead to undefined behavior. The second call would probably return an error because the string is no longer valid JSON, but that's fine.
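For example (just an illustration, not code from the PR), the leftover buffer from the earlier example contains a raw newline inside the string, which serde_json rejects as an unescaped control character:
fn main() {
    // What the buffer looks like after the first in-place pass.
    let mutated = "{\"msg\":\"abc\nxyzz\"}";
    let err = serde_json::from_str::<serde_json::Value>(mutated).unwrap_err();
    // Prints a "control character ... while parsing a string" style error.
    println!("{}", err);
}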
Wait, I meant an error (now I am wondering if that issue happened before I padded the extra data with 0 to prevent invalid Unicode characters), but yes, it may still be unexpected.
Potentially returning errors wouldn't be a reason to make a function unsafe.
@dtolnay Thanks for the review. But how should I start with a separate crate? Do I just implement the same things I did for the serde-json crate?
Originally filed as https://github.com/serde-rs/json/issues/318, but since this will involve quite a bit more unsafe code than currently exists in serde_json, I think it could be better to put it in a separate crate. Much of the implementation can be drawn from serde_json.
The idea is to implement a function taking &mut str (the from_mut_str discussed above), backed by a Serde Deserializer that knows how to unescape JSON strings in place. In the example, the incoming bytes in memory would contain a subslice like the one in the diagrams earlier in this thread, and JSON deserialization shuffles them to lay out the correct output string bytes in place while modifying the minimum amount of memory.