cloudwego / sonic-rs

A fast Rust JSON library based on SIMD.
https://crates.io/crates/sonic-rs
Apache License 2.0
487 stars 31 forks source link

feat: `get_many` returns `Vec<Rc<Result<..>>>` #108

Closed CPunisher closed 2 months ago

CPunisher commented 2 months ago

I try to make get_many return Vec<Result<LazyValue>> instead of Result<Vector<LazyValue>>. I think the purpose is to generate separate errors for each query path rather than one overall error.

Now everything seems ok, but there are still some problems and I'm hoping for some suggestions:

  1. It's hard to return Vec<Result<LazyValue>> directly. As my code shows, a solution is replacing every element in out with the same error.

    1. get_many has to validate utf8 input. This is a validation for the whole input.
    2. We should handle the result of top level self.get_many_rec call in get_many. https://github.com/cloudwego/sonic-rs/blob/99b37cd1ff4dfdf5b9ee8b45c91f1403f7c0257a/src/parser.rs#L2119
  2. I wrap Result<LazyValue<'_>> in Rc: Motivation: We need to clone the LazyValue for all possible orders. However, Result is not clonable since Error is not clonable. Implement Clone for Error is either impossible, because the inner std::io::Error is not clonable. https://github.com/cloudwego/sonic-rs/blob/99b37cd1ff4dfdf5b9ee8b45c91f1403f7c0257a/src/parser.rs#L1859 Pros: Performance is optimized by avoiding lv.clone() Cons: It's hard to use the result. As the example shows, we have to write many[0].as_ref().as_ref().unwrap().as_raw_str() compared to the original many[0].as_raw_str()

Any suggestion?

liuq19 commented 2 months ago

thanks a lot. we could support get_by_schema here, and the get_many API is hard to use

CPunisher commented 2 months ago

thanks a lot. we could support get_by_schema here, and the get_many API is hard to use

Is get_by_schema a new API to replace get_many ?

liuq19 commented 2 months ago

yeah, like this https://github.com/bytedance/sonic-cpp/pull/85

liuq19 commented 2 months ago

Would you be interested?

CPunisher commented 2 months ago

Would you be interested?

I see. I will investigate and have a try.

CPunisher commented 2 months ago

Would you be interested?

Hello, after investigating, running and debugging schema parsing API on sonic_cpp, I'm very confused about the specification of it (what does it do and how does it replace get_many).

Could you please provide any extra information?

liuq19 commented 2 months ago

as follows: This RFC is not yet mature, we can discuss some corner cases more and update it.

compare with get_many(now): pros: easy to write, and the schema can convert from any Serde type, for example, we can use to_value to convert a map into a schema. cons: cannot represent the index of an array

use sonic_rs::json;

fn get_by_schema(data: &str, mut schema: sonic_rs::Value) -> sonic_rs::Value {
    todo!()
}

fn main() {
    // use a dynamic value to reperesent the JSON schem, now we use `sonic_rs::Value`
    let mut schema = json!({
        "a": null, // default value is `null`
        "b": {
            "b1": {},
            "b2": "default string" // default value is string
        },
        "c": [], // default value is []
    });

    let data = r#"
        {
            "a": {},
            "b": {
                "b1": 123,
            },
            "c": [1, 2, 3],
            "d": "balabala..."
        }
    "#;

    // parse json data by schem, we can parse into the schema value inplace
    let got: sonic_rs::Value = get_by_schema(data, schema);
    assert_eq!(
        got,
        json!({
            "a": {},
            "b": {
                "b1": 123,
                "b2": "default string"
            },
            "c": [1, 2, 3]
        })
    );
}
CPunisher commented 2 months ago

as follows: This RFC is not yet mature, we can discuss some corner cases more and update it.

compare with get_many(now): pros: easy to write, and the schema can convert from any Serde type, for example, we can use to_value to convert a map into a schema. cons: cannot represent the index of an array

use sonic_rs::json;

fn get_by_schema(data: &str, mut schema: sonic_rs::Value) -> sonic_rs::Value {
    todo!()
}

fn main() {
    // use a dynamic value to reperesent the JSON schem, now we use `sonic_rs::Value`
    let mut schema = json!({
        "a": null, // default value is `null`
        "b": {
            "b1": {},
            "b2": "default string" // default value is string
        },
        "c": [], // default value is []
    });

    let data = r#"
        {
            "a": {},
            "b": {
                "b1": 123,
            },
            "c": [1, 2, 3],
            "d": "balabala..."
        }
    "#;

    // parse json data by schem, we can parse into the schema value inplace
    let got: sonic_rs::Value = get_by_schema(data, schema);
    assert_eq!(
        got,
        json!({
            "a": {},
            "b": {
                "b1": 123,
                "b2": "default string"
            },
            "c": [1, 2, 3]
        })
    );
}

According to your example, I can know:

  1. If a key path, for example ["b", "b1"] exists in the schema, but does not exist in the data, then the data will be filled with the corresponding default value in the schema.
  2. If a key path does'not exists in the schema, but exists in the data, then the key value pair will be removed from the data.

Also, I find the cpp version respects orders of fields. This could be one of my most confused points. For example:

  std::string json_schema = R"(
  {"obj":1}
  )";
  std::string json = R"(
{"it":1, "obj":{"a":{"b":1}, "b":[1]}}
  )";
// output: {"obj":{"a":{"b":1},"b":[1]}}
  std::string json_schema = R"(
  {"obj":1}
  )";

  std::string json = R"(
{ "obj":{"a":{"b":1}, "it":1,"b":[1]}}
  )";
// output: {"obj":{"a":{"b":1},"it":1,"b":[1]}}
  std::string json_schema = R"(
  {"it":1, "c":[1], "obj":{"a":{"b":1}}}
  )";

  std::string json = R"(
  {"it":1, "b":[1], "obj":{"a":{"b":1}}}
  )";
// output: {"it":1,"c":[1],"obj":{"a":{"b":1}}}

Are these all expected behaviors?

It seems that schemas are designed to make some transformation to the data. But the document shows that get_many is used for lazily getting many nested values by keys. How can schema do that?

liuq19 commented 2 months ago

The behavior is right.

The get_many must construct a PointerTree through multiple PointerPath. The tree can be also built into a schema, and we also can get all the targeted fields from a json by the schema.

If using get_by_schema, the user can also get the target fields through the key index, such as root["a"]["b"], which will not need to care about the index of PointerPath.

Thus, get_by_schema is easier to get_many. and the hard case is how to deal with the array index.