biojppm / rapidyaml

Rapid YAML - a library to parse and emit YAML, and do it fast.
MIT License
579 stars 99 forks source link

serialization/deserialization behavior customization #442

Closed pkerichang closed 4 months ago

pkerichang commented 4 months ago

Hello,

I just found out about this library today and I was just trying things out. Overall I was able to get things working, however, there are some serialization/deserialization behaviors I would like to customize, but couldn't figure out how from the documentation.

For reference, my code is effectively doing the following:

// deserialization
auto content = c4::to_csubstr(data);  // convert from std::string_view to c4::csubstr
T obj;  // T is a scalar type like int, char, double, etc.
auto tree = ryml::parse_in_arena(content);
tree.rootref() >> obj;

// serialization
ryml::Tree out_tree;
out_tree.rootref() << obj;

std::ostream ostream(&buf);  // buf is a boost IO stream buffer that writes to a std::string
ostream << out_tree;

By trying various scalar values, what I noticed is that:

I was wondering if I can tweak the parsing behavior, so that:

Is it possible to customize the library with these changes, or do I need to use some adaptor/intermediate custom type tricks to do these customizations? If I do need to use adaptors/custom types, is there some examples that I can look at? Thanks!

biojppm commented 4 months ago

How are you serializing/deserializing your type T?

pkerichang commented 4 months ago

I am serializing/deserializing with the exact code I posted. It is a template method, then I call it with T=int, double, chat, etc , with just the input strings containing the corresponding types

biojppm commented 4 months ago

Thanks for reporting. I was able to reproduce with the following test code:

template<class T>
void test442(csubstr input, csubstr expected)
{
    // as a seq member
    {
        const std::string input_str = formatrs<std::string>("- {}", input);
        const std::string expected_str = formatrs<std::string>("- {}", expected);
        const Tree tree = parse_in_arena(to_csubstr(input_str));
        T obj;  // T is a scalar type like int, char, double, etc.
        tree[0] >> obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(tree));
        Tree out_tree;
        out_tree.rootref() |= SEQ;
        out_tree[0] << obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(out_tree));
    }
    // as a map member
    {
        const std::string input_str = formatrs<std::string>("val: {}", input);
        const std::string expected_str = formatrs<std::string>("val: {}", expected);
        const Tree tree = parse_in_arena(to_csubstr(input_str));
        T obj;  // T is a scalar type like int, char, double, etc.
        tree["val"] >> obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(tree));
        Tree out_tree;
        out_tree.rootref() |= MAP;
        out_tree["val"] << obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(out_tree));
    }
    // as a doc scalar
    {
        const std::string expected_str(expected.str, expected.len);
        const Tree tree = parse_in_arena(input);
        T obj;  // T is a scalar type like int, char, double, etc.
        tree.rootref() >> obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(tree));
        Tree out_tree;
        out_tree.rootref() << obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(out_tree));
    }
}
TEST(serialize, issue442_00)
{
    test442<int>("123", "123\n");
}
TEST(serialize, issue442_01)
{
    test442<int>("-123", "-123\n");
}
TEST(serialize, issue442_02)
{
    test442<int>("+123", "123\n");
}
TEST(serialize, issue442_10)
{
    test442<float>("2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_11)
{
    test442<float>("-2.35e-10", "-2.35e-10\n");
}
TEST(serialize, issue442_12)
{
    test442<float>("+2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_20)
{
    test442<double>("2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_21)
{
    test442<double>("-2.35e-10", "-2.35e-10\n");
}
TEST(serialize, issue442_22)
{
    test442<double>("+2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_30)
{
    test442<char>("'a'", "'a'\n");
}
TEST(serialize, issue442_31)
{
    test442<char>("' '", "' '\n");
}
TEST(serialize, issue442_40)
{
    test442<char>("\"a\"", "\"a\"\n");
}
TEST(serialize, issue442_41)
{
    test442<char>("\" \"", "\" \"\n");
}
biojppm commented 4 months ago

enable overflow/underflow detection.

The good news is that the c4 library contains facilities to detect overflow; see c4::overflows<T>(csubstr) in the header charconv.hpp. The bad news is that this is explicit opt-in behavior; this is done this way because overflow detection has a noticeable deserialization cost. Do something like this:

T obj;
if(ryml::overflows<T>(node.val()))
    ERROR("overflow detected");
else
   node >> obj;

Note that overflows<T>() requires that T is an integral fundamental type; it is SFINAE'd for signed vs unsigned types. There is no overflow detection for float or double; for this, you will have to use other deserialization facilities.

for numbers, don't add any quotes.

This is the intended behavior. Somehow, the existing tests did not catch this problem.

If it adds quotes around scalars, at least be consistent (either always use single quotes or double quotes).

Adding quotes is not intended behavior. The intended behavior is the following:

In short adding quotes to parsed plain scalars is definitely a bug. Doesn't matter if they are numbers or alphabetical sequences.

Ideally, support parsing the + sign in front of ints or doubles

This is not there at the moment, and I agree it should be there.

Hold on while I investigate and fix the problem.

biojppm commented 4 months ago

I overlooked something in your report. In your code, where you do

// serialization
ryml::Tree out_tree;
out_tree.rootref() << obj;

... you are not setting the style flag. So the resulting node will have no style, and will be subjected to the emitter style heuristics, as an instance of a tree created programatically, like I detailed above. For example, if we do

out_tree.rootref().set_val_style(tree.rootref().type() & VAL_STYLE); // copy the style of the original parsed tree

or, if that's not possible,

out_tree.rootref().set_val_style(VAL_PLAIN); // assign explicit style

then all the tests now pass, and the only remaining problems are

biojppm commented 4 months ago

There was indeed a problem with the emitter heuristics:

modified   src/c4/yml/node_type.cpp
@@ -153,6 +153,10 @@ bool scalar_style_query_plain(csubstr s) noexcept
         else if(s.sub(2).is_number())
             return true;
     }
+    else if(s.begins_with_any("0123456789.-+") && s.is_number())
+    {
+        return true;
+    }
     return s != ':'
         && ( ! s.begins_with_any("-:?*&,'\"{}[]|>%#@`\r")) // @ and ` are