serialization/deserialization behavior customization

pkerichang commented 5 months ago

Hello,

I just found out about this library today and I was just trying things out. Overall I was able to get things working, however, there are some serialization/deserialization behaviors I would like to customize, but couldn't figure out how from the documentation.

For reference, my code is effectively doing the following:

// deserialization
auto content = c4::to_csubstr(data);  // convert from std::string_view to c4::csubstr
T obj;  // T is a scalar type like int, char, double, etc.
auto tree = ryml::parse_in_arena(content);
tree.rootref() >> obj;

// serialization
ryml::Tree out_tree;
out_tree.rootref() << obj;

std::ostream ostream(&buf);  // buf is a boost IO stream buffer that writes to a std::string
ostream << out_tree;

By trying various scalar values, what I noticed is that:

If I pass in a negative number like -123 or -2.35e-10, the serialized output string is always quoted with single quotes, like '-123' and '-2.35e-10'
If I pass in a single space character with single quotes like ' ', the output is a double quoted string like " "
If I pass in a number with a plus sign like +123 or +3.5e+6, it raises an exception.
As far as I can tell from documentation, there is no overflow/underflow detection by default, but I couldn't figure out how.

I was wondering if I can tweak the parsing behavior, so that:

for numbers, don't add any quotes.
If it adds quotes around scalars, at least be consistent (either always use single quotes or double quotes).
Ideally, support parsing the + sign in front of ints or doubles
enable overflow/underflow detection.

Is it possible to customize the library with these changes, or do I need to use some adaptor/intermediate custom type tricks to do these customizations? If I do need to use adaptors/custom types, is there some examples that I can look at? Thanks!

biojppm commented 4 months ago

How are you serializing/deserializing your type T?

pkerichang commented 4 months ago

I am serializing/deserializing with the exact code I posted. It is a template method, then I call it with T=int, double, chat, etc , with just the input strings containing the corresponding types

biojppm commented 4 months ago

Thanks for reporting. I was able to reproduce with the following test code:

template<class T>
void test442(csubstr input, csubstr expected)
{
    // as a seq member
    {
        const std::string input_str = formatrs<std::string>("- {}", input);
        const std::string expected_str = formatrs<std::string>("- {}", expected);
        const Tree tree = parse_in_arena(to_csubstr(input_str));
        T obj;  // T is a scalar type like int, char, double, etc.
        tree[0] >> obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(tree));
        Tree out_tree;
        out_tree.rootref() |= SEQ;
        out_tree[0] << obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(out_tree));
    }
    // as a map member
    {
        const std::string input_str = formatrs<std::string>("val: {}", input);
        const std::string expected_str = formatrs<std::string>("val: {}", expected);
        const Tree tree = parse_in_arena(to_csubstr(input_str));
        T obj;  // T is a scalar type like int, char, double, etc.
        tree["val"] >> obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(tree));
        Tree out_tree;
        out_tree.rootref() |= MAP;
        out_tree["val"] << obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(out_tree));
    }
    // as a doc scalar
    {
        const std::string expected_str(expected.str, expected.len);
        const Tree tree = parse_in_arena(input);
        T obj;  // T is a scalar type like int, char, double, etc.
        tree.rootref() >> obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(tree));
        Tree out_tree;
        out_tree.rootref() << obj;
        EXPECT_EQ(expected_str, emitrs_yaml<std::string>(out_tree));
    }
}
TEST(serialize, issue442_00)
{
    test442<int>("123", "123\n");
}
TEST(serialize, issue442_01)
{
    test442<int>("-123", "-123\n");
}
TEST(serialize, issue442_02)
{
    test442<int>("+123", "123\n");
}
TEST(serialize, issue442_10)
{
    test442<float>("2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_11)
{
    test442<float>("-2.35e-10", "-2.35e-10\n");
}
TEST(serialize, issue442_12)
{
    test442<float>("+2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_20)
{
    test442<double>("2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_21)
{
    test442<double>("-2.35e-10", "-2.35e-10\n");
}
TEST(serialize, issue442_22)
{
    test442<double>("+2.35e-10", "2.35e10\n");
}
TEST(serialize, issue442_30)
{
    test442<char>("'a'", "'a'\n");
}
TEST(serialize, issue442_31)
{
    test442<char>("' '", "' '\n");
}
TEST(serialize, issue442_40)
{
    test442<char>("\"a\"", "\"a\"\n");
}
TEST(serialize, issue442_41)
{
    test442<char>("\" \"", "\" \"\n");
}

biojppm commented 4 months ago

enable overflow/underflow detection.

The good news is that the c4 library contains facilities to detect overflow; see c4::overflows<T>(csubstr) in the header charconv.hpp. The bad news is that this is explicit opt-in behavior; this is done this way because overflow detection has a noticeable deserialization cost. Do something like this:

T obj;
if(ryml::overflows<T>(node.val()))
    ERROR("overflow detected");
else
   node >> obj;

Note that overflows<T>() requires that T is an integral fundamental type; it is SFINAE'd for signed vs unsigned types. There is no overflow detection for float or double; for this, you will have to use other deserialization facilities.

for numbers, don't add any quotes.

This is the intended behavior. Somehow, the existing tests did not catch this problem.

If it adds quotes around scalars, at least be consistent (either always use single quotes or double quotes).

Adding quotes is not intended behavior. The intended behavior is the following:

trees parsed from YAML source retain the existing style in the source. That is, a single-quoted scalar will be flagged with that style, and likewise for every other scalar type (be it double-quoted, plain, folded or literal). Likewise for containers, ie flow vs block containers will be adequately tagged. The emitter must respect these flags when emitting.
when emitting trees created programatically which do not have style flags set for the nodes, the emitter will figure out what style should be used for each scalar, resorting to some reasonable heuristics. Numbers should definitely not be quoted.
If the style flags were (correctly) set programatically, then it is like the first case, and again the emitter must respect these flags.

In short adding quotes to parsed plain scalars is definitely a bug. Doesn't matter if they are numbers or alphabetical sequences.

Ideally, support parsing the + sign in front of ints or doubles

This is not there at the moment, and I agree it should be there.

Hold on while I investigate and fix the problem.

biojppm commented 4 months ago

I overlooked something in your report. In your code, where you do

// serialization
ryml::Tree out_tree;
out_tree.rootref() << obj;

... you are not setting the style flag. So the resulting node will have no style, and will be subjected to the emitter style heuristics, as an instance of a tree created programatically, like I detailed above. For example, if we do

out_tree.rootref().set_val_style(tree.rootref().type() & VAL_STYLE); // copy the style of the original parsed tree

or, if that's not possible,

out_tree.rootref().set_val_style(VAL_PLAIN); // assign explicit style

then all the tests now pass, and the only remaining problems are

the leading + sign cannot be parsed by the c4::atoi() implementation
the emitter heuristics for the leading - signs fail because they will quote the scalar even when it is a number.

biojppm commented 4 months ago

There was indeed a problem with the emitter heuristics:

modified   src/c4/yml/node_type.cpp
@@ -153,6 +153,10 @@ bool scalar_style_query_plain(csubstr s) noexcept
         else if(s.sub(2).is_number())
             return true;
     }
+    else if(s.begins_with_any("0123456789.-+") && s.is_number())
+    {
+        return true;
+    }
     return s != ':'
         && ( ! s.begins_with_any("-:?*&,'\"{}[]|>%#@`\r")) // @ and ` are

biojppm / rapidyaml

serialization/deserialization behavior customization #442