breese / trial.protocol

Network wire protocols

Function find_key #26

Open · vinipsmaker opened this issue 6 years ago

vinipsmaker commented 6 years ago

Given this JSON:

{
  "abc": [ { "def": 4 }, {} ],
  "def": 42
}

and a reader which is positioned at the root JSON point:

json::reader reader(raw_json);
assert(reader.symbol() == json::token::symbol::begin_object);

find_key(reader, "def") would advance the reader to hold the point/value 42.

An initial, untested implementation is:

if (reader.symbol() != json::token::symbol::begin_object)
  return false;

if (!reader.next())
  return false;

while (true) {
  // Key
  if (reader.symbol() == json::token::symbol::end_object) {
    reader.next();
    return false;
  }

  if (reader.value<std::string>() == key) {
    reader.next();
    return reader.symbol() != json::token::symbol::error;
  }

  if (!reader.next())
    return false;

  // Value
  skip_element(reader);
  if (reader.symbol() == json::token::symbol::error)
    return false;
}

I think this function should only try to find the key if it is given the reader at the begin_object token, because the semantics are clearer. If you're already iterating over the object keys, there is no point in providing an “easier” function, as all you'd need is the skip_element function and to stay in the loop (one call to skip_element in the else clause and nothing more... which is 2 extra lines of user code...).
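
For illustration only, here is a rough, untested sketch of that manual loop, assuming (as in the implementation above) that skip_element() leaves the reader just past the skipped value:

while (reader.symbol() != json::token::symbol::end_object) {
  if (reader.value<std::string>() == "def") {
    reader.next(); // now positioned at the value 42
    // ... consume the value here ...
    break;
  } else {
    reader.next();        // move from the key to its value
    skip_element(reader); // the two "extra lines": skip the value and stay in the loop
  }
}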

Also, if you think this function can belong to Trial.Protocol, another function could be close_current_object which would pair with the reader advanced by find_key nicely.

breese commented 6 years ago

I am hesitating a bit, not because I dislike the suggestion, but because I am trying to understand if there is a larger framework lurking behind this.

For example, we could create a "virtual container" that wraps json::reader and provides input iterators (or even forward iterators) such that we could use std algorithms instead.

vinipsmaker commented 6 years ago

but because I am trying to understand if there is a larger framework lurking behind this.

we could create a "virtual container" that wraps json::reader and provides input iterators (or even forward iterators) such that we could use std algorithms instead

Let's let this issue sleep for a few months. I'll think about it.

breese commented 6 years ago

I have added the feature/viewer branch to experiment with this.

The branch contains a json::viewer class that acts as such a "virtual container". It is constructed with a buffer view, and its iterator uses json::reader.

The only "documentation" is test/json/viewer_suite.cpp.
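
As a purely hypothetical usage sketch (the iterator's member names here are assumptions, not the actual API on the branch), such a viewer could let std algorithms operate directly on the token stream:

json::viewer view(raw_json);
auto objects = std::count_if(view.begin(), view.end(),
                             [](const auto& token) {
                               return token.symbol() == json::token::symbol::begin_object;
                             });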

vinipsmaker commented 6 years ago

I think level() should be accessible from the iterator as well.

And rename viewer to flat_viewer or flatten_viewer. Do you happen to know of any "tree iterator" that was proposed for C++ in the past?

I think this should only be merged to develop once we implement this tree viewer for a binary protocol as well. Not all binary protocols are based on begin_object/end_object tokens; some may just state the size of the list/object at the begin_object token. I wonder what value level() should report, or which alternative approach has been proposed in the past. For the flatten_viewer it should be easier.

breese commented 6 years ago

I agree about level(). We should also expose symbol(), category(), and error().

breese commented 6 years ago

I would like to keep the viewer name for now. I tend towards adding more iterator types (e.g. recursive iterators) to the viewer class, rather than adding a recursive viewer.

There was an old C++ proposal for adding trees, where the recursive iterators were called cursors.

breese commented 6 years ago

BinToken has both begin-end arrays (symbol::begin_array and symbol::end_array) and type-length-value arrays (symbol::array). The latter is a homogeneous container that you cannot iterate into, just like a symbol::string.

vinipsmaker commented 3 years ago

For example, we could create a "virtual container" that wraps json::reader and provides input iterators (or even forward iterators) such that we could use std algorithms instead.

Btw, an iterator for json::reader would have an interface along the lines of:
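
(A sketch reconstructed from the description below; the member names and the path element type are assumptions, not the project's actual API.)

class reader_iterator
{
public:
    explicit reader_iterator(json::reader reader);

    reader_iterator& operator++();                // advance to the next value
    const std::vector<std::string>& path() const; // keys leading to the current value
    // symbol(), level(), value<T>(), ... forwarded from the underlying reader

private:
    json::reader reader_;
    std::vector<std::string> path_;
};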

That's basically what I used in a recent project of mine. I didn't actually reify it into a real iterator, but I manipulated the same data structure that such an iterator would use (a json::reader + a std::vector<> path): https://gitlab.com/emilua/emilua/-/blob/798d906db92bc1da4eeed88c223b0d788dcc3521/src/json.cpp#L138

But that's just FYI, as they say. I'm not asking for an iterator implementation; I'm just reporting back from one recent experience I had.

vinipsmaker commented 3 years ago

So I believe #16 (scanf()) mostly obsoletes the uses I had for find_key. However, let's review the Boost.Serialization glue in Trial.Protocol again, and we'll see a variation of this idea that fills a very common use case.

Boost.Serialization generates non-introspectable archives. The output format will be meaningless without the C++ source code. The data schema resides not in the final archive itself, but in the C++ code. The final archive is just a bunch of S-exprs in new disguises.

However, Trial.Protocol offers the user the chance not merely to control serialization code for C++ structures, but to define protocols. You made this small observation yourself in one of our email exchanges. So if the user has a type Foobar, he can:

template <typename CharT>
struct load_overloader<json::basic_iarchive<CharT>,
                       Foobar>
{
    static void load(json::basic_iarchive<CharT>& ar,
                     Foobar& data,
                     const unsigned int /*protocol_version*/)
    {
        ...
    }
};

That's where we can help with a few algorithms. There's already partial::skip() and partial::parse(). There'll be partial::scanf() and Hana's Struct integration soon. All these algorithms help to map JSON object members to C++ members. For instance, they make it easy for hand-written code to map:

{
    "foo": 2,
    "bar": 3
}

into:

struct Foobar
{
    int foo;
    int bar;
};
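
For instance, a hand-written load for that flat mapping might look roughly like this (a sketch only: value<int>() is assumed by analogy with the value<std::string>() calls shown earlier, and error handling plus nested values are ignored):

bool load_foobar(json::reader& reader, Foobar& data)
{
    if (reader.symbol() != json::token::symbol::begin_object)
        return false;
    reader.next();
    while (reader.symbol() != json::token::symbol::end_object) {
        const auto key = reader.value<std::string>();
        reader.next(); // move from the key to its value
        if (key == "foo")
            data.foo = reader.value<int>();
        else if (key == "bar")
            data.bar = reader.value<int>();
        reader.next(); // move past the value
    }
    return true;
}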

However the network protocol sometimes defines object hierarchies. And as Marcel Weiher has noted in his call for "less lethargic JSON support" [sic]:

[...] though with some added complexity and less convenience because JSON is a less informative file format than XML. Open- and close-tags really give you a good heads-up as to what's coming that "{" just does not.

We need to look ahead to check for some "type" attribute in the JSON stream just to figure out which specific type of the hierarchy we should instantiate for the current point in the substream. For instance:

{
    "x": 4,
    "type": "pos",
    "y": 5
}

And the C++ classes:

struct Position
{
    int x;
    int y;
};

struct Image
{
    // ...
};

struct Root
{
    std::vector<std::variant<Position, Image>> collection;
};

So... to help the hand-written code that implements this network protocol (as opposed to serialization code), we could offer an algorithm such as:

template<class F>
bool copy_and_look_ahead(const json::reader& reader, const std::string& key,
                         F&& f, std::error_code& ec);

The function copies reader and modifies its own copy to traverse the substream. So it's handling one responsibility and one responsibility only: look-ahead. It'll call f, passing its internal copy of reader positioned at the attribute named by key, if it is found in the stream. The return value just informs whether f was called. Now the user can easily write hand-written code to perform look-aheads and implement such protocols:

enum { POS, IMAGE, UNKNOWN } current;
auto on_value = [&](json::reader& reader) {
    // ... more code to check current type is string ...
    if (reader.value<std::string>() == "pos")
        current = POS;
    else if (reader.value<std::string>() == "image")
        current = IMAGE;
    else
        current = UNKNOWN;
};
std::error_code ec;
if (!json::copy_and_look_ahead(reader, "type", on_value, ec))
    throw std::runtime_error("invalid protocol");
switch (current) {
// ...
}

The use case is clear and the semantics are much easier to figure out than the initial find_key proposed in this issue. json::partial::scanf() and json::copy_and_look_ahead() alone are plenty powerful and can cover a vast number of use cases.

breese commented 3 years ago

I like this approach.

Serialization gives us support for protocols with a schema where data is determined by their position, and this look-ahead functionality gives us support for tag-based protocols.

vinipsmaker commented 1 year ago

There was an old C++ proposal for adding trees, where the recursive iterators were called cursors.

I've been playing with C++17's std::filesystem for the last days, and they already added a recursive iterator to the standard library: https://en.cppreference.com/w/cpp/filesystem/recursive_directory_iterator

There's a “flat” iterator as well for iterating over the same “collection”: https://en.cppreference.com/w/cpp/filesystem/directory_iterator

The tree/recursive iterator is more complex and has a few more methods than the usual ones found on “flat” iterators: depth(), recursion_pending(), disable_recursion_pending(), and pop().

The last two methods fill the same gap as the recurse() method that I suggested earlier. The difference is that std::filesystem::recursive_directory_iterator recurses by default and you opt out of recursion on an item-by-item basis. For instance, if I'm implementing a tool such as tree, I can implement the option --gitignore by calling disable_recursion_pending() every time iterator->path().filename() matches an entry in the gitignore.
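
A minimal sketch of that --gitignore idea against the std::filesystem API (gitignore_contains() is a hypothetical predicate standing in for the actual gitignore lookup):

#include <filesystem>
#include <iostream>
#include <string>

bool gitignore_contains(const std::filesystem::path& name); // hypothetical lookup

void print_tree(const std::filesystem::path& root)
{
    namespace fs = std::filesystem;
    for (auto it = fs::recursive_directory_iterator(root);
         it != fs::recursive_directory_iterator(); ++it) {
        if (it->is_directory() && gitignore_contains(it->path().filename()))
            it.disable_recursion_pending(); // opt out: do not descend into this entry
        std::cout << std::string(it.depth(), '\t')
                  << it->path().filename().string() << '\n';
    }
}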

I think there have been enough stories collected about tree iterators to make a decision already. I used Trial.Protocol on many projects with different needs. On the ones where I could have used a tree iterator, I've reported back the requirements I'd have. Independent teams designed a recursive iterator for std::filesystem, and the requirements converge. I've played with the recursive iterator from std::filesystem on its own problem space (e.g. ignoring CVS directories during a recursive iteration), and although it wasn't heavy usage, I do think the interface captured what must be there (hopefully way more people played with this interface before it got into std::filesystem... or at least that's what I expect).

breese commented 1 year ago

I experimented with a recursive iterator a while back. The design consisted of two parts:

I did not proceed with this design because the std algorithms assume that iterators are as simple as pointers and therefore copy iterators rather than moving them. This caused all sorts of odd behavior because they manipulated the copies, not the original state.

Now that I look at the std::filesystem::recursive_directory_iterator implementation, I can see that they use shared_ptr for the state.

Regarding disable_recursion_pending(), it requires additional state to keep track of the disabling. I am not familiar with the use-cases that resulted in this design, but I can think of two more attractive design alternatives:

  1. Let operator++() advance recursively, and add an advance() method to advance flatly (see the sketch after this list).
  2. Create a flat iterator that shares state with the recursive iterator.
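
A rough interface sketch of alternative 1 (the names are illustrative only):

class recursive_iterator
{
public:
    recursive_iterator& operator++(); // advance depth-first, descending into children
    recursive_iterator& advance();    // advance flatly, to the next sibling
    int level() const;                // current nesting depth
    // ...
};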

vinipsmaker commented 1 year ago

I did not proceed with this design because the std algorithms assume that iterators are as simple as pointers and therefore copy iterators rather than moving them. This caused all sorts of odd behavior because they manipulated the copies, not the original state.

For std::filesystem::recursive_directory_iterator, it's easy to find the warning “invalidates all copies of the previous value of *this” on many of its members (e.g. pop(), operator++()).

I am not familiar with the use-cases that resulted in this design [disable_recursion_pending()]

I wrote Lua bindings for std::filesystem::recursive_directory_iterator that just copied the original interface (at least as much as possible... Lua doesn't really use operator++() for iteration... nor should I expose any unsafe interface such as copying iterator objects that might be invalidated later). Here's an example of how to use it:

for entry, depth in fs.recursive_directory_iterator(fs.path.new(".")) do
    print(string.rep("\t", depth) .. tostring(entry.path:filename()))
end

If one wanted to ignore .git subdirectories:

for entry, depth in fs.recursive_directory_iterator(fs.path.new(".")) do
    if entry.path:filename() ~= ".git" then
        print(string.rep("\t", depth) .. tostring(entry.path:filename()))
    end
end

The use case for ignoring CVS subdirectories is common enough that it should be possible to abstract it. For this, we can write the following:

function cvs_exclude(iter, ctrl)
    local function next()
        local entry, depth = iter()
        if entry == nil then
            return
        end

        local p = tostring(entry.path:filename())
        if p == ".git" or p == ".svn" or p == ".hg" then
            ctrl:disable_recursion_pending()
        end
        return entry, depth
    end
    return next, ctrl
end

for entry, depth in cvs_exclude(fs.recursive_directory_iterator(fs.path.new("."))) do
    print(string.rep("\t", depth) .. tostring(entry.path:filename()))
end

The C++17 design helped my Lua abstraction. Now we can not only compose one level of abstraction, but also stack more behaviors on top.

However, C++ is not Lua. For C++, the same could be achieved with the traditional iterator design (we just need to give a different meaning to operator++() in the wrapping template). I thought it'd be useful to share this story to shed light on problems that can make use of C++17's design.
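
Purely as an illustration of that remark, such a wrapping type in C++ might look like this (the class itself is hypothetical; only the std::filesystem names are real):

#include <filesystem>
#include <utility>

class cvs_excluding_iterator
{
public:
    explicit cvs_excluding_iterator(std::filesystem::recursive_directory_iterator it)
        : it_(std::move(it)) {}

    const std::filesystem::directory_entry& operator*() const { return *it_; }

    cvs_excluding_iterator& operator++()
    {
        // The "different meaning" for operator++(): apply the cvs_exclude
        // policy before advancing the underlying recursive iterator.
        const auto name = it_->path().filename();
        if (name == ".git" || name == ".svn" || name == ".hg")
            it_.disable_recursion_pending();
        ++it_;
        return *this;
    }

private:
    std::filesystem::recursive_directory_iterator it_;
};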

I can think of two more attractive design alternatives [...]

Even if we go with such alternatives, I'd like to also keep the method pop() from C++17.

vinipsmaker commented 1 year ago

On another note, there's another useful side of std::filesystem::recursive_directory_iterator for Trial.Protocol. This iterator deals with IO. IO can always error, so it must report errors even if the invariants for data structures internal to the program are never broken. std::filesystem::recursive_directory_iterator deals with this problem by offering an overload taking a std::error_code argument (and throwing an exception for the traditional overload): https://en.cppreference.com/w/cpp/filesystem/recursive_directory_iterator/increment

That's something Trial.Protocol's iterator could also mirror.

breese commented 1 year ago

I got two experiments mixed up. The recursive iterator I mentioned above was for the dynamic class. The json::viewer class referred to earlier is a wrapper around json::reader with an iterator API. Could you have a look at the branch to see if you agree with the direction?

Adding pop() makes sense.

Regarding increment(error), I find it better to expose json::reader::error() in the json::viewer API. Consider the std::filesystem example:

  it.increment(error);
  switch (error) ...

which becomes as follows with json::viewer::error()

  ++it;
  switch (it.error()) ...

You can also use json::viewer::symbol() if you wish to avoid std::error_code.

The std::filesystem syntax for increment() is a bit odd - it returns a reference to the iterator so you can chain calls together (e.g. *it++) and the error code is returned as an output argument, so you have to break the chain to check for errors.

I am still skeptical about disable_recursion_pending() because it violates the zero overhead principle. I would prefer a separate method to skip to the next sibling.

vinipsmaker commented 1 year ago

Could you have a look at the branch [feature/viewer] to see if you agree with the direction?

It almost works for me, but I also need a path object telling me where I am. I can't imagine a use case where the lack of path information would work.

The std::filesystem syntax for increment() is a bit odd - it returns a reference to the iterator so you can chain calls together (e.g. *it++) and the error code is returned as an output argument, so you have to break the chain to check for errors.

I agree with your assessment. However, putting increment() aside, the (throwing) interface in operator++() could be the right answer. Algorithms working on C++ iterators don't check for “traversal errors”. Exceptions are really the only hope to make them work.

I am still skeptical about disable_recursion_pending() because it violates the zero overhead principle. I would prefer a separate method to skip to the next sibling.

I'm not sure what the answer is here. For algorithms, suppose:

std::for_each(trial_protocol_iterator, end, [](auto&& x) {
  if (x.path().back() == "foobar")
    x.disable_recursion_pending();
});

The interface disable_recursion_pending() is the only hope the callback has to affect iteration. Maybe yet another wrapper class that adds this behavior on top (and the base class would only expose the explicit method to skip to the next sibling)?

I think the use cases diverge too much. The example above doesn't need to recurse on every subobject. The use case I had earlier required me to recurse on every subobject. An iterator might be used to reuse the iterators ecosystem, or maybe the user doesn't care about it (the latter is the use case I fall into).

breese commented 1 year ago

It almost works for me, but I also need a path object telling me where I am. I can't imagine a use case where the lack of path information would work.

That information is currently not stored anywhere. I have to think about how we can add that without introducing unnecessary overhead.

std::for_each(trial_protocol_iterator, end, [](auto&& x) { if (x.path().back() == "foobar") x.disable_recursion_pending(); });

x is the current element, not an iterator, so x.disable_recursion_pending() is not valid.

We would have to resort to a plain for loop, in which case we can do this:

  for (auto it = viewer.begin(); it != viewer.end(); ) {
    if (it->path().back() == "foobar") {
      it.next_sibling(); // Or whatever its name will be
      continue;
    }
    // Do stuff
    ++it;
  }

Another approach we should consider is to make a sibling-iterator instead of a recursive-iterator and let the user recurse when needed.