RFC: Add StAX/Pull-style parsing interface

dlang-community / SDLang-D

An SDLang (Simple Declarative Language) library for D

http://sdlang.org

Other

120 stars 21 forks source link

RFC: Add StAX/Pull-style parsing interface #14

Closed Abscissa closed 10 years ago

Abscissa commented 10 years ago

SDLang-D needs to add a StAX/Pull-style interface, bypassing the DOM. The current DOM would be adjusted to work on top of this pull parser interface.

Before I get started, I'm posting an initial design here to solicit feedback.

Pinging @marler8997 and @s-ludwig

Proposed design:

/+
PullParser: an InputRange

This is a class to prevent unintended copying (since
this isn't a ForwardRange).

Example:
------------------
parent 12 attr="q" {
    childA 34
    childB 56
}
lastTag
------------------

The PullEvent sequence of emitted for that SDL document would be as
follows (indented for readability):
------------------
PullTagStart (parent)
    PullValue (12)
    PullAttribute (attr, "q")
    PullChildrenStart
        PullTagStart (childA)
            PullValue (34)
        PullTagEnd
        PullTagStart (childB)
            PullValue (56)
        PullTagEnd
    PullChildrenEnd
PullTagEnd
PullTagStart  (lastTag)
PullTagEnd
PullFileEnd
------------------
+/
final class PullParser {

    [...stuff...]

    PullEvent front;
    void popFront() {...}

    // Skips to current tag's PullChildrenStart event.
    // If this tag has no children, skips to this
    // tag's PullTagEnd
    void skipToChildren() {...}

    // Skips to the event immediately after the PullTagEnd
    // of the current tag.
    void skipPastTag() {...}

    // Skips to the event immediately after the PullTagEnd
    // of the current tag's parent.
    void skipPastParent() {...}
}

// InputRage Parser's element:
alias PullEvent = std.variant.Algebraic!(
    PullTagStart,
    PullTagEnd,
    PullValue,
    PullAttribute,
    PullChildrenStart,
    PullChildrenEnd,
    PullFileEnd,
);

// Types of events:
struct PullTagStart {
    string namespace;
    string name;
}
struct PullTagEnd { /+nothing+/ }
struct PullValue {
    sdlang.token.Value value; // SDLang-D's already existing Value type.
}
struct PullAttribute {
    string namespace;
    string name;
    sdlang.token.Value value; // SDLang-D's already existing Value type.
}
struct PullChildrenStart { /+nothing+/ }
struct PullChildrenEnd { /+nothing+/ }

marler8997 commented 10 years ago

The interface looks reasonable. I'm not sure exactly what the peek function of the Algabraic template get's translated to, so I would probably want to see how it compared to just using a PullType enum with a struct that holds all the information for all cases.

linenumber/column/depth. Would you put those inside the Pull structure?
I would use const(char)[] instead string for the name/namespace fields. In most cases the character array being passed to the parser will be mutable since it will likely have been read from a file/socket as opposed to being an immutable string. In your unit tests it will likely be a string, but this isn't the normal use case. If you use const(char)[] instead of string, the user will know the field is not immutable since it most likely won't be anyway.
Maybe you could add an integral parameter to skipPastParent and call it break(depthCount). Then you can break out of as many parent tags as you want, and then make the default depthCount be 1 or something? Now that I think about it, modern languages don't really support this feature when you break from a loop, so maybe this isn't that useful. Just a thought.

Abscissa commented 10 years ago

I'm not sure exactly what the peek function of the Algabraic template get's translated to, so I would probably want to see how it compared to just using a PullType enum with a struct that holds all the information for all cases.

https://github.com/D-Programming-Language/phobos/blob/master/std/variant.d#L696

Not sure what typeid comparison entails, but it doesn't look too bad.

In any case, if there are any significant inefficiencies with std.variant, that should really be considered an issue that std.variant needs to fix. If some part of Phobos is too slow for practical use, then Phobos will need to be fixed anyway.

linenumber/column/depth. Would you put those inside the Pull structure?

Good point. I was avoiding including unnecessary bookkeeping info, but those barely require any real overhead and they're essential for good error messages (and more), so yea, they should be included.

I'm thinking I'll just add an int depth; field to the existing struct Location and toss one of those Locations into each of the Pull* event structs.

I would use const(char)[] instead string for the name/namespace fields. In most cases the character array being passed to the parser will be mutable since it will likely have been read from a file/socket as opposed to being an immutable string. In your unit tests it will likely be a string, but this isn't the normal use case. If you use const(char)[] instead of string, the user will know the field is not immutable since it most likely won't be anyway.

Hmm, that could be a good point. I'll keep that in mind. It's be pointless ATM since it currently only takes strings as SDL input, but it might be good for forward compatibility.

Maybe you could add an integral parameter to skipPastParent and call it break(depthCount). Then you can break out of as many parent tags as you want, and then make the default depthCount be 1 or something?

Not a bad idea really, and it would only be a minor change (skipPastParent will have to track depth anyway). I could see it potentially being useful:

//...(find tag named "someTag")...
//...(scan for child tag named "someChild")...
//...(scan for child tag named "grandkid")...
//...(scan for attribute named "foo")...
//...(grab the value of attribute "foo")...
// Got what we need from someTag, so back out:
pullParser.skipPastParent(2) // Skip past grandparent

skipPastTag could probably just wrapper for skipPastParent(0)

But maybe this skipPastParent(int) should be called skipPastAncestor(int generations):

void skipPastTag() { skipPastAncestor(0); }
void skipPastParent() { skipPastAncestor(1); }
void skipPastGrandparent() { skipPastAncestor(2); }
// Beyond that, the user can writer their own damn wrapper ;)

Now that I think about it, modern languages don't really support this feature when you break from a loop, so maybe this isn't that useful. Just a thought.

Well, D does have "break to label". And this wouldn't really be comparable to breaking out of a loop anyway: more like someDom.parent.parent.parent....parent. or ../../../somedir.

Abscissa commented 10 years ago

Another thing that just occurred to me: Maybe there should also be this:

class PullParser // an InputRange
{
    [...]
    PullParser untilChildren() {...}
    PullParser untilEndOfAncestor(int generations) {...}
    PullParser untilEndOfTag() { return untilEndOfAncestor(0); }
    PullParser untilEndOfParent() { return untilEndOfAncestor(1); }
}

These would return ranges which are just like this (and probably also advance this, since it's only an InputRange, not ForwardRange), but these sub-ranges end at the same places where skipToChildren and skipPastAncestor start. This would make it far easier to utilize std.algorithm to do useful things, like:

import std.algorithm;

void skipPastAncestor(PullParser pullParser, int generations) {
    pullParser.untilEndOfAncestor.skipOver!"true"();
}

// Skips to an attribute *within* the current tag, otherwise 
// consumes until the tag's children (if any) are reached.
void skipToAttribute(PullParser pullParser, string attrName) {
    pullParser.untilChildren.find!"a is type PullAttribute && a.name == attrName"();
}

// Finds a descendant tag with a given name
void skipToDescendant(string name) {
    pullParser.untilEndOfTag.find!"a is type PullTagStart && a.name == name"();
}

Pretty cool, I think. And later on, a buffering wrapper could be created to buffer the output of PullParser and convert it to a full-fledged ForwardRange, making the above stuff that much more useful (at the cost of having to store all of the PullEvents so they can be replayed in a .save()'ed copy).

All of this does, however, introduce the question of allocating the new PullParser class. Perhaps instead of being a class, PullParser should just be a struct with a @disabled postblit.

marler8997 commented 10 years ago

Having the 'until...' functions return a small wrapper struct would be nice so then you wouldn't have to allocate memory for them. The struct would have a pointer to the Pull parser class and possibly a target depth. It would be small enough to pass it by value around on the stack.

Abscissa commented 10 years ago

Having the 'until...' functions return a small wrapper struct would be nice so then you wouldn't have to allocate memory for them. The struct would have a pointer to the Pull parser class and possibly a target depth. It would be small enough to pass it by value around on the stack.

That would defeat the point of the original range being a class. And there's really no point in making the parser itself and the "until" stuff use totally different types - that would just be a mess with no real benefit.

Here's the main problem:

Even though it's not yet supported, I want to keep the door open for parsing directly from a stream. This means the pull parser must be an InputRange, not a ForwardRange. No saving state and continuing from a previous state, because streams can't handle that. (Plus, even if I did make it a ForwardRange and allow .save()ing the parser, I would want to ensure it's done explicitly via .save(), never by accidental struct copying.)

The idea of making the parser InputRange a class instead of a struct was specifically to prevent problems from accidentally duplicating it (since the state can't be duplicated, being a mere InputRange). If the until* stuff returns structs, then those can be trivially duplicated on accident and I may as well just make the original parser the same struct as well. Which I could do, although I'd still have to re-solve the original problem somehow.

However, as I mentioned above, instead of using a class, maybe it'd be better to use a struct with postblit set to @disable.

Another possibility (but one I don't want to do) is this: I could just be careful to make sure all copies of the struct are guaranteed to all stay perfectly in-sync. The problem is, there is nothing in D which prevents algorithms from accepting an InputRange and using struct copying in places where they should be using ".save()" and enforcing a ForwardRange. Yes, this is bad style and discouraged, but for many struct-based ForwardRanges this would work fine and go completely unnoticed. Even some struct-based InputRanges may just happen to work by pure change (or merely appear to work). If the pull parser hits one of these algorithms, then things will silently go wrong. I want to block that possibility.

marler8997 commented 10 years ago

Hmmmm...either I didn't explain my struct idea well enough, or I'm not understanding something. As I understand it, the "until..." functions are returning InputRanges. Oh well no big deal. I decided to write an example of how someone might want to use this kind of API, I've included the example with notes:

// Parsing the following SDL
// parent {
//     child "Jimmy"
//     child "Jamie"
// }

// Note: should probably have an "extension" method for the Albegbraic struct.  This could
//       be used for error messages to the user.
void nameForUser(ref PullEvent event) {
  if(event.peek!PullTagStart()) return "TAG_START";
  //...
  else assert(0);
}

PullParser parer = PullParser(sdl); // I would make this a struct, I don't
                                    // see a reason to make it a class.  Maybe I'm
                                    // missing something?

PullTagStart* tagStart;

foreach(event; parser) {
  // if(event.peek!PullFileEnd)
  // Note: don't need to check this because the empty function should take care of this

  tagStart= event.peek!PullTagStart();
  if(tagStart) {
    if(tagStart.name == "parent" ) {
      if(parser.skipToChildren()) {
        // skipToChildren returns the number of values/attributes it skipped
        writefln("Invalid SDL: the parent tag shouldn't have any values or attributes");
      }
      // Also: instead of skipToChildren, you could have enforceNoChildren or something

      writefln("found 'parent' tag");
      // children returns a struct that points to parser and implements an input
      // range that goes until the parser has run out of children at the current depth
      foreach(parentTagEvent; parser.children()) {
        tagStart = parentTagEvent
        if(!tagStart) error("expected TAG_START but got %s", parseTagEvent.nameForUser);
        if(tagStart.name != "child") error("parent tag can only have child tags named 'child' but got '%s'", tagStart.name);
        auto value = parser.popValue(); // returns null if next event is not a tag value
        if(!value) error("child tag must have a value");
        string childName = value.value!string.dup; // copy the name
        if(parser.skipPastTag()) error("child tag cannot have more than one value");
        writefln("found 'child \"%s\"' tag", childName);
      }
    } else {
      error("expected root tag to be 'parent' but was '%s'", tagStart.name);
    }
  } else {
    error("expected TAG_START but got %s", event.nameForUser);
  }

}

I did do some performance testing on using an Algebraic vs a struct with a PullType enum. The peek method was about 10 times slower than using an enum comparison. I looked a little at the implementation and it looks like he Algebraic template implements the peek method by attempting to cast the value to the given type. Not a big deal but something to consider. If I were to use the enum method the PullEvent would look something like this:

enum PullEventType { tagStart, tagEnd, value, attribute, childrenStart, childrenEnd }
struct PullEvent {
    PullEventType type;
    string name;
    sdlang.token.Value value; // SDLang-D's already existing Value type.
    string namespace;
    Location location; // Your location struct    
}

Also I just realized. If your PullEvent has the depth (inside the Location struct), you don't really need the childrenStart and childrenEnd event types. Instead, all you need to do is check whether the depth has changed.

Anyway, there was alot of information in this post. Feel free to take or leave any of it:) I'm not saying that you should implement it any certain way, I just wanted to give you an example and point out some things to consider.

Abscissa commented 10 years ago

I did do some performance testing on using an Algebraic vs a struct with a PullType enum. The peek method was about 10 times slower than using an enum comparison.

That was with compiler optimizations enabled? I would imagine inlining alone would make a big difference there.

If the issue is indeed the internals of Algebraic.peek, then I'd think it should be solvable by modifying Algebraic to internally use a compiletime-dynamically-generated enum instead of typeinfos and casting.

enum PullEventType { tagStart, tagEnd, value, attribute, childrenStart, childrenEnd }
struct PullEvent {
    PullEventType type;
    string name;
    sdlang.token.Value value; // SDLang-D's already existing Value type.
    string namespace;
    Location location; // Your location struct    
}

I don't like giving the user any fields that aren't applicable to the current event type. That's bug-prone programming by convention.

Instead, I'd use a union, with an enum denoting which union member should be used (ie, a classic tagged union). To enforce using only the correct version of the union, I'd wrap it up in a struct providing an interface which enforces correct usage. Problem is, at that point, I've just re-invented std.variant.Algebraic. Better to just fix std.variant.Algebraic.

Also I just realized. If your PullEvent has the depth (inside the Location struct), you don't really need the childrenStart and childrenEnd event types. Instead, all you need to do is check whether the depth has changed.

Technically, that's true, but it forces a little bit of extra bookkeeping onto every user of the lib, and for no substantial benefit that I can see. If they really want to do it that way, they can just ignore the childrenStart and childrenEnd events. Although I'm not sure why they'd want to anyway. It would be a little bit more work and, again, no real benefit.

Abscissa commented 10 years ago

Come to think of it, the PullChildrenStart and PullChildrenEnd events appear to be completely unnecessary regardless of whether or not the depth is provided. They're completely redundant with the PullTagStart and PullTagEnd events.

Abscissa commented 10 years ago

Ok, the StAX/Pull-style parser is basically in and working as of commit 37f8c5303b.

A few things are a little different from what I described above, but it's basically the same general idea. It's pretty well-documented anyway.

There is more I'd still like to do with it, like internally ensure reference semantics and generalize it into what would essentially be a compile-time twist on SAX/Push-style parsing, which everything else (this StAX/Pull interface, the old DOM interface, and even a traditional runtime-based SAX/Push-style one) would all be directly built on.

Speeding up phobos's Variant is something we can still check into. Failing that, we can always build an alternate StAX/Pull interface on top of the soon-to-come compile-time generic version if we REALLY need to, but I REALLY don't think maximal speed is all that critical at the moment, especially for DUB's light usage of it for reading tiny little config files.