Parser specification recommendations

Hey Alex,

Thanks for putting this up on HN! I threw together a really basic parser last night and ran into a couple things that I think are worth clarifying:

1. The spec should require that every item is a line.

That means each line must end with a trailing \n. This helps resolve ambiguity in the specification so far. This also implies that every item, including the last item in the file, needs a trailing \n to parse. That's pretty standard for just about any text editor and many other text file formats, so it should be fine.

2. Regarding annotations in the following form:

Items:
  - Item Y
        [I belong to Item Y]
  - [I belong to Item Z]
          Item Z

I think the second case, an annotation "belonging to" it's child is problematic. My recommendation is to drop it, because I think it's difficult for humans to understand, let alone a parser.

If you did not want to drop that, then I would propose it be reinterpreted as follows:

Items:
  - Item Y
        [I belong to Item Y]
  - [I am an item with no value, only this annotation]
          I'm a child of this value-less item.

Again I think just requiring that a line have a value makes more sense, but if you feel it's critical to keep the format shown above then I think a "valueless" item makes more sense than inverting the relationship of values and annotations. It's confusing to both humans and machines.

3. Annotation types:

UPDATE: based on the discussion below I ended up reconsidering a bunch of these details and instead landed on there being only one kind of annotation, an object like {key: "optional", value: "required"}. Tasks only make sense as a special type of Item. Thus any readers who made it this far can skip past this part :)

In your guidebook you show examples like this:

I'm a plain old item 
   [and I'm an annotation]

and

The Crying of Lot 49
   [author: Thomas Pynchon]
   [publication year: 1966]
   [publisher: J. B. Lippincott & Co.]

But in your sample area your examples show that annotations stored as a hash, with the "index" as the keys.

This several cases unclear.

How should a note with no "index" (ie something before a colon) be stored?
What happens if two annotations are given with the same "index"?

BTW as an aside I think you might want to call those "tags" or "keys" bc. "index" is a little confusing, but... that's not critical.

Combining this with the other annotation types, it becomes difficult to reason about how the underlying data should be represented. I would propose the following:

There are three kinds of Annotations: note, index, and task.
All annotations are enclosed in square brackets [...]
A task is a square bracket enclosing exactly one character: [ ], [x], [✓], followed by any amount of text and terminated by a \n
An "index" (again, name is weird lol), is an annotation with characters separated by a colon: [index: content]
An "index" may occur only once per item. If an index is repeated, the last occurrence will be used.
Any other annotation is considered a note.

That would result in the following:

Big Item
  [this is just a note]
  [this is another note]
  [author: me]
  [category: code example]
  [category: margin example]
  [x] Put this on Github
  [ ] Get it adopted?
  Everything above is an annotation on Big Item, but I am a child.
  Of course big item can have multiple children.

And the JSON representation of this would look like this (omitting the raw stuff):

{
  "value": "Big Item",
  "annotations": {
    "notes": ["this is just a note", "this is another note"],
    "indices": {
      "author": "me",
      "category": "margin example"
    },
    "tasks": [
      { "done": true, "value": "Put this on Github" },
      { "done": false, "value": "Get this adopted?" },
    ]
  },
  "children": [
    { "value": "Everything above is an annotation on Big Item, but I am a child." },
    { "value": "Of course big item can have multiple children."}
  ]
}

I have a few other thoughts but those were the big ones and I know this is a lot to digest, so let me know what you think about all the above. Thanks for sharing your project, this is very cool :)

Mostly agree. Maybe there should be an exception for multi-line annotations with a special syntax? Idk. Syntax could be wrapping it in parentheses or the pipe syntax shown in another issue.
Wholeheartedly agree. This is the single largest issue with writing a parser right now. I’m also working on one in Rust and this is exactly what I ran into. Also agree that it’s unintuitive for humans. If there’s no specific case where it’s obviously really helpful for readability, then it should probably be dropped.
Don’t agree so much with this one. I think it primarily stems from ambiguity around annotation “keys” (aka “types”). For [a: b], a is the key and b is the value. For just [a], I think a is supposed to be a key with an empty value. This simplifies the data model significantly. It’s just key -> Option<value> in pseudo-Rust syntax. Also, having special cases for different kinds of annotations seems to defeat the purpose of margin, which is “open-minded” and flexible. I really like the annotations model and its simplicity.

Additionally, the parser should probably strip leading and trailing whitespace in annotation values so that [a: b] is parsed as a and b instead of a and •b.

@vlmutolo thanks for the feedback.

Re: annotation types, I would push back as follows:

First, storing data as...

{ "this is the actual value": null }

... is ergonomically pretty weird. It becomes more difficult to reference this in many languages as you now cannot figure out what is in the key of that hash unless you iterate them all, which kind of defeats the purpose.

My proposal is to actually make it simply key -> value, by defining a number of top level keys that correspond to the different semantics of different annotation types. I don't think this limits what annotations can do, but rather provides structure around it by embodying the different annotation types so that they can be used more naturally.

For example, in my proposed structure all the tasks are easy to identify and it is trivial to see how many they are and how many of them are done. Imagining I was working in javascript for a moment, working with the example in my original comment:

marginRoot.children.find( item => item.value == "Big Item" ).annotations.tasks.filter( task => !task.done)

Something almost identical in Ruby

margin_root.children.select{|c| c.value == "Big Item"}.annotations.tasks.select{|t| !t.done }

And so on. The ergonomics are improved because we can now have any number of different annotation types in the future by merely designating an annotation type (a key in the annotations hash), and relying on the rule that anything that doesn't have a special rule associated with it is just a "note."

Many good points by both of you!

An "index" may occur only once per item. If an index is repeated, the last occurrence will be used.

I think that specifying multiple values for a single key could be useful in some cases, like specifying multiple categories.

Could annotations be objects with the keys key and value? For the "Big Item" example by @burlesona, the JSON representation would be like this (omitting the raw stuff):

{
  "value": "Big Item",
  "annotations": [
    {
      "key": "this is just a note",
      "value": null
    },
    {
      "key": "this is another note",
      "value": null
    },
    {
      "key": "author",
      "value": "me"
    },
    {
      "key": "category",
      "value": ["code example", "margin example"]
    }
  ],
  "children": [
    {
      "value": "Put this on Github",
      "annotations": [
        {
          "key": "x",
          "value": null
        }
      ]
    },
    {
      "value": "Get this adopted?",
      "annotations": [
        {
          "key": " ",
          "value": null
        }
      ]
    },
    {
      "value": "Everything above is an annotation on Big Item, but I am a child."
    },
    {
      "value": "Of course big item can have multiple children."
    }
  ]
}

Using JS, finding tasks as well as valueless annotations would be easy:

const bigItem = marginRoot
  .children.find(item => item.value === 'Big Item')

const todoItems = bigItem
  .children.filter(item => item.annotations[0].key === ' ')

const doneItems = bigItem
  .children.filter(item => item.annotations[0].key === 'x')

const valuelessAnnotations = bigItem
  .annotations.filter(annotation => annotation.value === null)

const specificValuelessAnnotation = bigItem
  .annotations.find(annotation => annotation.key === 'this is just a note')
  // Optionally, also check that `annotation.value` is `null`

@mtsknn I like that idea actually, that would resolve a few things very neatly. So the resulting spec would be: there is only one kind of annotation, every annotation has a value, and some annotations may also have an "index" (as @gamburg called it). Thus, expressing each of these ideas:

Item 1 [a note]
Item 2 [tag: foo]
Item 3
  [tag: red]
  [tag: blue]

In JSON these annotations would come out as:

// Item 1 Annotations:
[{ "value": "a note" }]

// Item 2 Annotations:
[{ "index": "tag", "value": "foo" }]

// Item 3 Annotations
[{ "index": "tag", "value": "red" }, { "index": "tag", "value": "blue" }]

While that does mean there's a bit more work for something like "list all tags for this item" its not significant, and it resolves the rest of the ambiguities really nicely.

Also FWIW I think it would be nicer to stop calling these things indexes and start calling them keys just as @mtsknn suggested. Ie:

// An annotation
{ "key": "this is optional", "value": "this is required." }

🍻 Cheers for the good feedback @mtsknn!

Sounds very good, @burlesona! I think that keyless annotations (like you proposed) may make more sense than valueless annotations (I had it backwards), but I'll have to think about this a bit.

I also like how you have represented multiple annotations with the same key (item 3). Then it would be an application-level concern whether to include all values or just one (i.e. the last one, or why not even the first one in some cases). I agree that the bit more work would not be significant, e.g.:

const firstTag = item3.annotations
  .find(ann => ann.key === 'tag')
  .value

const allTags = item3.annotations
  .filter(ann => ann.key === 'tag')
  .map(ann => ann.value)

const lastTag = item3.annotations
  .filter(ann => ann.key === 'tag')
  .pop().value

Re: keys vs indexes: looks like @vlmutolo also talked about keys in his first comment, so kudos to him. :wink:

@mtsknn and others on this thread, I got far enough with my own implementation to have a working parser that can read Margin and write JSON. It's not baked enough for wide distribution yet, but it has tests and covers all the cases we've talked about. I went with the key, value structure for annotations we discussed above.

The code is here: https://github.com/burlesona/margin-rb

The divergences from Alex's implementation that I'm aware of ended up being as follows:

Every valid item must end in a newline, which means the document itself must end in a newline if there is an item on the last line.
Items can be one of two types, "item" or "task." To represent this I added a 'type' field on the Item's JSON representation. This is mostly informational and avoids users of the parsed data needing to do additional checks to figure out if an item is a task or not.
Items which are tasks have a boolean done field indicating whether they are done.
The value field on each item is cleaned of any leading or trailing decoration, including whitespace. Thus, the value field of a Task does not include the leading "checkbox" annotation.
Annotations are represented as objects with a required value and an optional key. In Alex's implementation the key and value of what he calls an "index" (an annotation with a colon in it) are not machine parsed.
If the value of an annotation is strictly numeric, it is parsed into a number, not a string.

Note that I realized when farther into implementing this that it doesn't really make sense for a task to be considered an annotation, rather a task is just an item with an extra done field.

To make this easy for consumers to work with I added the type field on items to indicate if it's a regular item or a task.

Happy to hear any feedback you all have. I've got reasonable test coverage now but will likely add tests for more cases soon, as well as a CLI.

Nice job, @burlesona! My Ruby is a bit rusty (pun intended), but the code looks good to me.

I can think of a couple of edge cases, though they are not necessarily specific to your implementation. I'm planning on opening an issue about them here (i.e. gamburg/margin) later as I think they should be thought through and documented. Hopefully there'll be only a few of them.

I'm not quite sold on the idea of having two types of items ("items" and "tasks") since it should be possible to use the [ ] and [x] annotations to make items be interpreted as tasks. 🤔

However, I do agree that annotation keys and values should be trimmed (item values are also trimmed as well as stripped from ornaments), but that causes the key/type (or the value, see below) of the [ ] annotation to be an empty string. That's a bit vague, but is it a problem? (If it is, e.g. [_] could be used as an alternative.)

Also, [], [•] and e.g. [•••] would be interpreted as the same annotation, but that I don't see as a significant problem.

(Edit: these two potential problems are related to https://github.com/gamburg/margin/issues/6#issuecomment-626810596. Saving leading annotations separately from other annotations (e.g. "item_prefix": "[ ]") would possibly solve both problems.)

Re: indexes vs keys:

Now that I read through Margin's documentation again, I noticed that @gamburg actually uses the term "type" instead of "index." There's also a section called "Indexes," but it says:

An index is any item that parents an annotation of type filter

I wonder where you picked the idea that annotation types are called indexes? Anyway, I think that either "type" or "key" is a good choice.

Re: should the type or the value of an annotation be optional:

This is actually already addressed in the documentation:

Any text up until the (optional) first colon in an annotation may safely be interpreted as the annotation type.

...

Annotations with types waiter and host:
Restaurant Staff
  Christine [waiter]
  Steven [waiter]
  Jessica [host]

So, an annotation always has a type, whereas the value is optional. I think this makes sense, at least in the provided example as well as in task annotations.

Re: the 2nd point in the very first comment of this issue, i.e. an annotation belonging to its child:

Items:
  - Item Y
        [I belong to Item Y]
  - [I belong to Item Z]
          Item Z

I agree with dropping this confusing feature. It even disagrees with another part of the documentation (emphasis added):

An annotation is any childless item wrapped in square brackets

Since the second last item in the given example is not childless, it can't be an annotation.

Similarly, this is also ambiguous:

Item X
  [Annotation A]
    Item Y
    Item Z

Does Annotation A belong to Item X, Y or Z?

Something like this even more difficult to interpret and parse:

Items:
    - Item X
        [Annotation A]
            [Annotation B]
                [Annotation C]
                    - Item Y
                        [Annotation D]

Does Annotation B belong to Item X or Y? How about Annotation A and C?

I agree with @burlesona that it would be better to interpret annotations that have children as valueless items (i.e. the value is an empty string) instead of annotations, i.e.:

Items:
    - Item X
        [Annotation A (actually a valueless item with this annotation)]
            [Annotation B (actually a valueless item with this annotation)]
                [Annotation C (actually a valueless item with this annotation)]
                    - Item Y
                        [Annotation D (belongs to Item Y)]

I would also specify that only the topmost children can be annotations, i.e. childless items wrapped in square brackets up until the first non-annotation child. For example:

Item X
  [Annotation A (belongs to Item X)]
  [Annotation B (belongs to Item X)]
  [Annotation C (belongs to Item X)]
  Item Y
  [Annotation D (actually a valueless item with this annotation)]
  Item Z

Otherwise it would be difficult to interpret whether Annotation D belongs to Item X, Y or Z.

Similarly:

Item X
  [Annotation A (belongs to Item X)]
  [Annotation B (belongs to Item X)]
  [Annotation C (actually a valueless item with this annotation because Item Y is a child)]
    Item Y
  [Annotation D (actually a valueless item with this annotation)]
  Item Z

On a related note, can annotations be ornamented? The documentation contains only one such example (the first code snippet of this comment). For example:

Item X
  - [foo]
  - [bar]

Are those annotations of Item X or valueless children of Item X? I'm asking because this isn't explicitly addressed in the documentation, and this might be relevant in #2 (see my comment).

Lots of excellent points in here!

So many, in fact, that I'm going to spin these off into separate issues 😝 If I miss any of the main points that were discussed here (I'm sure I will), feel free to reopen this issue – or, better yet, create a new issue.

Opened Issues

[x] How to treat "annotations with children" (per @burlesona's original comment)
[x] Store annotations as objects (per @mtsknn's comment)

Issues Raised that I Won't Open

But if you feel strongly about them, please feel free to open them yourself:

[ ] Disallow multiple values for a single key I agree with @mtsknn here. I don't think Margin should have an opinion on this.
[ ] Strip leading & trailing whitespace from annotation types & values (per @vlmutolo's comment) Personally I think this should be up to the application, and not done at the parser level. I could see the white space being relevant for certain types of annotations. And, where white space is irrelevant (most of the time), the application can choose to strip those on display.
[ ] Every valid item must end in a newline, which means the document itself must end in a newline if there is an item on the last line. (per @burlesona's parser) Not a problem if @burlesona's implementation requires it, but I hesitate to make this a requirement of valid Margin. My hope is that the parser itself can work this out (e.g. by adding an artificial newline before parsing, if there is none, and later discarding it). (Also, @burlesona I know you weren't necessarily suggesting we add this as a requirement of valid Margin, but I wanted to include every potential issue from this thread 😃).
[ ] The value field on each item is cleaned of any leading or trailing decoration, including whitespace. Thus, the value field of a Task does not include the leading "checkbox" annotation. (per @burlesona's parser) Again, no beef with this particular implementation, but I do want to keep a "task" as the suggested interpretation for an item that has annotation of type ` orx` – since I believe a lot of people already use plain text to-dos in this way – plus it fits neatly with the way annotations work.

gamburg / margin

Parser specification recommendations #11

Opened Issues

Issues Raised that I Won't Open