200ok-ch / org-parser

org-parser is a parser for the Org mode markup language for Emacs.
GNU Affero General Public License v3.0

Idea for collaboration #59

Open gitonthescene opened 3 years ago

gitonthescene commented 3 years ago

Hello, I found your project from the worg tools list. First, sorry for the semi-spam nature of this issue. I had a notion for a project that the org community might find useful and I'm looking for feedback. Feel free to close this issue if it doesn't sound useful to you.

My idea is to start a list of org-mode snippets which can serve as a test bed for people developing tools. The idea is that having a separate collection of repositories makes it easier for others in the community to benefit from the examples developed through communication with users.

Users could use these samples to try to construct minimal examples of issues they're having and/or contribute examples there which others could benefit from. Exactly how it will take shape is still up in the air.

These samples could also serve as a place to discuss ideas about how to develop the grammar itself. According to worg, the spec is still in draft state.

There's not much there at the moment. Mostly because I don't want to commit too early to what seems like it might be useful. I'll add more examples as I go.

If you like the concept and/or want to contribute and/or just want to offer feedback, I'd very much appreciate it.

Again, sorry for the spam.

schoettl commented 3 years ago

Hi and thanks for opening this discussion,

The idea of a common base of test Org files for parsers already came up a while ago somewhere, but I don't remember where.

However, for parsers, I don't see real value in your repo so far, because automated tests need both input and expected output. The problem is that the output format differs between parsers. It depends not only on naming (e.g. symbol names in the EBNF) but also on technical factors like the choice of programming language and the type of parser.

Regarding your sample Org files: it's an uncommon but maybe good idea to have multiple samples in one file, separated by a NUL byte. I think it is better to have everything visible in one file than to have thousands of files with a naming scheme like heading-*.org. I'm not sure, however, how well editors (or users) cope with NUL bytes.
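For illustration, here is a minimal Python sketch of reading such a combined file. The function name `split_samples` and the sample content are made up for this example and are not taken from the repo. It assumes only that samples are separated by a single NUL byte, and it drops empty chunks such as the one left by a trailing NUL.

```python
def split_samples(data: bytes) -> list[str]:
    """Split a combined sample file into individual Org snippets on NUL bytes."""
    chunks = data.split(b"\x00")
    # Drop empty chunks (e.g. after a trailing NUL) and decode the rest.
    # Whitespace inside a sample is kept as-is, since it can matter in Org.
    return [chunk.decode("utf-8") for chunk in chunks if chunk.strip()]


# Hypothetical combined file content with two samples.
combined = b"* Heading one\nSome text\n\x00* TODO Heading two\n\x00"
samples = split_samples(combined)
```

Reading the file in binary mode (for example with `pathlib.Path.read_bytes`) keeps the NUL bytes intact regardless of how a text editor displays them.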

gitonthescene commented 3 years ago

Thanks for getting back. You’re exactly right that the value is in being able to validate the output for the input and that different parsers will produce different outputs.

But that highlights something else. We’re not working off of a common EBNF. As far as I know there isn’t one. I was hoping to use a common set of test cases to try to suss out how the different parsers behave differently and ideally work towards a consensus.

It’s a bit pie in the sky, but I thought I’d at least take the first steps.

As far as having tests all in the same file, my thinking was that real tests need to test separate files because some parts of the grammar should only exist once in a file or are affected by what other parts exist in the file. Editing lots of little files is a pain though, which is why I edit just the single file but generate lots of test cases from that single file. It also makes debugging a single feature a lot easier.
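A rough sketch of that "edit one file, generate many test cases" workflow, in Python. Everything here is an assumption for illustration: the NUL separator, the helper name `write_cases`, and the `case-NNN.org` naming are hypothetical and not necessarily what the actual repo does.

```python
import tempfile
from pathlib import Path


def write_cases(combined: bytes, out_dir: Path, prefix: str = "case") -> list[Path]:
    """Write each NUL-separated sample to its own .org file and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # Skip empty chunks so a trailing separator does not create an empty file.
    samples = [s for s in combined.split(b"\x00") if s.strip()]
    for index, sample in enumerate(samples, start=1):
        path = out_dir / f"{prefix}-{index:03d}.org"
        path.write_bytes(sample)
        paths.append(path)
    return paths


# Hypothetical combined source: each sample carries its own file-level keyword.
combined = b"#+TITLE: First\n* A\n\x00#+TITLE: Second\n* B\n"
case_dir = Path(tempfile.mkdtemp())
case_paths = write_cases(combined, case_dir)
```

Each generated file can then be fed to a parser on its own, so file-level constructs in one sample cannot affect another.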

In any event, I fully admit that my ideas are still a little vague. I’m looking at writing a PEG grammar for org-mode and the exercise reveals that the spec I linked above isn’t very precise. But that just opens the question of how to deal with that imprecision. I thought having examples to point at would help that discussion.

Again, thanks for getting back. If my ideas solidify any more I can let you know if you’re interested. Or if you have any more thoughts on this feel free to open an issue. I’ll close this issue for now.

Thanks again

schoettl commented 3 years ago

Alright, feel free to post here again and re-open the issue, I'm interested in news :)

I also have little experience with PEG. It combines the EBNF with additional "logic" (which we have separated in a transform step).

I'd like to point you to tree-sitter ;) it's also JavaScript and similar to PEG.js. For me, tree-sitter looks really promising and just yesterday we had a discussion in the #organice matrix channel about it and how we could use org-parser's EBNF to generate the tree-sitter code. There are already at least two attempts to implement orgmode parsing with tree-sitter.

gitonthescene commented 3 years ago

FWIW, here is my thumbnail sketch of a grammar. It needs a lot of work.

org-mode-grammar.txt

I didn’t know about this EBNF either. Thanks for the pointer.

nightscape commented 3 years ago

@schoettl this might serve as a starting point. It's an EBNF-to-tree-sitter tool: https://github.com/returntocorp/tree-sitter-scala-r2c/blob/master/script/parse_grammar.lua

schoettl commented 3 years ago

Wow, interesting, thanks for that link, @nightscape ! I hope I find some time to try that.

gitonthescene commented 3 years ago

FWIW, I've added a simple little script to dump the structure of an org-document as emacs sees it. I used dash installed as a package, so depending on your emacs setup you might have to tweak it a bit.

That is to say, it shows one version of output. This can become a baseline of sorts for what other output looks like. Normalizing to a standard form obviously depends on the particulars of what's being compared.
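As a small sketch of what such normalizing could look like when two parsers differ only in node naming: the trees, node names, and renaming tables below are all invented for this example and do not reflect the real output of Emacs, org-parser, or any other tool.

```python
# Hypothetical outputs from two parsers for the same input "* Hello".
# Trees are (node_name, child, ...) tuples; leaves are plain strings.
tree_a = ("headline", ("stars", "*"), ("title", "Hello"))
tree_b = ("Heading", ("Stars", "*"), ("Text", "Hello"))

# Hypothetical per-parser tables mapping node names onto a shared vocabulary.
RENAME_A = {"headline": "heading", "stars": "stars", "title": "title"}
RENAME_B = {"Heading": "heading", "Stars": "stars", "Text": "title"}


def normalize(tree, rename):
    """Recursively rename node labels; string leaves are returned unchanged."""
    if isinstance(tree, str):
        return tree
    name, *children = tree
    # Unknown node names pass through untouched so mismatches stay visible.
    return (rename.get(name, name), *(normalize(child, rename) for child in children))


same = normalize(tree_a, RENAME_A) == normalize(tree_b, RENAME_B)
```

Real parsers will also differ in tree shape, not just in names, so a renaming table like this only covers the simplest case.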

Just thought I'd update because it addresses one of your points.

gitonthescene commented 3 years ago

FWIW there is another effort to create an EBNF mentioned here.

I also have put forth a suggestion which might smooth over the adoption of an “official grammar”.