PoiScript / orgize

A Rust library for parsing org-mode files.
https://poiscript.github.io/orgize/
MIT License
277 stars 34 forks source link

discuss: a new specifications #36

Open PoiScript opened 4 years ago

PoiScript commented 4 years ago

A correct implementation requires a precise specifications. However, neither org syntax draft nor org elements api can really serves as a good specifications: Org syntax doesn't specify syntax unambiguously, meanwhile, org-elements-api is quite buggy and provides some apparent wrong results in some cases (e.g. #22, #19 and #14).

To resolve that, I think the ultimate solution is to maintain a specification by our own. To be clear, it's not going to create a subset of org markup language but just describes the expected results you will get when using orgize.

Fortunately, we don't need to start from scratch. We can just make a copy of the original org syntax and adapt it for our needs (like defining required and optional fields). Then, I would like to borrow some concepts from CommonMark Spec, especially the part of handling whitespace and list indenting. In short, our new specification will be majorly based on org syntax and with some modifications and additions.

Feel free to leave any comments below if you have any idea about the new upcoming specification =).


As a reminder, here're some issues/commits that need to be reopened/reverted after applying the new specification.

17: should be reopened. Org syntax clarified that title should be matched after other part have been matched.

ba9c83c: should reverted. We should only handle ascii whitespace. I think it is totally unnecessary to take care of unicode whitespace if we don't need to be compatible with org-elements-api.

26 #34: should be closed and moved to a new issue. Org syntax specifies that headlines' stars must be followed by at least one space character. But I think tab is also acceptable (just like CommonMark does and I will include it in the new specifications latter). Hence, only ***, *** \n or ***\t\t should be valid, but not ***, ****\r or ***\n.

33: should be closed.

PoiScript commented 4 years ago

I had created a SYNTAX.md at docs/. Sorry for not using org-mode, but github can't render org-mode file properly :(.

calmofthestorm commented 4 years ago

I like this idea. Using org-elements as a guide is good, but it has too many bugs to be a spec. I haven't looked into it, but maybe the org syntax draft at orgmode.org is open to contributions? If so, and they had a common understanding of what the source of truth is (likely org-mode itself), perhaps we could upstream some/all of these edge cases.

PoiScript commented 4 years ago

I haven't looked into it, but maybe the org syntax draft at orgmode.org is open to contributions?

Neither have I. But the original version of org syntax actually contains some implementation information about elisp, which is irrelevant to orgize. So I wish to have a new specification without anything related to eslip.

Regarding whitespace, org-mode itself seems to use different definitions in different places.

Yes, this is the problem that the new specification is going to resolve. Basically, I want to make clear definitions of space, tab, line ending, line and blank line which can be used throughout the whole specification. Here is an initial version of definitions (majorly stole from CommonMark):

A space is U+0020.

A tab is U+0009, it's treated as 4 spaces when it's used to define block structure.

A line is a sequence of zero or more characters other than newline (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file.

A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.

A blank line is a line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009).

org-element and org-mode both reject *\t as a headline. The spec says "a space character", one of only two places it says that instead of "whitespace character". I think we should accept only a space character there.

I don't think it makes sense for a parser to be that permissive, but I do think that it is reasonable to accept * [#A]Hello World, since both org-element and org-mode do.

I can see that you're trying to behave the same way as org-mode or org-element. But I don't think it's a good idea to introduce too much special cases. Just like the definitions I mentioned above, if a headlines' stars should be followed by a space, I would expect it also accepts a tab. Same with todo keyword, priority and tags. After all, I believe the specification should be able to enforce the consistent behave of parsing without bringing too much special cases.

calmofthestorm commented 4 years ago

That seems like a perfectly reasonable way to define it, thank you for the explanation. I care more about clear, consistent, documented behavior than what it is, and this specification will deliver that.