200ok-ch / org-parser

org-parser is a parser for the Org mode markup language for Emacs.
GNU Affero General Public License v3.0
316 stars 15 forks source link

[WIP] Parse content-line as text instead of just .* #27

Closed schoettl closed 3 years ago

schoettl commented 3 years ago

The goal of this PR is to enable the EBNF parser to parse details in the content after a headline, e.g. links, timestamps, …

Closes #26 Closes #30

schoettl commented 3 years ago

Hi @branch14 ,

I think we have to update your transformers.

The transformed parse result should be something like this, right?

4e4dc58d96f37a164c0281e3bfdf5929ee7479a4

  [[:headline {:level 1, :title "hello world"}]
   [:content
     [:raw "this is the first section\n\n"]
     [:parsed
       [:content-line [:text [:text-normal "this is the first section"]]]
       [:empty-line]]
   [:headline {:level 2, :title "and this"}]
   [:content
    [:raw "\nis another section\n"]
    [:parsed
      [:empty-line]
      [:content-line [:text [:text-normal "is another section"]]]]

https://github.com/schoettl/org-parser/blob/feat/parse-text/test/org_parser/transform_test.cljc#L40

Of course, the elements inside :parsed can and should also be transformed later. But the most important change IMO is to make the basic transformation having raw and parsed description. And maybe also make the :content part of the :headline as content always belongs to a headline?

I'm yet lost in the transformer code. So if you have time to update the transformer a little bit, it would be appreciated!

PS: How can I embed a line of code in comments like @munen did it here?

PPS: Regarding :raw: The original string can be extracted as documented here.

schoettl commented 3 years ago

Note to self: rebase after #29 is merged.

branch14 commented 3 years ago

@schoettl Yes, the transforms currently in place are just a humble beginning. They need to be extended!

I like the idea to have :raw as well as the full AST in the final result. And I would prefer calling it :ast over :parsed.

:content and :headline should actually both be part of a :section, as the headline introduces a new section in org lingo.

Using the metadata to extract the raw string is a nifty hack! I'll give it a try.

Do you think we'll need all tags in the AST of text, or can we use Instaparse's "Hide tag" feature for some early thinning?

Somewhat related: It seems to me it might make sense to refactor the tests (and transformations) to be organized by features. That way we can later have (1) the input org, (2) the resulting AST, (3) the transformed result, and (4) the output org closer together, which will greatly reduce cognitive load while working on a feature. What do you think?

schoettl commented 3 years ago

I like the idea to have :raw as well as the full AST in the final result. And I would prefer calling it :ast over :parsed.

I'm good with :ast, too. And that node would hold the parsed syntax tree nodes which will, step by step, be replaced with the transformed syntax tree nodes. Not just the raw parsed syntax tree, forever. Are we on the same page here?

:content and :headline should actually both be part of a :section, as the headline introduces a new section in org lingo.

I'm not sure about this naming. In #31 , I quote the org spec which seems to distinguish between headline and section (section beeing the content below the headline).

Do you think we'll need all tags in the AST of text, or can we use Instaparse's "Hide tag" feature for some early thinning?

You mean the <hidden-tag> syntax of instaparse? I would currently not change the EBNF in that way. Because for later transformations, e.g. of transformations of timestamps, we really need most of them to reason about the all timestamp properties...

Somewhat related: It seems to me it might make sense to refactor the tests (and transformations) to be organized by features. That way we can later have (1) the input org, (2) the resulting AST, (3) the transformed result, and (4) the output org closer together, which will greatly reduce cognitive load while working on a feature. What do you think?

I think it makes sense to sort tests by org feature. But I'm not sure about putting low-level parser tests and the tests of transformed AST in the same place. Because for some org features I have dozens of parser tests and edge cases. And many of them may not be relevant for the tests of transformations.

But having tests for parsing+transformation and org code generation close together (with sample input org to compare with generated org) is probably a good thing!

branch14 commented 3 years ago

I'm good with :ast, too. And that node would hold the parsed syntax tree nodes which will, step by step, be replaced with the transformed syntax tree nodes. Not just the raw parsed syntax tree, forever. Are we on the same page here?

Absolutely.

I'm not sure about this naming. In #31 , I quote the org spec which seems to distinguish between headline and section (section beeing the content below the headline).

Then I have to study the spec again. I thought something like this would be idiomatic

[:section
      {:level 1
       :title "hello world"
       :content
       {:raw ["this is the first section" "this line has *bold text*"]
        :ast [[:text [:text-normal "this is the first section"]]
              [:text
               [:text-normal "this line has "]
               [:text-styled
                [:text-sty-bold [:text-inside-sty-normal "bold text"]]]]]}}]

I like to stick to the wording of the spec where possible, but I also won't to provide a meaningful and easy to work with data structure. And I will likely not want to sacrifice the later for the former.

We could even drop the whole :section and just provide a vector of sections, right? Apart from content before the first section, everything in org is in a section by definition.

Do you think we'll need all tags in the AST of text, or can we use Instaparse's "Hide tag" feature for some early thinning?

You mean the <hidden-tag> syntax of instaparse? I would currently not change the EBNF in that way. Because for later transformations, e.g. of transformations of timestamps, we really need most of them to reason about the all timestamp properties...

Yes <hidden-tag>. My question was specifically targeted towards :text. I think [:text-sty-bold "bold text"] would do, instead of [:text-sty-bold [:text-inside-sty-normal "bold text"]]. Or not?

I think it makes sense to sort tests by org feature. But I'm not sure about putting low-level parser tests and the tests of transformed AST in the same place. Because for some org features I have dozens of parser tests and edge cases. And many of them may not be relevant for the tests of transformations.

I concur. I'm sure we'll find a structure that makes sense here.

I'm working on this. And I'll provide a PR. But I'll need some time.

schoettl commented 3 years ago

I like your proposed structure. But how can we represent the first section without headline? Omitting the :title tag and setting :level to 0? Hm…

Regarding the hidden tags. Yeah, your example makes sense. Originally, I planned to recursively parse styled markup (i.e. markup can contain markup, timestamps, and links), but that will probably not work. Currently, only very simple markup will be parsed, so it's no problem to hide that tag.

Great! Looking forward to your PR. Did you fork or extend this PR?

branch14 commented 3 years ago

I like your proposed structure. But how can we represent the first section without headline? Omitting the :title tag and setting :level to 0? Hm…

The list of sections will anyway be embedded in some buffer context, to store In-Buffer Settings. I'd expect it to look somewhat like this

{:settings ... ;; in-buffer settings
 :preamble ... ;; content before first headline
 :sections [section-0 section-1 ...]} ;; flat list of headlines & content

and sections will be just a map

{:title "some title"
 :level 1
 :content {:raw ...
           :ast ...}
 ...}

we could even drop :content, and just have

{:title "some title"
 :level 1
 :raw ...
 :ast ...
 ...}

Great! Looking forward to your PR. Did you fork or extend this PR?

I pulled your branch and I'm working on that. My PR will merge into yours, which we'll merge to master once its feature complete. Hope that works for you.

schoettl commented 3 years ago

and sections will be just a map

{:title "some title"
 :level 1
 :content {:raw ...
           :ast ...}
 ...}

we could even drop :content, and just have

{:title "some title"
 :level 1
 :raw ...
 :ast ...
 ...}

I agree. But I would keep the :content map to make more explicit what belongs to headline and what to content.

For example, we later also need :raw-title and :tags, :priority, … that all belongs more to the headline than to the content. So I would keep content in a separate map.

Great! Looking forward to your PR. Did you fork or extend this PR?

I pulled your branch and I'm working on that. My PR will merge into yours, which we'll merge to master once its feature complete. Hope that works for you.

:+1:

branch14 commented 3 years ago

I agree. But I would keep the :content map to make more explicit what belongs to headline and what to content.

For example, we later also need :raw-title and :tags, :priority, … that all belongs more to the headline than to the content. So I would keep content in a separate map.

Then it should probably be

:sections
[{:headline {:title ...
             :level ...
             ...}
  :content {:raw ...
            :ast ...
            ...}}
 ...
schoettl commented 3 years ago

Yes. It's a bit verbose but I think it's the cleanest solution! It will help to quickly understand the structure.

I remember that Organice's parse structure was a bit confusing for me because there is no such separation.

munen commented 3 years ago

@schoettl branch14 and me just worked ok a PR as discussed here. The PR refactors transform, uses raw, introduces integration testing and implements the suggestion from the question "How should the final transformed parse structure look like?" from https://github.com/200ok-ch/org-parser/issues/31.

I just saw that you're using a fork of org-parser at schoettl/org-parser. Hence, the PR is https://github.com/schoettl/org-parser/pull/1

Do you have a reason for using a fork? I presume it's because you still use the same repository from when you were not an org-parser maintainer(; If it works for you, I'd suggest to use the upstream repository directly. This will make at least these things easier that are now a bit complicated:

schoettl commented 3 years ago

@munen Oh, I wasn't aware of that I branched or pushed branches still to my fork repo. At least my local master has already this repo as upstream. Probably it's best, I remove my forked repo so that I cannot accidentally push to it anymore.

Can you just push the branch you worked on to 200ok-ch/org-parser?

schoettl commented 3 years ago

Or should I merge your PR at schoettl#1 into this, now?

branch14 commented 3 years ago

Or should I merge your PR at schoettl#1 into this, now?

I think the easiest is to open a new PR here, which will include #27 (which would otherwise merge https://github.com/schoettl/org-parser/tree/feat/parse-text). I'll do that now. Closing this PR in favour of #36