Closed schoettl closed 3 years ago
Hi @branch14 ,
I think we have to update your transformers.
The transformed parse result should be something like this, right?
4e4dc58d96f37a164c0281e3bfdf5929ee7479a4
[[:headline {:level 1, :title "hello world"}]
[:content
[:raw "this is the first section\n\n"]
[:parsed
[:content-line [:text [:text-normal "this is the first section"]]]
[:empty-line]]
[:headline {:level 2, :title "and this"}]
[:content
[:raw "\nis another section\n"]
[:parsed
[:empty-line]
[:content-line [:text [:text-normal "is another section"]]]]
https://github.com/schoettl/org-parser/blob/feat/parse-text/test/org_parser/transform_test.cljc#L40
Of course, the elements inside :parsed
can and should also be transformed later. But the most important change IMO is to make the basic transformation having raw and parsed description. And maybe also make the :content
part of the :headline
as content always belongs to a headline?
I'm yet lost in the transformer code. So if you have time to update the transformer a little bit, it would be appreciated!
PS: How can I embed a line of code in comments like @munen did it here?
PPS: Regarding :raw
: The original string can be extracted as documented here.
Note to self: rebase after #29 is merged.
@schoettl Yes, the transforms currently in place are just a humble beginning. They need to be extended!
I like the idea to have :raw
as well as the full AST in the final result. And I would prefer calling it :ast
over :parsed
.
:content
and :headline
should actually both be part of a :section
, as the headline introduces a new section in org lingo.
Using the metadata to extract the raw string is a nifty hack! I'll give it a try.
Do you think we'll need all tags in the AST of text
, or can we use Instaparse's "Hide tag" feature for some early thinning?
Somewhat related: It seems to me it might make sense to refactor the tests (and transformations) to be organized by features. That way we can later have (1) the input org, (2) the resulting AST, (3) the transformed result, and (4) the output org closer together, which will greatly reduce cognitive load while working on a feature. What do you think?
I like the idea to have
:raw
as well as the full AST in the final result. And I would prefer calling it:ast
over:parsed
.
I'm good with :ast
, too. And that node would hold the parsed syntax tree nodes which will, step by step, be replaced with the transformed syntax tree nodes. Not just the raw parsed syntax tree, forever. Are we on the same page here?
:content
and:headline
should actually both be part of a:section
, as the headline introduces a new section in org lingo.
I'm not sure about this naming. In #31 , I quote the org spec which seems to distinguish between headline and section (section beeing the content below the headline).
Do you think we'll need all tags in the AST of
text
, or can we use Instaparse's "Hide tag" feature for some early thinning?
You mean the <hidden-tag>
syntax of instaparse? I would currently not change the EBNF in that way. Because for later transformations, e.g. of transformations of timestamps, we really need most of them to reason about the all timestamp properties...
Somewhat related: It seems to me it might make sense to refactor the tests (and transformations) to be organized by features. That way we can later have (1) the input org, (2) the resulting AST, (3) the transformed result, and (4) the output org closer together, which will greatly reduce cognitive load while working on a feature. What do you think?
I think it makes sense to sort tests by org feature. But I'm not sure about putting low-level parser tests and the tests of transformed AST in the same place. Because for some org features I have dozens of parser tests and edge cases. And many of them may not be relevant for the tests of transformations.
But having tests for parsing+transformation and org code generation close together (with sample input org to compare with generated org) is probably a good thing!
I'm good with
:ast
, too. And that node would hold the parsed syntax tree nodes which will, step by step, be replaced with the transformed syntax tree nodes. Not just the raw parsed syntax tree, forever. Are we on the same page here?
Absolutely.
I'm not sure about this naming. In #31 , I quote the org spec which seems to distinguish between headline and section (section beeing the content below the headline).
Then I have to study the spec again. I thought something like this would be idiomatic
[:section
{:level 1
:title "hello world"
:content
{:raw ["this is the first section" "this line has *bold text*"]
:ast [[:text [:text-normal "this is the first section"]]
[:text
[:text-normal "this line has "]
[:text-styled
[:text-sty-bold [:text-inside-sty-normal "bold text"]]]]]}}]
I like to stick to the wording of the spec where possible, but I also won't to provide a meaningful and easy to work with data structure. And I will likely not want to sacrifice the later for the former.
We could even drop the whole :section
and just provide a vector of sections, right? Apart from content before the first section, everything in org is in a section by definition.
Do you think we'll need all tags in the AST of
text
, or can we use Instaparse's "Hide tag" feature for some early thinning?You mean the
<hidden-tag>
syntax of instaparse? I would currently not change the EBNF in that way. Because for later transformations, e.g. of transformations of timestamps, we really need most of them to reason about the all timestamp properties...
Yes <hidden-tag>
. My question was specifically targeted towards :text
. I think [:text-sty-bold "bold text"]
would do, instead of
[:text-sty-bold [:text-inside-sty-normal "bold text"]]
. Or not?
I think it makes sense to sort tests by org feature. But I'm not sure about putting low-level parser tests and the tests of transformed AST in the same place. Because for some org features I have dozens of parser tests and edge cases. And many of them may not be relevant for the tests of transformations.
I concur. I'm sure we'll find a structure that makes sense here.
I'm working on this. And I'll provide a PR. But I'll need some time.
I like your proposed structure. But how can we represent the first section without headline? Omitting the :title
tag and setting :level
to 0? Hm…
Regarding the hidden tags. Yeah, your example makes sense. Originally, I planned to recursively parse styled markup (i.e. markup can contain markup, timestamps, and links), but that will probably not work. Currently, only very simple markup will be parsed, so it's no problem to hide that tag.
Great! Looking forward to your PR. Did you fork or extend this PR?
I like your proposed structure. But how can we represent the first section without headline? Omitting the
:title
tag and setting:level
to 0? Hm…
The list of sections will anyway be embedded in some buffer context, to store In-Buffer Settings. I'd expect it to look somewhat like this
{:settings ... ;; in-buffer settings
:preamble ... ;; content before first headline
:sections [section-0 section-1 ...]} ;; flat list of headlines & content
and sections will be just a map
{:title "some title"
:level 1
:content {:raw ...
:ast ...}
...}
we could even drop :content
, and just have
{:title "some title"
:level 1
:raw ...
:ast ...
...}
Great! Looking forward to your PR. Did you fork or extend this PR?
I pulled your branch and I'm working on that. My PR will merge into yours, which we'll merge to master once its feature complete. Hope that works for you.
and sections will be just a map
{:title "some title" :level 1 :content {:raw ... :ast ...} ...}
we could even drop
:content
, and just have{:title "some title" :level 1 :raw ... :ast ... ...}
I agree. But I would keep the :content
map to make more explicit what belongs to headline and what to content.
For example, we later also need :raw-title
and :tags
, :priority
, … that all belongs more to the headline than to the content. So I would keep content in a separate map.
Great! Looking forward to your PR. Did you fork or extend this PR?
I pulled your branch and I'm working on that. My PR will merge into yours, which we'll merge to master once its feature complete. Hope that works for you.
:+1:
I agree. But I would keep the
:content
map to make more explicit what belongs to headline and what to content.For example, we later also need
:raw-title
and:tags
,:priority
, … that all belongs more to the headline than to the content. So I would keep content in a separate map.
Then it should probably be
:sections
[{:headline {:title ...
:level ...
...}
:content {:raw ...
:ast ...
...}}
...
Yes. It's a bit verbose but I think it's the cleanest solution! It will help to quickly understand the structure.
I remember that Organice's parse structure was a bit confusing for me because there is no such separation.
@schoettl branch14 and me just worked ok a PR as discussed here. The PR refactors transform
, uses raw
, introduces integration testing and implements the suggestion from the question "How should the final transformed parse structure look like?" from https://github.com/200ok-ch/org-parser/issues/31.
I just saw that you're using a fork of org-parser at schoettl/org-parser. Hence, the PR is https://github.com/schoettl/org-parser/pull/1
Do you have a reason for using a fork? I presume it's because you still use the same repository from when you were not an org-parser maintainer(; If it works for you, I'd suggest to use the upstream repository directly. This will make at least these things easier that are now a bit complicated:
@munen Oh, I wasn't aware of that I branched or pushed branches still to my fork repo. At least my local master has already this repo as upstream. Probably it's best, I remove my forked repo so that I cannot accidentally push to it anymore.
Can you just push the branch you worked on to 200ok-ch/org-parser?
Or should I merge your PR at schoettl#1 into this, now?
Or should I merge your PR at schoettl#1 into this, now?
I think the easiest is to open a new PR here, which will include #27 (which would otherwise merge https://github.com/schoettl/org-parser/tree/feat/parse-text). I'll do that now. Closing this PR in favour of #36
The goal of this PR is to enable the EBNF parser to parse details in the content after a headline, e.g. links, timestamps, …
Closes #26 Closes #30
:raw
along with:ast