Open schoettl opened 4 years ago
@branch14 Do you have any input or ideas on how to tackle this?
@schoettl provided more links on the issue in his last PR, too: https://github.com/200ok-ch/org-parser/pull/9
@munen, @schoettl I acknowledge that Org-mode might have syntactic elements that cannot be properly parsed by EBNF/PEG. While the project's goal is to have as much of the Org-mode syntax as possible formalized in an EBNF/PEG grammar, we need an alternative (more pragmatic) approach to provide a full-featured parser.
Some of the issues mentioned here can be implemented in the grammar, while others might need to be deferred to the transformation step mentioned in #9. E.g. multi-line text styles cannot be tackled with a line-based parser, but can easily be dealt with using regexes in a transformation.
Here I laid out what the code for the transformation could look like: #15
For parsing multi-line styles a 2nd transformation step would be needed. (Not part of this PR.)
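A minimal sketch of why multi-line styles need joined paragraph text, written in Python rather than the project's Clojure purely for illustration. The `BOLD` regex is a simplified stand-in, not Org's real emphasis regex, and the helper names are hypothetical:

```python
import re

# Simplified stand-in for an emphasis rule (NOT Org's real regex):
# '*' preceded by whitespace, '(' or line start; body with non-space edges;
# closing '*' followed by whitespace, punctuation or end of text.
BOLD = re.compile(r"(?<![^\s(])\*(\S(?:.*?\S)?)\*(?![^\s.,;:!?)])", re.DOTALL)

def find_bold_per_line(lines):
    """What a purely line-based pass can see: each line in isolation."""
    return [m.group(1) for line in lines for m in BOLD.finditer(line)]

def find_bold_in_paragraph(lines):
    """Transformation-step variant: join the paragraph's lines first."""
    text = "\n".join(lines)
    return [m.group(1) for m in BOLD.finditer(text)]

paragraph = ["This is *bold text", "spanning two lines* here."]
print(find_bold_per_line(paragraph))
print(find_bold_in_paragraph(paragraph))
```

The per-line pass finds nothing because the opening and closing markers sit on different lines; the joined pass sees both.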
Even single-line styles have severe problems in EBNF. I want to check whether it is reasonable to put all styles (emphasis and verbatim, as they are called in the spec) into the transformations. An advantage of this approach is that we can reuse the logic and regexes from orgmode.
An advantage of this approach is that we can reuse the logic and regexes from orgmode.
:+1:
That could actually prove to be a major benefit.
As long as Emacs is not using org-parser (cough), it could prove very beneficial to keep some complicated parts close to the Elisp codebase.
Note to self:
1. Check whether it's OK or "allowed" for emphasis to span other elements that are already parsed via EBNF. E.g. *this [[url]] is bold and _O_2 is also included in the underlined_ text*
2. If 1. is good, there is one difficulty: when we apply the emphasis regexes to a parsed line, we need a string as input, not a parse tree containing links or otherwise parsed elements. ~To get the string input, AFAIK we have to "export" the parse tree to get the original string, similar to what is done in organice's org_export.js.~
It would be great if instaparse had a way to get the original, unparsed text along with the parse tree, ~but I don't think it has.~ EDIT: It has: the `meta` and `span` functions.
Regarding 2.: instaparse has a built-in way to get position/location meta information from the parse tree! Even if the parse tree looks like it only holds the parsed data, the `meta` and `span` functions return this information: https://cljdoc.org/d/instaparse/instaparse/1.4.9/doc/readme#character-spans
So, if we have the original input text, it's no problem to apply emphasis regexes on the original line. We do have all position information about elements parsed via EBNF.
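A rough sketch of that idea, in Python for illustration only. It assumes the spans of already-parsed elements are available as hypothetical `(tag, start, end)` tuples with an exclusive end index; the simplified `EMPHASIS` regex is an assumption and not Org's real one:

```python
import re

# Simplified emphasis regex (an assumption, not Org's real one):
# marker after whitespace or line start, body with non-space edges,
# the same marker again, not followed by a word character.
EMPHASIS = re.compile(r"(?<!\S)([*/_+])(\S(?:.*?\S)?)\1(?!\w)")

def emphasis_with_children(line, parsed_spans):
    """Run the regex on the original line, then attach every already-parsed
    element whose character span lies inside an emphasis match."""
    results = []
    for m in EMPHASIS.finditer(line):
        start, end = m.span()
        inner = [(tag, s, e) for tag, s, e in parsed_spans
                 if start <= s and e <= end]
        results.append({"marker": m.group(1), "span": (start, end),
                        "text": m.group(2), "contains": inner})
    return results

line = "See *this [[url]] is bold* now"
link_start = line.index("[[url]]")
spans = [("link", link_start, link_start + len("[[url]]"))]
print(emphasis_with_children(line, spans))
```

Because the regex runs on the untouched input string, nothing has to be "exported" back from the parse tree; the spans only serve to re-attach the parsed elements afterwards.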
It's been 4 years, and this issue seems to be the last unfinished todo item in /org-parser/README.org. What is its current state?
`org-parser` and the official `org` parser?
`interparse` a bit more?

The project hasn't been very active since then and no one has worked on this specific problem. I guess it's not the only gap. The checklist in the README may miss some less common org features. There is also a lot of room for improvement on the transformation side (the step of converting the instaparse parse result into a more meaningful data structure), e.g. joining the lines of a paragraph, which would be a requirement for parsing style markup in the transformation step.
I'm personally more concerned about #56 – that's why I don't invest much time.
Problems with the ungrateful and recursive nature of emphasis markup `[/*_+]` are documented in #9. A summary, translated from German:

One syntax element in Org mode is similar to Markdown:
This is /italic/ text.
(* = bold, _ = underlined, + = strikethrough, ...)
The task now is to parse this text.
The difficulties are the following:
Org mode itself apparently solves this differently during export: not with a BNF parser but with program code and, in particular, a regex that matches not only the italic text but also the character before and after it.
Here, however, that does not work so easily. First, I cannot specify a regex for the symbol text-kursiv (because of the recursion). Second, text-kursiv cannot know whether it is preceded by a space or not. Look-ahead is supported by the BNF variant in use, but look-behind is not. And finally, the regex for text-normal turns out to be difficult, because it has to stop at exactly the right place: sometimes after a space, when a / follows; sometimes without any additional criterion, when a [ follows (link or footnote).
See also `org-emph-re` in Emacs.
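To illustrate the "regex that also looks at the character before and after" idea, here is a hedged Python sketch. It is only loosely modeled on the approach described above; the `PRE` and `POST` character sets are assumptions for illustration and are not copied from Emacs:

```python
import re

# Assumed character sets (illustrative only, not taken from Emacs):
PRE = r"""[-\s('"{]"""          # allowed directly before the opening marker
POST = r"""[-\s.,:!?;'")}\[]"""  # allowed directly after the closing marker
MARKERS = r"[*/_+]"

EMPH_RE = re.compile(
    rf"(?:^|({PRE}))"                 # group 1: character before, or line start
    rf"(({MARKERS})(\S|\S.*?\S)\3)"   # group 3: marker, group 4: body
    rf"(?=({POST})|$)"                # group 5: character after (not consumed)
)

def find_emphasis(line):
    """Return (marker, body) pairs for every emphasis match in the line."""
    return [(m.group(3), m.group(4)) for m in EMPH_RE.finditer(line)]

print(find_emphasis("This is /italic/ text."))
print(find_emphasis("a path like usr/local/bin stays plain"))
```

The character after the closing marker is checked with a look-ahead instead of being consumed, so that two adjacent emphasized words such as `/a/ /b/` can both match. The character before is consumed, which is exactly the context a grammar symbol like text-kursiv does not have.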