200ok-ch / org-parser

org-parser is a parser for the Org mode markup language for Emacs.
GNU Affero General Public License v3.0

Problems with parsing emphasis/style markup #12

Open schoettl opened 4 years ago

schoettl commented 4 years ago

Problems with the ungrateful and recursive nature of emphasis markup [/*_+] are documented in #9, specifically

A summary (translated from German):

One syntax element in Org mode is similar to Markdown:

This is /italic/ text.

(* = bold, _ = underlined, + = strikethrough, ...)

The task now is to parse this text, with a grammar along these lines (kursiv = italic):

text := { text-kursiv | text-normal }
text-kursiv := '/' text '/'
text-normal := regex|.[^/]*|

The difficulties are these:

  1. An /italic/ text must be preceded and followed by a space or punctuation.
  2. The /italic text/ itself must not begin or end with a space.
  3. The delimiter may also occur inside the /italic/slanted text/.
  4. The /italic/ text/ is as short as possible (so here only "italic").
  5. The whole thing is recursive, so /italic text can also be bold/.
  6. The delimiters are ambiguous: for example, the slash can also occur as an ordinary character, and _ can mean underline or sub_script depending on context.
  7. Besides the symbols text-kursiv and text-normal there are further syntax elements, such as links, that are marked with an entirely different syntax.

Org mode itself apparently solves this differently during export: not with a BNF parser but with program code and, in particular, a regex that matches not only the italic text but also the character before and after it.

Here, though, that does not work so easily. First, I cannot specify a regex for the symbol text-kursiv (because of the recursion). Second, text-kursiv cannot know whether it is preceded by a space: the BNF dialect in use supports look-ahead but not look-behind. Finally, the regex for text-normal is hard to write because it has to stop in exactly the right place: sometimes after a space, when a / follows; sometimes without any additional criterion, when a [ follows (link or footnote).

See also org-emph-re in Emacs.
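
Constraints 1 to 4 from the list above can be approximated with one regular expression that also inspects the character before and after the marked-up text. The following is only a minimal Python sketch of that idea, not the actual org-emph-re: the PRE and POST character sets and the name find_italic are assumptions chosen for illustration, and Org mode's real sets differ in detail.

```python
import re

# Assumed border characters (hypothetical, simplified).
PRE = r"(?:^|(?<=[\s\-({]))"      # line start, or whitespace/punctuation before
POST = r"(?=$|[\s\-.,:;!?)}])"    # line end, or whitespace/punctuation after

ITALIC_RE = re.compile(
    PRE
    # The body must not start or end with whitespace (rule 2); it may
    # contain "/" (rule 3) and is matched non-greedily (rule 4).
    + r"/(?P<body>\S|\S.*?\S)/"
    + POST
)

def find_italic(line):
    """Return the bodies of all italic spans found in one line."""
    return [m.group("body") for m in ITALIC_RE.finditer(line)]

print(find_italic("This is /italic/ text."))
print(find_italic("The /italic/ text/ is short."))
print(find_italic("Choose either/or, and/or both."))
```

Because the border checks are zero-width, a slash inside a word such as either/or never opens an italic span, which covers part of rule 6 as well. Rule 5 (recursion) and the subscript meaning of _ are not handled here.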

munen commented 4 years ago

@branch14 Do you potentially have an input/idea on how to tackle this?

@schoettl provided more links on the issue in his last PR, too: https://github.com/200ok-ch/org-parser/pull/9

branch14 commented 4 years ago

@munen, @schoettl I acknowledge that Org-mode might have syntactic elements that cannot be properly parsed by EBNF/PEG. While the project's goal is to have as much of the Org-mode syntax as possible formalized in an EBNF/PEG grammar, we need an alternative (more pragmatic) approach to provide a full-featured parser.

Some of the issues mentioned here can be implemented in the grammar, while others might need to be deferred to the transformation step mentioned in #9. E.g. multi-line text styles cannot be tackled with a line-based parser, but can easily be dealt with using regexes in a transformation.

branch14 commented 4 years ago

Here I laid out what the code for the transformation could look like: #15

For parsing multi-line styles a 2nd transformation step would be needed. (Not part of this PR.)

schoettl commented 4 years ago

Even single-line styles have severe problems in EBNF. I want to check whether it is reasonable to put all styles (emphasis and verbatim, as the spec calls them) into the transformations. An advantage of this approach is that we can reuse the logic and regexes from orgmode.
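
To illustrate what moving style handling into a transformation could mean, here is a rough Python sketch (the project itself is written in Clojure, so this is not its code). It applies one border-aware regex to a plain text string and recurses into each match, so nested styles come out as a small tree. The marker table, the border character sets, and the function name style are all assumptions; verbatim and code markup, which must not be parsed recursively, are left out.

```python
import re

# Hypothetical marker table for this sketch.
MARKERS = {"*": "bold", "/": "italic", "_": "underline", "+": "strike"}

EMPH_RE = re.compile(
    r"(?:^|(?<=[\s\-({]))"                       # assumed allowed chars before
    r"(?P<m>[*/_+])(?P<body>\S|\S.*?\S)(?P=m)"   # same marker opens and closes
    r"(?=$|[\s\-.,:;!?)}])"                      # assumed allowed chars after
)

def style(text):
    """Split text into plain strings and (style, children) tuples."""
    out, pos = [], 0
    for m in EMPH_RE.finditer(text):
        if m.start() > pos:
            out.append(text[pos:m.start()])      # plain text before the match
        # Recurse into the body so nested emphasis is parsed as well.
        out.append((MARKERS[m.group("m")], style(m.group("body"))))
        pos = m.end()
    if pos < len(text):
        out.append(text[pos:])                   # trailing plain text
    return out

print(style("a *bold and /italic/ text* b"))
```

The recursion lives in ordinary program code instead of in the grammar, which is what makes the regex usable here even though it could not be used for a recursive grammar symbol.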

munen commented 4 years ago

An advantage of this approach is that we can reuse the logic and regexes from orgmode.

:+1:

munen commented 4 years ago

That could actually prove to be a major benefit.

As long as Emacs is not using org-parser (cough), it could prove very beneficial to keep some complicated parts close to the Elisp codebase.

schoettl commented 4 years ago

Note to self:

  1. Check if it's OK or "allowed" that emphasis spans other elements that are already parsed via EBNF. E.g. *this [[url]] is bold and _O_2 is also included in the underlined_ text*

  2. If 1. is good, there is one difficulty: When we apply the emphasis regexes on a parsed line, we need a string as input, not a parse tree containing links or otherwise parsed elements. ~To get the string input, AFAIK we have to "export" the parse tree to get the original string. Similar to what is done in organice's org_export.js.~

It would be great if instaparse has a way to get the original, unparsed text along with the parse tree, ~but I don't think it has.~ EDIT: It has: meta and span functions

schoettl commented 4 years ago

Regarding 2.: instaparse has a built-in way to get position/location meta information from the parse tree! Even if the parse tree looks like it only holds the parsed data, the meta and span functions return this information: https://cljdoc.org/d/instaparse/instaparse/1.4.9/doc/readme#character-spans

So, if we have the original input text, it's no problem to apply emphasis regexes on the original line. We do have all position information about elements parsed via EBNF.
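
A rough Python sketch of this idea follows. It does not use instaparse; the (kind, start, end) tuples merely stand in for the position information that the span function would provide, and the regex, the element list, and the function name emphasis_with_elements are assumptions for illustration. The emphasis regex runs on the original line, and the spans tell us which already-parsed elements lie inside an emphasized range and which ones only partly overlap it.

```python
import re

# Simplified bold regex with assumed border characters.
BOLD_RE = re.compile(r"(?:^|(?<=\s))\*(\S|\S.*?\S)\*(?=$|[\s.,;:!?])")

def emphasis_with_elements(line, elements):
    """elements: list of (kind, start, end) spans, end exclusive."""
    results = []
    for m in BOLD_RE.finditer(line):
        s, e = m.start(1), m.end(1)              # span of the emphasized body
        # Elements completely contained in the emphasized range.
        inside = [el for el in elements if s <= el[1] and el[2] <= e]
        # Elements that overlap the range without being contained in it.
        crossing = [el for el in elements
                    if el[1] < e and s < el[2] and el not in inside]
        results.append({"span": (s, e), "inside": inside, "crossing": crossing})
    return results

line = "*this [[url]] is bold* text"
start = line.index("[[url]]")
link = ("link", start, start + len("[[url]]"))   # hypothetical parsed element
print(emphasis_with_elements(line, [link]))
```

A non-empty crossing list would flag the case from the earlier note, where it is unclear whether emphasis is "allowed" to cut through an element that was already parsed via EBNF.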

jcguu95 commented 7 months ago

It's been 4 years, and this issue seems to be the last unfinished todo item in /org-parser/README.org. What is its current state?

  1. Is this the only gap between org-parser and the official org parser?
  2. Will this be done in the future, perhaps by extending EBNF and instaparse a bit more?

schoettl commented 7 months ago

The project hasn't been very active since then and no one has worked on this specific problem. I guess it's not the only gap. The checklist in the README may miss some less common org features. There is also a lot of room for enhancement on the transformation side (the step of converting the instaparse parse result into a more meaningful data structure). For example, joining the lines of a paragraph, which would be a requirement for parsing style markup in the transformation step.
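
As a rough illustration of that prerequisite, the Python sketch below joins consecutive non-blank lines into one paragraph string, so that markup spanning a line break becomes visible to a single regex pass. It is deliberately naive: the function name join_paragraphs is hypothetical, and it assumes that every non-blank line is paragraph text, ignoring headlines, list items, blocks, and the loss of the original line positions.

```python
def join_paragraphs(lines):
    """Join runs of non-blank lines into single paragraph strings."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

print(join_paragraphs(["This is *bold", "across lines* here.", "", "Next."]))
```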

I'm personally more concerned about #56 – that's why I don't invest much time.