executablebooks / markdown-it-docutils

A markdown-it plugin for implementing docutils style roles/directives.
https://executablebooks.github.io/markdown-it-docutils/
MIT License

Ensure good design and learn from docutils/sphinx #18

Open chrisjsewell opened 2 years ago

chrisjsewell commented 2 years ago

The text below is copied from https://github.com/executablebooks/markdown-it-myst/pull/31

TL;DR: I feel docutils/Sphinx can be a little overly complex and has some shortcomings, BUT many aspects are there for a reason; we should learn from them and ensure our design can accommodate, and be extensible to, the necessary complexity from the outset.

I would like to eventually create some UML/SysML diagrams of the design, so that we and others can understand it.


In this document we outline the general design decisions for a generic MyST parser, and then how this applies to the Javascript parser we have built here. Note, this may eventually be moved to a "top-level" documentation of MyST.

Currently, the primary implementation of a MyST parser is written as a Sphinx extension (in Python). It uses markdown-it to initially parse the source text to a "token stream" (a list of syntax tokens, encapsulating the whole document and its content); the myst-parser extension then converts this token stream to a docutils AST, which Sphinx uses to produce the desired output format (e.g. HTML or LaTeX). Naturally this design is tightly coupled to Sphinx, but (a) in Javascript we do not have an implementation of Sphinx, and (b) we would like to move away from being reliant on any one "technology" for parsing, and instead outline a more generic "standard" for MyST parsing, which anyone could in principle implement. What we don't want to do, though, is end up unknowingly reimplementing a worse version of Sphinx. In the next section, then, we discuss the Sphinx design, the reasons behind it, and some of its technical limitations.

Analysis of the Sphinx design

The Sphinx design is outlined in more detail at https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx-core-events, but the basic stages can be described as:

  1. We read in a global configuration for the parse.
  2. We need to parse each document into an "output format agnostic" Abstract Syntax Tree (AST). This is performed in a linear manner, stepping through each line of the source text.
    • As well as creating the AST here, we also store aspects of the document to a global state object (known as the BuildEnvironment), for later fast lookup.
    • Also to note here, directive and role syntaxes are processed as they are encountered.
  3. There are certain per-document AST operations we cannot perform until we have parsed that document, e.g. replacing substitution references with their definitions. These are known as transforms, and are applied in order of priority.
  4. We then want to cache each document AST, so that we do not have to re-parse every document when one changes.
    • We also need to cache the global state object
  5. There are certain per-document AST operations we cannot perform until we have parsed all documents, i.e. our global state is complete and up-to-date, for example matching inter-document references to their targets. These are known as post-transforms, and are applied in order of priority.
    • Additionally, we may want to perform operations specific to a certain output format. These are also included in the post-transforms, and so the post-transforms are run once for each output format.
    • The per document ASTs at this point are transient, i.e. they are not cached, since any change to any document could affect them, and so there is no benefit in caching.
  6. Now we have our final ASTs, we can perform the render, whereby we convert the ASTs to the output format.
    • A renderer in Sphinx is known as a Builder.
    • Another important thing we need to do is map filepath references (e.g. for images and downloadable files) to paths in the build folder, and ensure these files are copied there.
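To make the ordering of these stages concrete, here is a minimal sketch in plain Python. It is not Sphinx's implementation: `Node`, `Env`, `parse`, `resolve_refs`, `render_html` and `build` are all hypothetical names, the source syntax is a toy one (`(name)=` defines a target, `ref:name` references it), and caching (stage 4) is left out. The point it tries to show is why reference resolution has to be a post-transform: a document parsed first can only be resolved once the global state contains targets from documents parsed later.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy AST node (hypothetical, far simpler than a docutils node)."""
    kind: str
    value: str = ""
    href: str = ""
    children: list = field(default_factory=list)


@dataclass
class Env:
    """Global state shared by all documents (loosely analogous to BuildEnvironment)."""
    targets: dict = field(default_factory=dict)  # target name -> docname


def parse(docname: str, text: str, env: Env) -> Node:
    """Stage 2: step through the lines, building the AST and recording targets."""
    doc = Node("document", docname)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("(") and line.endswith(")="):
            name = line[1:-2]
            env.targets[name] = docname  # stored for later fast lookup
            doc.children.append(Node("target", name))
        elif line.startswith("ref:"):
            doc.children.append(Node("pending_ref", line[4:]))
        else:
            doc.children.append(Node("paragraph", line))
    return doc


def resolve_refs(doc: Node, env: Env) -> None:
    """Stage 5 post-transform: only valid once every document has been parsed."""
    for child in doc.children:
        if child.kind == "pending_ref":
            target_doc = env.targets.get(child.value)
            if target_doc is None:
                child.kind = "problematic"
            else:
                child.kind = "reference"
                child.href = f"{target_doc}.html#{child.value}"


# (priority, function) pairs; lower priorities run first.
POST_TRANSFORMS = [(10, resolve_refs)]


def render_html(doc: Node) -> str:
    """Stage 6: convert the final AST to one output format."""
    parts = []
    for child in doc.children:
        if child.kind == "paragraph":
            parts.append(f"<p>{child.value}</p>")
        elif child.kind == "target":
            parts.append(f'<span id="{child.value}"></span>')
        elif child.kind == "reference":
            parts.append(f'<a href="{child.href}">{child.value}</a>')
        else:  # unresolved reference
            parts.append(f'<span class="problematic">{child.value}</span>')
    return "\n".join(parts)


def build(sources: dict) -> dict:
    env = Env()
    docs = {name: parse(name, text, env) for name, text in sources.items()}
    for doc in docs.values():
        for _priority, transform in sorted(POST_TRANSFORMS, key=lambda t: t[0]):
            transform(doc, env)
    return {name: render_html(doc) for name, doc in docs.items()}
```

For example, `build({"a": "ref:intro", "b": "(intro)=\nHello"})` resolves the reference in `a` even though its target is only discovered while parsing `b`.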

Another core concept is that of the logger, which logs specific information/warnings to the console, but also can be configured to fail the build (i.e. produce a non-zero exit code) if any warnings are encountered. In this way the build is robust to errors (we don't want the whole build failing because of one syntax error), but allows us to programmatically tell if there are any issues with our documentation (e.g. when we run CI tests).
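One way this "log everything, fail at the end" behaviour could be sketched is with a counting handler on top of Python's standard `logging` module. `WarningCounter`, `make_logger` and `exit_code` are hypothetical names used for illustration only; this is not how Sphinx itself implements it.

```python
import logging


class WarningCounter(logging.Handler):
    """Counts records at WARNING level or above without stopping the build."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        self.count += 1


def make_logger(name: str = "myst-build"):
    """Return a logger together with the counter attached to it."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    counter = WarningCounter()
    logger.addHandler(counter)
    return logger, counter


def exit_code(counter: WarningCounter, fail_on_warning: bool) -> int:
    """Non-zero only if warnings were seen AND the build is configured to fail on them."""
    return 1 if fail_on_warning and counter.count > 0 else 0
```

The build keeps going after each warning; the decision to fail is deferred until the end, which is what lets a CI job detect problems without one syntax error aborting the whole parse.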

As an addendum to the above design, we can also consider the steps to re-build the outputs, given an initial build has already been performed.

  1. Within the global configuration specification, each variable defines a rebuild condition, i.e. whether a change in this variable should invoke a full rebuild (invalidating the cached document ASTs and global env and starting again from step (2)) or simply requires a rebuild from step (5).
    • The variables from the last parse are stored in the global env, and compared to those from the current parse.
  2. In deciding which documents should be reparsed from step (2), the mtime of the source file and cached AST file are compared.
  3. Steps (5) and (6) are always run.
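The rebuild decision above can be sketched as two small functions. This is a simplified illustration under stated assumptions: the `Rebuild` enum, `config_rebuild` and `outdated_docs` are hypothetical names, configuration values are compared with plain equality, and mtimes are passed in as numbers rather than read from disk.

```python
from enum import Enum


class Rebuild(Enum):
    NONE = 0  # a change in this variable needs no extra work
    POST = 1  # rerun from step (5) only
    FULL = 2  # invalidate cached ASTs and env, rerun from step (2)


def config_rebuild(spec: dict, old: dict, new: dict) -> Rebuild:
    """Return the strongest rebuild condition among the variables that changed."""
    level = Rebuild.NONE
    for name, condition in spec.items():
        if old.get(name) != new.get(name) and condition.value > level.value:
            level = condition
    return level


def outdated_docs(source_mtimes: dict, cache_mtimes: dict, full: bool) -> set:
    """Documents to reparse: all of them on a full rebuild, otherwise only those
    with no cached AST or with a source newer than its cached AST."""
    if full:
        return set(source_mtimes)
    return {
        name
        for name, mtime in source_mtimes.items()
        if name not in cache_mtimes or mtime > cache_mtimes[name]
    }
```

Here `old` plays the role of the variables stored in the global env from the last parse, and `new` the variables from the current one.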

Lastly we should consider Sphinx's plugin system, in the form of extensions which can:

  1. Specify additional configuration variables (including type validation and rebuild condition)
  2. Define functions that occur after the configuration has been read (known as config-inited events) e.g. to apply additional validation
  3. Add new parsers to apply to particular file name suffixes
  4. Define additional roles and directives
  5. Define additional transforms (and their priority)
  6. Define additional post-transforms (their priority and what output formats they apply to)
  7. Define additional renderers
  8. Override which cached documents are considered outdated/invalid (known as env-get-outdated events)
  9. Interject at a number of other key stages in the build (see other events)

Although a lot of this system is well designed, and we will certainly need to include most if not all of these steps, there are a number of design issues that could be improved:

chrisjsewell commented 2 years ago

Note, this is not to say we need to immediately implement the full functionality of Sphinx, which will be no mean feat. But, where possible, we should put thought into the initial steps, such that we do not have to completely re-design everything, once it (I feel inevitably) gets more complex.

chrisjsewell commented 2 years ago

Note 2, we should also be cognisant of the use cases:

  1. The current use case this package is utilised for is "single-page HTML previews", in https://github.com/executablebooks/myst-vs-code and also, eventually, https://github.com/executablebooks/jupyterlab-myst

Here we can "get away" with not having to fully render every role/directive etc, or deal with any multi-page issues (e.g. cross-page referencing). An important thing though, is that the parse is sufficiently fast, for realtime re-rendering.

  2. Another use case I would like to work towards is an LSP. Here we might want to parse all documents in the background, and maintain a "database" of references/targets and their position in the document (e.g. for "jump to definition" and reference auto-complete features)

  3. Actually rendering a full book
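For the LSP use case, the "database" of references/targets could look something like the sketch below. `Location` and `ReferenceIndex` are hypothetical names, and the regular expressions are a deliberate simplification: they only recognise a `(name)=` target on its own line and the plain `` {ref}`name` `` role form, whereas a real implementation would take positions from the parser's tokens.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    doc: str
    line: int    # zero-based line number
    column: int  # zero-based column


# Simplified patterns (an assumption of this sketch, not the full MyST syntax).
TARGET_RE = re.compile(r"^\((?P<name>[^)]+)\)=\s*$")
REF_RE = re.compile(r"\{ref\}`(?P<name>[^`]+)`")


class ReferenceIndex:
    """In-memory index of targets and references across documents."""

    def __init__(self) -> None:
        self.targets = {}     # target name -> Location
        self.references = {}  # docname -> list of (name, Location)

    def update(self, doc: str, text: str) -> None:
        """Re-index one document, dropping whatever it contributed before."""
        self.targets = {n: loc for n, loc in self.targets.items() if loc.doc != doc}
        refs = []
        for lineno, line in enumerate(text.splitlines()):
            target = TARGET_RE.match(line)
            if target:
                self.targets[target["name"]] = Location(doc, lineno, 0)
            for ref in REF_RE.finditer(line):
                refs.append((ref["name"], Location(doc, lineno, ref.start())))
        self.references[doc] = refs

    def definition(self, name: str):
        """'Jump to definition': where is this target defined, if anywhere?"""
        return self.targets.get(name)

    def complete(self, prefix: str) -> list:
        """Reference auto-complete: known target names starting with the prefix."""
        return sorted(name for name in self.targets if name.startswith(prefix))
```

Because `update` works one document at a time, a background process could call it whenever a file changes, without re-parsing the rest of the project.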