Ensure good design and learn from docutils/sphinx

chrisjsewell commented 3 years ago

The text below is copied from https://github.com/executablebooks/markdown-it-myst/pull/31

TL;DR: docutils/sphinx can, I feel, be a little overly complex and have some shortcomings, BUT many aspects are there for a reason, and we should learn from them and ensure the design can accommodate (and be extensible to) the necessary complexity from the outset.

I would like to eventually create some UML/SysML diagrams of the design, for ourselves and others to understand.

In this document we outline the general design decisions for a generic MyST parser, and then how this applies to the Javascript parser we have built here. Note, this may eventually be moved to a "top-level" documentation of MyST.

Currently, the primary implementation of a MyST parser is written as a Sphinx extension (in Python). It uses markdown-it to initially parse the source text to a "token stream" (a list of syntax tokens, encapsulating the whole document and its content); the myst-parser extension then converts this token stream to a docutils AST, which Sphinx uses to produce the desired output format (e.g. HTML or LaTeX). Naturally this design is tightly coupled to Sphinx, but (a) in Javascript we do not have an implementation of Sphinx, and (b) we would like to move away from being reliant on any one "technology" for parsing, and instead outline a more generic "standard" for MyST parsing, which anyone could in principle implement. What we don't want to do, though, is end up unknowingly reimplementing a worse version of Sphinx. In the next section, then, we discuss the Sphinx design, the reasons behind it, and some of its technical limitations.

Analysis of the Sphinx design

The Sphinx design is outlined in more detail at https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx-core-events, but the basic stages can be described as:

1. We read in a global configuration for the parse.
2. We need to parse each document into an "output format agnostic" Abstract Syntax Tree (AST). This is performed in a linear manner, stepping through each line of the source text.
   - As well as creating the AST here, we also store aspects of the document to a global state object (known as the BuildEnvironment), for later fast lookup.
   - Also to note here, directive and role syntaxes are processed as they are encountered.
3. There are certain per-document AST operations we cannot perform until we have parsed that document, e.g. replacing substitution references with their definitions. These are known as transforms, and are applied in order of priority.
4. We then want to cache each document AST, so that we do not have to re-parse every document when one changes.
   - We also need to cache the global state object.
5. There are certain per-document AST operations we cannot perform until we have parsed all documents, i.e. our global state is complete and up-to-date, for example matching inter-document references to their targets. These are known as post-transforms, and are applied in order of priority.
   - Additionally, we may want to perform operations specific to a certain output format. These are also included in the post-transforms, and so the post-transforms are run once per output format.
   - The per-document ASTs at this point are transient, i.e. they are not cached, since any change to any document could affect them, and so there is no benefit in caching.
6. Now that we have our final ASTs, we can perform the render, whereby we convert the ASTs to the output format.
   - A renderer in Sphinx is known as a Builder.
   - Another important thing we need to do is map filepath references (e.g. for images and downloadable files) to paths in the build folder, and ensure these files are copied there.
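To make the ordering of these stages concrete, here is a rough Python sketch. It is not how Sphinx is implemented: `Node`, `Environment`, `build` and the toy `ref:` syntax are all names made up for illustration, and configuration loading and caching are left out. The point is only that per-document transforms can run as soon as a document is parsed, whereas reference resolution has to wait until every document has contributed to the global state.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """Minimal output-format-agnostic AST node (hypothetical, not a docutils node)."""
    type: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


@dataclass
class Environment:
    """Global state shared by all documents (stands in for the BuildEnvironment)."""
    targets: dict[str, str] = field(default_factory=dict)  # label -> docname


Transform = Callable[[Node, Environment], None]


def parse(docname: str, source: str, env: Environment) -> Node:
    """Step 2: linear, line-by-line parse that also records targets in env."""
    doc = Node("document")
    for line in source.splitlines():
        if line.startswith("(") and line.endswith(")="):
            env.targets[line[1:-2]] = docname
            doc.children.append(Node("target", line[1:-2]))
        elif line.startswith("ref:"):  # toy reference syntax, an assumption
            doc.children.append(Node("pending_ref", line[4:]))
        elif line.strip():
            doc.children.append(Node("paragraph", line))
    return doc


def strip_text(doc: Node, env: Environment) -> None:
    """Step 3 transform: needs only the document it runs on."""
    for node in doc.children:
        node.text = node.text.strip()


def resolve_refs(doc: Node, env: Environment) -> None:
    """Step 5 post-transform: needs the targets collected from all documents."""
    for node in doc.children:
        if node.type == "pending_ref":
            found = node.text in env.targets
            node.type = "ref" if found else "broken_ref"
            if found:
                node.text = f"{env.targets[node.text]}.html#{node.text}"


def apply_in_priority_order(
    transforms: list[tuple[int, Transform]], doc: Node, env: Environment
) -> None:
    """Run (priority, transform) pairs, lowest priority number first."""
    for _priority, transform in sorted(transforms, key=lambda item: item[0]):
        transform(doc, env)


TEMPLATES = {
    "paragraph": "<p>{}</p>",
    "target": '<span id="{}"></span>',
    "ref": '<a href="{0}">{0}</a>',
    "broken_ref": "<em>{}</em>",
}


def render_html(doc: Node) -> str:
    """Step 6: convert the final AST to one output format (no escaping here)."""
    return "\n".join(TEMPLATES[n.type].format(n.text) for n in doc.children)


def build(sources: dict[str, str]) -> dict[str, str]:
    env = Environment()  # step 1 (reading the configuration) is omitted
    docs = {name: parse(name, text, env) for name, text in sources.items()}
    for doc in docs.values():
        apply_in_priority_order([(100, strip_text)], doc, env)
    # Step 4 (caching the ASTs and the environment) is omitted.
    for doc in docs.values():
        apply_in_priority_order([(100, resolve_refs)], doc, env)
    return {name: render_html(doc) for name, doc in docs.items()}
```

Note that a document can reference a target defined in a document that is parsed after it, which is exactly why the reference resolution has to be a post-transform.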

Another core concept is that of the logger, which logs specific information/warnings to the console, but can also be configured to fail the build (i.e. produce a non-zero exit code) if any warnings are encountered. In this way the build is robust to errors (we don't want the whole build failing because of one syntax error), but allows us to programmatically tell if there are any issues with our documentation (e.g. when we run CI tests).
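As a small illustration of this behaviour, the sketch below uses only Python's standard `logging` module. `WarningCounter`, `run_build` and the "unbalanced backtick" check are hypothetical stand-ins, not Sphinx code: the build always runs to completion, and the warning count only decides the exit code at the end.

```python
import logging


class WarningCounter(logging.Handler):
    """Counts records at WARNING level or above without stopping the build."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        self.count += 1


def run_build(sources: dict[str, str], fail_on_warning: bool = False) -> int:
    """Process every document, then return the exit code for the whole build."""
    logger = logging.getLogger("myst_build_sketch")  # hypothetical logger name
    counter = WarningCounter()
    logger.addHandler(counter)
    try:
        for docname, text in sources.items():
            for lineno, line in enumerate(text.splitlines(), start=1):
                # Toy syntax check: one bad line must not abort the build.
                if line.count("`") % 2:
                    logger.warning("%s:%d: unbalanced backtick", docname, lineno)
    finally:
        logger.removeHandler(counter)
    return 1 if (fail_on_warning and counter.count) else 0
```

In CI one would call something like `sys.exit(run_build(sources, fail_on_warning=True))`, while a live preview would leave the flag off.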

As an addendum to the above design, we can also consider the steps to re-build the outputs, given an initial build has already been performed.

- Within the global configuration specification, each variable defines a rebuild condition, i.e. whether a change in this variable should invoke a full rebuild (invalidating the cached document ASTs and global env, and starting again from step (2)) or simply requires a rebuild from step (5).
  - The variables from the last parse are stored in the global env, and compared to those from the current parse.
- In deciding which documents should be reparsed from step (2), the mtime of the source file and the mtime of the cached AST file are compared.
- Steps (5) and (6) are always run.
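The two checks above could look roughly like the following sketch. The function names, the `.md`/`.json` file suffixes and the `"full"`/`"post-transforms"` labels are assumptions for illustration only; they are not Sphinx's actual values or cache layout.

```python
from pathlib import Path


def outdated_docs(src_dir: Path, cache_dir: Path) -> list[str]:
    """Return docnames whose cached AST is missing or older than the source."""
    outdated = []
    for src in sorted(src_dir.glob("*.md")):
        cached = cache_dir / (src.stem + ".json")  # assumed cache layout
        if not cached.exists() or src.stat().st_mtime > cached.stat().st_mtime:
            outdated.append(src.stem)
    return outdated


# Hypothetical rebuild condition per configuration variable.
REBUILD_CONDITIONS = {"extensions": "full", "html_title": "post-transforms"}


def rebuild_scope(old_config: dict, new_config: dict) -> str:
    """Compare the config stored in the env with the current one."""
    keys = old_config.keys() | new_config.keys()
    changed = [k for k in keys if old_config.get(k) != new_config.get(k)]
    # Unknown variables are treated conservatively as needing a full rebuild.
    if any(REBUILD_CONDITIONS.get(k, "full") == "full" for k in changed):
        return "full"  # discard caches, restart from step (2)
    return "post-transforms"  # steps (5) and (6) are always run
```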

Lastly we should consider Sphinx's plugin system, in the form of extensions which can:

- Specify additional configuration variables (including type validation and rebuild condition)
- Define functions that occur after the configuration has been read (known as config-inited events), e.g. to apply additional validation
- Add new parsers to apply to particular file name suffixes
- Define additional roles and directives
- Define additional transforms (and their priority)
- Define additional post-transforms (their priority and what output formats they apply to)
- Define additional renderers
- Override which cached documents are considered outdated/invalid (known as env-get-outdated events)
- Interject at a number of other key stages in the build (see other events)

Although a lot of this system is well designed, and we will certainly need to include most if not all of these steps, there are a number of design issues that could be improved:

1. The configuration must be written as a Python file; instead it should be a more general, declarative format (like YAML or TOML).
2. The document ASTs are designed as Python class instances (see https://github.com/chrisjsewell/docutils/blob/develop/docutils/docutils/nodes.py) and cached as pickled files (see https://docs.python.org/3/library/pickle.html). Again this should be a more general, serializable format such as JSON.
   - Usually when testing docutils ASTs we use the pformat method, which converts the AST into a "pseudo-XML" string, although this does not actually contain all the information about the AST.
   - The markdown-it token stream system allows for a much more facile serialisation/de-serialisation to JSON.
3. The environment object is again a Python class instance and stored by pickling.
   - More so than just lacking a general storage format, this is currently a big obstacle for programs that want to access the environment in a dynamic manner, for example language servers that could provide auto-completions and navigation for inter-file references. Here it would be really useful for the environment to be a database with concurrent read/write access.
   - https://en.wikipedia.org/wiki/SQLite would possibly be a good choice here, as a serverless DB format (i.e. it does not require a background process to be running), with support for basically every programming language, although this may not be ideal for a web-based environment.
   - See https://github.com/chrisjsewell/rst-language-server, where I was working on such a language server and had to adapt docutils/sphinx in such a way (it was also beneficial to store the line number mappings for references etc).
4. Mappings of directive/role names to their processing functions/classes are stored in docutils as global variables. This is problematic for asynchronous document parsing.
5. The conversion of roles/directives is performed in the same pass as all other syntaxes. This leads to the AST being intrinsically "lossy", in that there is no way to recover what directives were in the original source text (see https://github.com/executablebooks/rst-to-myst, where I had to re-write some of the parsing code to achieve this).
6. Role/directive/transform code is very "side-effect heavy", in that it directly mutates the document AST and environment. There is already plenty of literature out there explaining why this is not ideal (e.g. https://softwareengineering.stackexchange.com/questions/15269/why-are-side-effects-considered-evil-in-functional-programming), and in general we should strive towards a more functional programming paradigm.
7. Overall, Sphinx is quite difficult to run in a programmatic (API) manner (as opposed to via the command-line).
8. Sphinx is quite tied to having files exist on a file system, whereas we may want to work in a more virtual file system.
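To illustrate a few of these points together (a JSON-serialisable AST, directives expanded in a separate pass so the source directive stays recoverable, and expansion without mutating the input), here is a minimal sketch. The `Token` class is a made-up tree structure, not markdown-it's actual token class, and `expand_directives` is a hypothetical helper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class Token:
    """Hypothetical tree node made only of JSON-compatible values."""
    type: str
    content: str = ""
    meta: dict = field(default_factory=dict)
    children: list["Token"] = field(default_factory=list)


def to_json(token: Token) -> str:
    """Serialise the whole tree; no pickling of class instances is needed."""
    return json.dumps(asdict(token), sort_keys=True)


def from_json(text: str) -> Token:
    def build(data: dict) -> Token:
        children = [build(child) for child in data["children"]]
        return Token(data["type"], data["content"], data["meta"], children)

    return build(json.loads(text))


def expand_directives(
    token: Token, handlers: dict[str, Callable[[Token], Token]]
) -> Token:
    """Second pass: return a new tree and leave the input tree untouched."""
    children = [expand_directives(child, handlers) for child in token.children]
    handler = handlers.get(token.meta.get("name")) if token.type == "directive" else None
    if handler is None:
        return Token(token.type, token.content, dict(token.meta), children)
    expanded = handler(token)  # the handler is expected to return a fresh Token
    # Remember which directive produced this node, so the source is recoverable.
    expanded.meta["directive"] = token.meta["name"]
    return expanded
```

Because the first pass keeps `directive` tokens as they were written, the unexpanded tree can be cached or converted back to source, and handlers can be supplied per call rather than looked up in global state.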

chrisjsewell commented 3 years ago

Note, this is not to say we need to immediately implement the full functionality of Sphinx, which will be no mean feat. But, where possible, we should put thought into the initial steps, such that we do not have to completely re-design everything once it (I feel inevitably) gets more complex.

chrisjsewell commented 3 years ago

Note 2, we should also be cognisant of the use cases:

1. The current use case this package is utilised for is "single-page HTML previews", in https://github.com/executablebooks/myst-vs-code and eventually also https://github.com/executablebooks/jupyterlab-myst.

   Here we can "get away" with not having to fully render every role/directive etc, or deal with any multi-page issues (e.g. cross-page referencing). An important thing, though, is that the parse is sufficiently fast for realtime re-rendering.

2. Another use case I would like to work towards is an LSP. Here we might want to parse all documents in the background, and maintain a "database" of references/targets and their position in the document (e.g. for "jump to definition" and reference auto-complete features).
3. Actually rendering a full book.
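For the LSP use case, the "database" of references/targets could be sketched with Python's built-in `sqlite3` module as below. The schema and the function names (`open_env`, `index_document`, `find_definition`, `complete`) are assumptions for illustration; a Javascript implementation would need an equivalent store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS targets (
    label TEXT PRIMARY KEY, docname TEXT NOT NULL, line INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS refs (
    label TEXT NOT NULL, docname TEXT NOT NULL, line INTEGER NOT NULL
);
"""


def open_env(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the environment database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def index_document(
    conn: sqlite3.Connection,
    docname: str,
    targets: list[tuple[str, int]],
    references: list[tuple[str, int]],
) -> None:
    """Replace everything previously recorded for one document."""
    with conn:  # one transaction per re-parsed document
        conn.execute("DELETE FROM targets WHERE docname = ?", (docname,))
        conn.execute("DELETE FROM refs WHERE docname = ?", (docname,))
        conn.executemany(
            "INSERT OR REPLACE INTO targets VALUES (?, ?, ?)",
            [(label, docname, line) for label, line in targets],
        )
        conn.executemany(
            "INSERT INTO refs VALUES (?, ?, ?)",
            [(label, docname, line) for label, line in references],
        )


def find_definition(conn: sqlite3.Connection, label: str):
    """'Jump to definition': return (docname, line), or None if unknown."""
    query = "SELECT docname, line FROM targets WHERE label = ?"
    return conn.execute(query, (label,)).fetchone()


def complete(conn: sqlite3.Connection, prefix: str) -> list[str]:
    """Reference auto-complete: all target labels starting with prefix."""
    query = (
        "SELECT label FROM targets "
        "WHERE substr(label, 1, length(?)) = ? ORDER BY label"
    )
    return [row[0] for row in conn.execute(query, (prefix, prefix))]
```

Re-indexing a single document on every edit keeps the database up to date without re-parsing the others, which matters for the realtime use case.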
