CDSoft / pp

PP - Generic preprocessor (with pandoc in mind) - macros, literate programming, diagrams, scripts...
http://cdelord.fr/pp
GNU General Public License v3.0

Macros Definitions Files: Thoughts and Questions #25

Closed — tajmone closed this issue 6 years ago

tajmone commented 7 years ago

I'm currently prototyping a custom flat-file static CMS that uses PP » Pandoc (+pandoc templates) to generate HTML documentation in a website fashion from markdown source files. The obvious advantage of this approach lies in PP macros being very flexible and highly customizable.

I wanted to share my experience with this project, as it might prompt suggestions and shed light on potential uses and new features — I propose a new feature in the Conclusions. If you see better ways to approach this task, I'd really appreciate your feedback. Pandoc has already become part of several CMS projects (like Pandocomatic), and I think that PP has great integration potential in this field.

Hopefully, my open-source project will be published on GitHub in the near future, so others might benefit from it and re-adapt it to their uses. Its main goal is to create resource-documentation projects that can be built into a website viewable via GitHub Pages, as well as clonable repos containing coding resources + HTML docs.

YAML Settings Files Hierarchy

The project relies on a YAML settings file being present in each folder, plus a YAML file (or YAML header) for each single document (in the form filename.yaml). The CMS loads the PP macros into memory, to prevent redundant disk accesses; it also keeps each folder's YAML settings file in memory for the duration of that folder's processing. The CMS then loads the source files into memory and merges them with the preloaded macro definitions and YAML files before feeding them to PP and/or Pandoc.

YAML files can easily be merged with the markdown source files in memory and then fed to PP or Pandoc via STDIN. When a YAML block contains a variable already defined in a previous YAML block, Pandoc operates on a left-biased principle, so the first definition is the one that wins — therefore YAML order matters, and YAML blocks are chained thus:

  1. Source doc's YAML header
  2. Associated filename.yaml
  3. Folder's YAML settings file
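To make the left-biased rule concrete, here is a minimal Python model of how the chained blocks resolve (this is just an illustration of Pandoc's behaviour, not CMS code; the sample metadata keys are made up):

```python
# Model of Pandoc's left-biased metadata merging: when the same key
# appears in several YAML blocks, the first definition wins.
def merge_metadata(blocks):
    merged = {}
    for block in blocks:                   # blocks in the order Pandoc sees them
        for key, value in block.items():
            merged.setdefault(key, value)  # keep only the first definition
    return merged

doc_header  = {"title": "My Page", "lang": "en"}     # 1. source doc header
file_yaml   = {"title": "Ignored", "author": "tajmone"}  # 2. filename.yaml
folder_yaml = {"author": "ignored too", "css": "style.css"}  # 3. folder settings

print(merge_metadata([doc_header, file_yaml, folder_yaml]))
# {'title': 'My Page', 'lang': 'en', 'author': 'tajmone', 'css': 'style.css'}
```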

NOTE: By loading PP macro definitions ahead of the YAML blocks, PP macros can be used inside YAML definitions, making it possible to create dynamic/conditional YAML variables. This is great if you plan, for example, to also output the documentation in a different format using different Pandoc templates.
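As a sketch of what this enables, a source doc's YAML header could contain PP macro calls that get expanded before Pandoc ever parses the YAML (the macro names below are hypothetical, standing in for user-defined macros loaded beforehand):

```yaml
---
title:  Resource Index
# `!doclang` and `!sitename` are hypothetical user macros; PP expands
# them first, so Pandoc only ever sees plain YAML values.
lang:   !doclang
author: !sitename
---
```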

Handling Macros-Definitions

As for macro definitions, it's not currently possible to feed them to PP via STDIN in a non-emitting manner — i.e., separately from the source file to be processed.

So I've come across two possible solutions.

  1. from CLI — via -import=FILE option
  2. source injection — inject before source doc via !quiet(TEXT) macro

The 1st approach has the disadvantage that the custom macros file(s) have to be loaded from disk every time a file is built — their content can't be passed via STDIN!

The 2nd approach has the advantage that macro definitions can be loaded from file(s) into memory just once, when the CMS is launched, and then injected at the beginning of the markdown document inside the !quiet() macro. Example:

    !quiet
    ~~~~~~~~~~
    [macros definitions from memory]
    ~~~~~~~~~~

… this should be much faster — but more memory-expensive! — because it avoids lots of redundant disk accesses.

NOTE 1: This approach requires all macros to be defined in a single file (instead of multiple modules being imported from a core file using the !import macro). But having all macro definitions in a single file is somewhat more bothersome to maintain.

NOTE 2: I could still keep macro definitions in multiple files with the second approach — if they follow a standard naming convention (e.g. a *.pp extension), they can be sequentially loaded into memory by the CMS and merged into a single data block to inject into the !quiet() macro.
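The merge step described in NOTE 2 could look something like this Python sketch (the directory layout and `*.pp` convention are the ones assumed above; the `~` fence delimiters follow the !quiet example shown earlier):

```python
from pathlib import Path

def build_quiet_block(macro_dir):
    """Concatenate every *.pp module (sorted, for a stable load order)
    into a single !quiet block ready to be prepended to a source doc."""
    sources = [p.read_text() for p in sorted(Path(macro_dir).glob("*.pp"))]
    body = "\n".join(sources)
    return "!quiet\n~~~~~~~~~~\n" + body + "\n~~~~~~~~~~\n"
```

The returned string would be prepended to the markdown source in memory, and the whole thing piped to pp's STDIN.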

NOTE 3: The !quiet() macro-injection approach prevents using Pandoc-style title blocks (inside source docs) because they would no longer be at the beginning of the source file — a small price to pay. That is, unless the CMS could parse the source doc and inject the macro definitions after the Pandoc header (which requires some work and might slow things down). YAML headers and blocks, on the other hand, are not affected by this.
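The "inject after the header" variant mentioned in NOTE 3 is not much work for YAML-delimited headers; here is a rough Python sketch (it only recognizes a `---` header starting on the very first line, which is what Pandoc requires anyway):

```python
import re

# A leading Pandoc YAML header: `---` on line 1, closed by `---` or `...`.
HEADER_RE = re.compile(r"\A---\n.*?\n(?:---|\.\.\.)\n", re.DOTALL)

def inject_after_header(doc, block):
    """Insert `block` right after a leading YAML header, or at the very
    top of the document when no header is present."""
    m = HEADER_RE.match(doc)
    if m:
        return doc[:m.end()] + block + doc[m.end():]
    return block + doc
```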

Conclusions

From working on the CMS prototype, I've realized that the following proposed behaviour might simplify using PP in similar contexts. I'm not sure whether this is even possible, but it would be great if PP could accept two independent STDIN streams.

Basically, if there were an alternative to the -import=FILE option capable of accepting a STDIN stream (emitting no text) before carrying on with the rest of PP's invocation line, it would make it possible to feed macro definitions to PP without resorting to the injection technique described above.

Something along these lines:

pp -import-stdin 

... where PP waits for a first STDIN stream (which it treats as it would -import=FILE), and then waits for a second STDIN stream, which it processes as usual.

It could also be used in conjunction with source files from disk, still allowing the macros to be fed via STDIN:

pp -import-stdin somefile.md

... where PP first gets the macro definitions from STDIN (supplied from memory by some app) and then processes a file from disk, as usual. This prevents redundant disk access for getting the macro definitions.

Other areas of PP improvement might relate to YAML blocks. In complex projects involving Pandoc, YAML headers and files seem the natural solution for handling settings inheritance and overriding. Any built-in macros that support working directly with YAML definitions, blocks, or files could make a big difference and are a potential area of interest.

For example: functionality to merge YAML definitions of the same variable from different files into a cumulative array of values, instead of having Pandoc ignore subsequent definitions. Keywords, say, could be inherited and merged down the YAML hierarchy, building up a more specific set with each subfolder level.
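The proposed cumulative merge can be sketched in Python (an illustration of the idea only — nothing like this exists in PP; the sample keys are made up):

```python
# Cumulative merge: scalar keys keep the first definition (Pandoc's
# left-biased rule), while list-valued keys such as `keywords`
# accumulate values down the folder hierarchy.
def cumulative_merge(blocks):
    merged = {}
    for block in blocks:  # most specific block first, site-wide last
        for key, value in block.items():
            if isinstance(value, list) and isinstance(merged.get(key), list):
                merged[key] += [v for v in value if v not in merged[key]]
            elif key not in merged:
                # copy lists so the input blocks are never mutated
                merged[key] = list(value) if isinstance(value, list) else value
    return merged

page   = {"title": "Regex Intro", "keywords": ["regex"]}
folder = {"keywords": ["tutorials", "regex"]}
site   = {"keywords": ["coding"], "lang": "en"}

print(cumulative_merge([page, folder, site]))
# {'title': 'Regex Intro', 'keywords': ['regex', 'tutorials', 'coding'], 'lang': 'en'}
```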

CDSoft commented 7 years ago

I don't think that redundant disk accesses are a problem, since the OS is supposed to cache files in memory. Unless you deal with huge files, they will be read once (until the OS flushes the disk cache) for all subsequent pp executions (or any other process).

Processes have only one stdin (file descriptor 0 is stdin, 1 is stdout, 2 is stderr; other descriptors may be used for any file or pipe opened by the process). pp works as a Unix filter: it reads stdin (and/or parameters) and outputs the result on stdout. -import is the only way to get some macro definitions without polluting stdin.

pp parses text and macros but not YAML (I don't think that mixing parsers would be a good idea). If you want pp to generate and/or merge YAML definitions you can either use external tools or write macros that generate the YAML structure.

tajmone commented 7 years ago

> I don't think that redundant disk access are a problem since the OS is supposed to cache files in memory. Unless you deal with huge files, they will be read once (until the OS flushes the disk cache) for all subsequent pp executions (or any other process).

Good to know! I wasn't aware of this. It makes things much easier and spares me managing all those files in memory. I'm using a cross-platform (Win, Mac, Linux) language for the CMS; all filesystem operations are handled by the language's libraries, but I should be able to verify performance with the language's debugger and purifier.

Then the double STDIN stream wouldn't be needed at all, since all operations would work from file sources.

> pp parses text and macros but not YAML (I don't think that mixing parser would be a good idea).

I know this would mean implementing a full YAML parser on top of PP, which is beyond the scope of a generic preprocessor. Still, if there were an enhancement to be made in that direction, I guess YAML would be the number-one choice. As I previously pointed out in issue #9, I've found some external tools that parse YAML contents into env-vars, and they proved rather useful. But it might be even better to go for Pandoc filters instead.

Thanks for the info, it came at a good time and spared me investing lots of time in the wrong direction. I really think that PP can bring great innovation to pandoc-driven CMSs. For script-based CMSs, I think that two good candidate base projects are Pandocomatic and Metalsmith.

The former is a strong CMS built specifically around Pandoc, and it accepts preprocessors by default (I haven't yet had time to test whether it integrates well with PP).

The latter is particularly promising because it is a highly customizable, plugin-driven, multipurpose CMS framework. It already has a pandoc wrapper (metalsmith-pandoc) and would only require a PP plugin to make the magic happen.

The good thing about PP + Pandoc is that you have a simple base (markdown or reST, which have simple rules) and you add custom macros tailored to your needs. So you avoid the cumbersomeness of big applications (like AsciiDoc/Asciidoctor and their complex rules), but still enjoy the freedom of accomplishing whatever you want by reaching for external tools or scripts. Once a project is set up, with all its PP macros in place, maintenance is really easy — by changing a single macro you can affect the whole project's output. And the good part is that it requires neither programming skills nor knowledge of any specific language, so you can expect a wider contributor pool.

tajmone commented 7 years ago

Thanks a lot for this tip, Christophe! I did some research on the topic to fill the gap, and realized it would be pointless to implement all those manual memory buffers. From the documentation of the language I'm currently using for the project, I see that I can set the file-buffer size (or disable buffering altogether) for all file I/O operations, and even force buffers to flush on write operations — and this works cross-platform.

So my app should benefit from any OS's built-in file-caching system, and the language handles it transparently through its file library.