[epic] MarkdownDB plugin system

rufuspollock commented 1 year ago

We want a plugin system in MarkdownDB so that people can easily extend the core functionality, for example to extract additional metadata. That way, not all functionality has to live in core, and people can add new functionality quickly.

Sketch (April 2023)

https://link.excalidraw.com/l/9u8crB2ZmUo/9hkrQmVl9QX

Acceptance

[ ] Identify the different types of plugins ✅2023-11-19 roughly: parsing, computing, validating (and maybe serializing ...)
[ ] Research how remark works to see if we can reuse it 🚧2023-11-19 see notes in comment below
[ ] Design of MarkdownDB and especially the plugin system.
- [ ] extract first heading as title metadata
- [ ] add a metadata field
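
A minimal sketch of the "extract first heading as title" case, assuming a plugin is simply a function that receives the parsed markdown tree (in the mdast shape that remark-parse produces) and returns extra metadata fields. The tree is hand-written so the snippet runs without dependencies, and `extractTitle` and `nodeText` are hypothetical names, not MarkdownDB API.

```javascript
// Hand-written mdast-like tree for "# Hello *world*\n\nSome text."
// (a stand-in for what remark-parse would produce).
const tree = {
  type: "root",
  children: [
    {
      type: "heading",
      depth: 1,
      children: [
        { type: "text", value: "Hello " },
        { type: "emphasis", children: [{ type: "text", value: "world" }] },
      ],
    },
    { type: "paragraph", children: [{ type: "text", value: "Some text." }] },
  ],
};

// Concatenate the text of a node and all of its descendants.
function nodeText(node) {
  if (typeof node.value === "string") return node.value;
  return (node.children || []).map(nodeText).join("");
}

// Hypothetical "computing" plugin: returns metadata fields to merge into the file record.
// Only top-level children are checked, and the first heading of any depth wins.
function extractTitle(tree) {
  const heading = tree.children.find((n) => n.type === "heading");
  return heading ? { title: nodeText(heading) } : {};
}

const metadata = { ...extractTitle(tree) };
console.log(metadata); // { title: 'Hello world' }
```

Returning a plain object keeps "add a metadata field" and "extract title" the same kind of operation: each plugin contributes fields that get merged into the record.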

Notes

MarkdownDB vs Contentlayer

Contentlayer supported:

document types with
- frontmatter schema definition and validation
- assigning document types based on glob patterns
- computed fields, e.g. description auto-extracted from the document content
excluding/including some content folders (we kind of already have this, but it's not configurable)
...
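
To make "frontmatter schema definition and validation" concrete, here is a minimal sketch. The schema shape (`type`, `required`) and the `validateFrontmatter` helper are assumptions for illustration only; they are neither Contentlayer's nor MarkdownDB's API.

```javascript
// Hypothetical schema for a "Post" document type: field name -> expected type and required flag.
const postSchema = {
  title: { type: "string", required: true },
  date: { type: "string", required: true },
  tags: { type: "array", required: false },
};

// Return a list of human-readable problems; an empty list means the frontmatter is valid.
function validateFrontmatter(frontmatter, schema) {
  const errors = [];
  for (const [field, rule] of Object.entries(schema)) {
    const value = frontmatter[field];
    if (value === undefined) {
      if (rule.required) errors.push(`missing required field "${field}"`);
      continue;
    }
    // typeof reports arrays as "object", so distinguish them explicitly.
    const actual = Array.isArray(value) ? "array" : typeof value;
    if (actual !== rule.type) {
      errors.push(`field "${field}" should be ${rule.type}, got ${actual}`);
    }
  }
  return errors;
}

// Reports two problems: "date" is missing and "tags" has the wrong type.
console.log(validateFrontmatter({ title: "Hello", tags: "news" }, postSchema));
```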

What we need:

probably a config file similar to Contentlayer's, with:
- custom document types,
- content include/exclude option
- plugins
- ...
...
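
A rough sketch of what such a config could look like, and how document types might be assigned from glob patterns. Every key name here (`include`, `exclude`, `documentTypes`, `pattern`, `plugins`) is an assumption, not an existing MarkdownDB option, and the tiny glob matcher only understands `*` and `**`; a real implementation would more likely use an existing glob library.

```javascript
// Hypothetical config object; none of these keys exist in MarkdownDB today.
const config = {
  include: ["**/*.md"],
  exclude: ["drafts/**"],
  documentTypes: [
    { name: "Post", pattern: "blog/**/*.md" },
    { name: "Page", pattern: "**/*.md" },
  ],
  plugins: [],
};

// Minimal glob -> RegExp conversion supporting only `*` and `**`.
function globToRegExp(glob) {
  const pattern = glob.replace(/\*\*\/|\*\*|\*|[.+^${}()|[\]\\?]/g, (m) => {
    if (m === "**/") return "(?:.*/)?"; // zero or more directories
    if (m === "**") return ".*"; // anything, including slashes
    if (m === "*") return "[^/]*"; // anything within one path segment
    return "\\" + m; // escape regex metacharacters
  });
  return new RegExp(`^${pattern}$`);
}

const matchesAny = (path, globs) => globs.some((g) => globToRegExp(g).test(path));

// Return the name of the first matching document type, or null if the file is skipped or untyped.
function resolveDocumentType(path, cfg) {
  if (!matchesAny(path, cfg.include) || matchesAny(path, cfg.exclude)) return null;
  const type = cfg.documentTypes.find((t) => globToRegExp(t.pattern).test(path));
  return type ? type.name : null;
}

console.log(resolveDocumentType("blog/2023/post.md", config)); // Post
console.log(resolveDocumentType("about.md", config)); // Page
console.log(resolveDocumentType("drafts/wip.md", config)); // null
```

Because the first matching type wins, more specific patterns need to be listed before catch-all ones.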

rufuspollock commented 11 months ago

Doing a bunch of research on remark and micromark re the parsing part of this. Could remark be our plugin system here? (probably)

[x] Should we just build on top of the remark ecosystem i.e. use remark plugins for doing the parsing? ✅2023-11-19 my sense is yes
- [x] Should we use remark plugins or micromark (what's the difference even?). 🚧2023-11-19 still confused on this one (as others are) but my sense is we just use remark and its plugins
[ ] How do you create a plugin? 🚧2023-11-19 see https://github.com/remarkjs/remark/blob/main/doc/plugins.md and its guide
- How do you pass data around? see notes below (no answer yet!) 🚧2023-11-19 there is something called messages ...
[ ] What remark plugins could we learn from?
- For tasklists: https://github.com/micromark/micromark-extension-gfm-task-list-item
- How would we extract tags?
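
One possible answer for tags and tasks, as a sketch: walk the parsed tree, collect `#tag` tokens from text nodes, and collect list items that carry a boolean `checked` field (which is how mdast represents GFM task list items). The `#tag` syntax, the output shape and the helper names are assumptions, and the tree is hand-written so the snippet runs without remark installed.

```javascript
// Hand-written mdast-like tree for:
//   Notes on #markdown and #plugins.
//   - [x] research remark
//   - [ ] design plugin API
const doc = {
  type: "root",
  children: [
    { type: "paragraph", children: [{ type: "text", value: "Notes on #markdown and #plugins." }] },
    {
      type: "list",
      children: [
        {
          type: "listItem",
          checked: true,
          children: [{ type: "paragraph", children: [{ type: "text", value: "research remark" }] }],
        },
        {
          type: "listItem",
          checked: false,
          children: [{ type: "paragraph", children: [{ type: "text", value: "design plugin API" }] }],
        },
      ],
    },
  ],
};

// Depth-first walk calling `fn` on every node (a tiny stand-in for unist-util-visit).
function walk(node, fn) {
  fn(node);
  (node.children || []).forEach((child) => walk(child, fn));
}

// Collect the text content below a node.
function textOf(node) {
  let out = "";
  walk(node, (n) => {
    if (n.type === "text") out += n.value;
  });
  return out;
}

// Hypothetical extraction plugin: returns data only, leaves the tree untouched.
function extractTagsAndTasks(tree) {
  const tags = new Set();
  const tasks = [];
  walk(tree, (node) => {
    if (node.type === "text") {
      for (const m of node.value.matchAll(/(?:^|\s)#([\w-]+)/g)) tags.add(m[1]);
    }
    if (node.type === "listItem" && typeof node.checked === "boolean") {
      tasks.push({ description: textOf(node), checked: node.checked });
    }
  });
  return { tags: [...tags], tasks };
}

const result = extractTagsAndTasks(doc);
console.log(result.tags); // [ 'markdown', 'plugins' ]
console.log(result.tasks.length); // 2
```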

Can you pass "data" along the chain of a plugin

This example https://github.com/remarkjs/remark/issues/251 talks about word counts but it console logs the info ...

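// Note: this uses the older CommonJS API of unified and its plugins, as in the linked issue.
var unified = require('unified');
var parse = require('remark-parse');
var stringify = require('remark-stringify');
var english = require('retext-english');
var remark2retext = require('remark-retext');
var visit = require('unist-util-visit');

unified()
  .use(parse)
  // Bridge from the markdown tree to a natural-language tree and run `count` on it.
  .use(remark2retext, unified().use(english).use(count))
  .use(stringify)
  .processSync('*This* and _that_. \n> And some more stuff.\n\nAnd another thing.');

// Plugin: returns the transformer that is run on the tree.
function count() {
  return counter;
  function counter(tree) {
    var counts = {};
    visit(tree, visitor);
    // The result is only logged here; it is not passed on to anything else.
    console.log(counts);
    // Tally how many nodes of each type the tree contains.
    function visitor(node) {
      counts[node.type] = (counts[node.type] || 0) + 1;
    }
  }
}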

{ RootNode: 1,
  ParagraphNode: 3,
  SentenceNode: 3,
  WordNode: 10,
  TextNode: 10,
  WhiteSpaceNode: 10,
  PunctuationNode: 3 }
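
On passing data along the chain: in unified, a transformer is called with `(tree, file)`, and the `file` (a VFile) has a `data` object meant for custom information, so the counter could store its result there instead of logging it. The sketch below mimics that contract with a plain object and a small `run` helper. Both are stand-ins so the snippet runs without installing unified, and `summarize` is a hypothetical second plugin. With the real library the result would be read from the file returned by `process` or `processSync`.

```javascript
// Stand-in for a unified pipeline: each transformer is called with (tree, file).
function run(transformers, tree) {
  const file = { data: {} }; // mimics VFile's `data` field
  for (const transform of transformers) transform(tree, file);
  return file;
}

// Depth-first walk calling `fn` on every node (a tiny stand-in for unist-util-visit).
function walk(node, fn) {
  fn(node);
  (node.children || []).forEach((child) => walk(child, fn));
}

// Same counting logic as the example above, but the result is stored on file.data.
function count() {
  return function counter(tree, file) {
    const counts = {};
    walk(tree, (node) => {
      counts[node.type] = (counts[node.type] || 0) + 1;
    });
    file.data.counts = counts;
  };
}

// A later plugin in the chain can read what an earlier one stored.
function summarize() {
  return function summarizer(tree, file) {
    file.data.wordCount = file.data.counts.WordNode || 0;
  };
}

// Hand-written nlcst-like tree for "This and that."
const sentenceTree = {
  type: "RootNode",
  children: [
    {
      type: "ParagraphNode",
      children: [
        {
          type: "SentenceNode",
          children: [
            { type: "WordNode", children: [{ type: "TextNode", value: "This" }] },
            { type: "WhiteSpaceNode", value: " " },
            { type: "WordNode", children: [{ type: "TextNode", value: "and" }] },
            { type: "WhiteSpaceNode", value: " " },
            { type: "WordNode", children: [{ type: "TextNode", value: "that" }] },
            { type: "PunctuationNode", value: "." },
          ],
        },
      ],
    },
  ],
};

const file = run([count(), summarize()], sentenceTree);
console.log(file.data.wordCount); // 3
```

For MarkdownDB this would mean that after processing a file, whatever the plugins left in `file.data` is what gets stored alongside the other metadata.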

mohamedsalem401 commented 11 months ago

The immediate question that arises is how the output of running plugins can be stored. Let's consider a straightforward example using a simple plugin available at https://github.com/florianeckerstorfer/remark-a11y-emoji. This plugin wraps emojis in a <span> tag and sets the emoji name as the aria-label.

Assuming we successfully run the markdown files through such plugins, the next question is where the newly generated markdown should be stored. Currently, the library only generates SQL databases from metadata and has no method for loading the content of a file.

Possible solutions include:

- Add Content to Database/JSON: Store each file's body content in the generated database or local JSON files. This approach consolidates the parsed content along with metadata.
- Generate Separate Markdown Files: Create a designated folder, say .markdown, and generate the markdown files there after parsing. This process involves removing metadata from the files.
- Introduce a Loading Method: Implement a method like loadFile(file_path) to retrieve the content of a given file after running the plugins. However, a drawback of this approach is that it does not help users who generate the database/JSON files using the library but employ another tool to load the markdown file content.

rufuspollock commented 11 months ago

@mohamedsalem401 we aren't using plugins to transform markdown at all - we are using plugins to extract information from the markdown and then store that somewhere ...

See the section of my last comment titled "Can you pass "data" along the chain of a plugin": we just want to pass data along the chain. Or see the example above, where it computes word counts etc.

To repeat: we are not using remark plugins to transform the content but rather to extract information from it ...
