Proposed enhancement: Automatically generate post metadata from plain markdown file

simon-brooke commented 1 year ago

Currently, Cryogen requires an EDN-formatted map to be the first textual item in a file which is otherwise a Markdown file. The result is that Cryogen posts are awkward to edit in normal Markdown editors.

If this map is not present, cryogen-core.compiler/parse-post throws an exception at line 177, which could trivially be caught.

It seems to me that it would be straightforward, at least on UN*X platforms, to write a function which could extract key metadata directly from a plain Markdown file and automagically create the map. For example:

- author can be derived from the 'real name' field in /etc/passwd for the currently logged-in user;
- title can be derived from the first top-level heading (i.e. the first line beginning '# ') in the file;
- date can be derived by
  1. pattern matching the filename against `^\(dddd-dd-dd\)` (each d standing for a digit) and, if matched, using that; or
  2. using the file's creation date;
- tags could be derived by finding the first line in the file matching `^\*\*Tags:\*\*\(.*\)` and treating the remainder of that line as a comma-separated list of tags.
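
A minimal sketch of the title, date and tags rules above, to make them concrete. The function names and the sample text in the usage note are hypothetical, not part of cryogen-core, and the filename rule assumes a leading yyyy-mm-dd; the fall-back to the file's creation date is left to the caller.

```clojure
(require '[clojure.string :as str])

(defn infer-title
  "Text of the first top-level heading (a line beginning '# '), or nil."
  [markdown]
  (some->> (str/split-lines markdown)
           (some #(second (re-matches #"#\s+(.*)" %)))
           str/trim))

(defn infer-date-string
  "Leading yyyy-mm-dd of a file name, or nil if the name does not start
  with one (the caller can then fall back to the file's date)."
  [file-name]
  (second (re-find #"^(\d{4}-\d{2}-\d{2})" file-name)))

(defn infer-tags
  "Tags from the first line of the form '**Tags:** a, b, c', or an empty vector."
  [markdown]
  (if-let [[_ tags] (some #(re-matches #"\*\*Tags:\*\*(.*)" %)
                          (str/split-lines markdown))]
    (->> (str/split tags #",")
         (map str/trim)
         (remove str/blank?)
         vec)
    []))
```

For instance, `(infer-tags "# A post\n\n**Tags:** one, two")` yields a vector of the two trimmed tag names, and `(infer-date-string "notes.md")` yields nil.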

More interestingly, and more in line with what I'm working on now, it could derive description from the first non-header paragraph in the file, and a map such as

:image {:path  "/img/uploads/refugeeswelcome.png"
        :alt "White text, 'Can you imagine a land where refugees are welcome? Yes!' over a blue toned monochrome image showing Roza Salih, Kurdish refugee, being elected as a Glasgow City councillor."
        :width 600
        :height 424
        :type "image/jpeg"}

by examining the first line matching `^!\[\([^]]*\)\](\([^)]*\))`, and the file indicated by the second capture group in that pattern.
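
As a rough sketch of that extraction (hypothetical function names, Java regex syntax rather than the form quoted above): the alt text and path come from the first line that starts with a Markdown image, and the description from the first paragraph that is not a heading. Skipping image and `**Tags:**` paragraphs when picking the description is an assumption of this sketch. The :width, :height and :type keys would still have to be read from the image file itself.

```clojure
(require '[clojure.string :as str])

(defn infer-image
  "Map of :path and :alt for the first line starting with a Markdown image, or nil."
  [markdown]
  (when-let [[_ alt path] (some #(re-find #"^!\[([^\]]*)\]\(([^)]*)\)" %)
                                (str/split-lines markdown))]
    {:path path :alt alt}))

(defn infer-description
  "First paragraph that is not a heading, an image or a tags line,
  with internal whitespace collapsed to single spaces; nil if there is none."
  [markdown]
  (some-> (->> (str/split markdown #"\n\s*\n")
               (map str/trim)
               (remove #(or (str/blank? %)
                            (str/starts-with? % "#")
                            (str/starts-with? % "![")
                            (str/starts-with? % "**Tags:**")))
               first)
          (str/replace #"\s+" " ")))
```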

This then allows you to add the following to base.html in any theme:

    <meta property="og:site_name" content="{{title}}" />
    <link href="{{rss-uri}}" rel="alternate" type="application/rss+xml" title="RSS 2.0" />
    {% block meta %}
    <meta property="og:title" content="Home" />
    <meta property="og:type" content="website" />
    <meta property="og:image" content="/blog/img/home.png" />
    <meta name="description" content="{{description}}" />
    <meta name="keywords"
        content="{% for tag in tags|sort-by:name %}{{tag.name}}{% if not forloop.last %},{% endif %}{% endfor %}" />
    {% endblock %}

and the following to the post.html and page.html of any theme:

    {% if post.description %}
    <meta name="description" content="{{post.description}}"/>
    {% comment %} OpenGraph tags {% endcomment %}
    <meta property="og:description" content="{{post.description}}"/>
    {% endif %}
    {% if post.image %}
        {% if post.image.path %}
        <meta property="og:image" content="{{site-url}}{{blog-prefix}}{{post.image.path}}"/>
        {% if post.image.type %}<meta property="og:image:type" content="{{post.image.type}}"/>{% endif %}
        {% if post.image.width %}<meta property="og:image:width" content="{{post.image.width}}"/>{% endif %}
        {% if post.image.height %}<meta property="og:image:height" content="{{post.image.height}}"/>{% endif %}
        {% if post.image.alt %}<meta property="og:image:alt" content="{{post.image.alt}}"/>{% endif %}
        {% else %}
        <meta property="og:image" content="{{site-url}}{{blog-prefix}}{{post.image}}"/>
        {% endif %}
    {% endif %}
    <meta property="og:url" content="{{site-url}}{{uri}}"/>
    <meta property="og:title" content="{{post.title}}"/>
    <meta property="og:type" content="article"/>
    <meta name="twitter:card" content="summary_large_image"/>
    <meta name="twitter:domain" content="journeyman.cc"/>
    <meta name="twitter:url" content="{{site-url}}{{uri}}"/>
    <meta name="twitter:title" content="{{post.title}}"/>
    <meta name="twitter:description" content="{{post.description}}"/>
    {% if post.image %}
    <meta name="twitter:image" content="{{site-url}}{{blog-prefix}}{% if post.image.path %}{{post.image.path}}{% else %}{{post.image}}{% endif %}"/>
    {% endif %}

and thus generate valid OpenGraph meta-tags as seen here.

I'm almost certainly going to do this for my own use anyway. Would a pull request with this as an enhancement be accepted?

It could be implemented in one of two ways:

1. When a file with no map is encountered, it could be modified to add the map and written back into the [posts|pages] directory; or
2. Each time a file with no map is encountered, cryogen-core.compiler/parse-post could automagically create the map on the fly.

The second solution would have the advantage that the markdown file would not be altered, and thus would render nicely in a markdown editor; but it would obviously be substantially slower.
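
A sketch of how the on-the-fly option might look, assuming the post text is already in a string. `embedded-meta` and `post-meta` are hypothetical names, and this is not how parse-post is actually written; `infer-fn` stands for whatever function builds the map from plain Markdown.

```clojure
(require '[clojure.edn :as edn])

(defn embedded-meta
  "The leading EDN map of a post, or nil when the text does not start with one."
  [text]
  (try
    ;; edn/read-string reads only the first form in the string.
    (let [form (edn/read-string text)]
      (when (map? form) form))
    ;; Text such as \"# Heading\" is not valid EDN and makes the reader throw.
    (catch Exception _ nil)))

(defn post-meta
  "Embedded metadata when present; otherwise metadata built by infer-fn,
  flagged with :inferred-meta so later stages can tell the two apart."
  [text infer-fn]
  (or (embedded-meta text)
      (assoc (infer-fn text) :inferred-meta true)))
```

Because nothing is written back, the Markdown file stays untouched, and the cost is one extra pass over the text for each file that lacks a map.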

yogthos commented 1 year ago

Yeah, I think that'd be a sensible thing to do, and I prefer the second solution as well. I don't expect performance to be a big issue in practice, and avoiding editing the original file seems like it would be best. I generally prefer not mixing user-edited and generated content.

simon-brooke commented 1 year ago

Would it be acceptable to you if, as part of this solution, I move cryogen-core.compiler/parse-post-date to cryogen-core.util? It seems better to me that:

- I keep my changes in their own namespace;
- I don't reinvent how to parse post dates.

This requires the function to be in a common namespace accessible to both cryogen-core.compiler and my proposed cryogen-core.infer-meta, and cryogen-core.util seems appropriate.

yogthos commented 1 year ago

Yeah, that makes sense to me. 👍

simon-brooke commented 1 year ago

A reasonably reliable way of detecting the MIME types of image files, needed for Open Graph meta tags, is to use Apache Tika via its clj-tika wrapper. However, this drags in an enormous and heavyweight stack of other libraries.

Similarly, the way I'm used to detecting image sizes, again needed for Open Graph meta tags, is with Mike Anderson's mikera/imagez library, but this too is not lightweight.
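
For comparison only, here is a sketch of a dependency-free route that is not discussed in this thread: the JDK's own javax.imageio can report dimensions and a MIME type from the image header. `image-info` is a hypothetical name, and this covers only the formats the JDK ships readers for, so it is far less thorough than Tika.

```clojure
(import '(javax.imageio ImageIO))

(defn image-info
  "Map of :width, :height and :type for an existing image file,
  or nil when no ImageIO reader recognises its content."
  [file]
  (with-open [stream (ImageIO/createImageInputStream file)]
    (let [readers (ImageIO/getImageReaders stream)]
      (when (.hasNext readers)
        (let [reader (.next readers)]
          (try
            (.setInput reader stream)
            ;; Index 0 is the first (usually only) image in the file.
            {:width  (.getWidth reader 0)
             :height (.getHeight reader 0)
             :type   (first (.getMIMETypes (.getOriginatingProvider reader)))}
            (finally (.dispose reader))))))))
```

The reader is chosen from the file's content rather than its extension, so a file whose name and actual format disagree is reported by its actual format.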

We do not have to generate rich Open Graph data, but (for my own purposes) I'd like to. What are your feelings about this?

simon-brooke commented 1 year ago

H'mmm... Pantomime seems to now be preferred over clj-tika, but it doesn't change the argument: these are heavyweight libraries to be including for what is a marginal gain. Should I do this?

simon-brooke commented 1 year ago

Progress report: it's doing everything I want except inferring the author's real name. I have found (different) hacks for doing this on Linux, macOS and Windows, and could write a little wrapper around all three; but given that we already have an :author key in the standard config.edn, this may be a bridge too far.

Thoughts?

yogthos commented 1 year ago

I think pulling author info from the config would make sense.

simon-brooke commented 1 year ago

A further progress report: this now works, except for a couple of minor issues:

- the h1 line used to infer the title remains in the document, so the title is shown twice in the output (I can fix this without modifying compiler.clj, but it would require a modification to all themes);
- the `**Tags:**` line used to infer the tags remains in the document, but does not have the requisite links; I can't fix that without filtering the line out in content-dom->html, which I can do, but only if I pass the page meta-data in via params.

I'm currently adding :inferred-meta true to the meta-data of all pages which don't contain embedded meta-data.

I would suggest that it might be worth memoising page-content, since it is called multiple times on the same page during the compilation process and has some compute cost.
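
To illustrate the memoising idea with a stand-in (the function below is hypothetical and is not cryogen-core's page-content): clojure.core/memoize caches results per distinct argument list, so repeated calls for the same page do the work once.

```clojure
;; Counts how often the expensive function really runs.
(def read-count (atom 0))

(defn slow-page-content
  "Hypothetical stand-in for an expensive per-page function."
  [path]
  (swap! read-count inc)
  (str "content of " path))

;; Same arguments => cached result; new arguments => one fresh call.
(def page-content* (memoize slow-page-content))
```

One design point to keep in mind: the cache lives as long as the var does, so a long-running process that rebuilds when files change would need to recreate the memoised function before each compilation, or it would serve stale content.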

In summary: I still intend to proceed with this for my own purposes, but it's becoming less of a small, tactical fix than I had hoped. Would a pull request still be welcome? Work in progress is here.

yogthos commented 1 year ago

I think it might be better to modify the compiler so that existing themes keep working, which would make it easier for people to upgrade. I agree with memoising; there's no point reading the info over and over since it's used repeatedly. I think a PR would still be welcome: it looks like most changes are in a new namespace, and it's an opt-in feature.

@lacarmen thoughts?

cryogen-project / cryogen-core

Proposed enhancement: Automatically generate post metadata from plain markdown file #161