PataphysicalSociety / soupault

Static website generator based on HTML element tree rewriting
https://soupault.app
MIT License

Schema.org structured data support (example with org-mode file). #39

Open MorphicResonance opened 2 years ago

MorphicResonance commented 2 years ago

Processing various metadata from org-mode is part of my attempt to create org-structured, "content first", search-engine-friendly web pages. But the question is bigger and concerns all users. With soupault we chose the conventional way of HTML formatting for design and abandoned templates, their variations, and other opinionated hints. The widely used, conventional way of machine-readable content markup is developed by schema.org. After some progress, I think these are the simple requirements for a plugin that should transfer meta tags into the <head> section of a web page.

  1. Data for meta tags such as the title and meta description.
  2. Machine-readable data for search engine robots as JSON-LD.

The good news is that it is likely to be possible to convert just the input text, without having to write a separate YAML block for JSON-LD. I talked to the developers from Stencila; they will take care of it.

So the second item is handled by the converter, and only the first task remains: extracting the data for the meta tags.

My case does not use JSON-LD and works with microdata; just take note of the JSON-LD case. This is the input file:

#+begin_example

#+meta_title: this is a title of the page
#+meta_description: this is a metadescription of the page
#+title: A simple Org Mode article for testing
#+author: Nokome Bentley

* Introduction

A simple Org Mode article for testing. When making changes please note
that test snapshots based on this fixture may need to be updated.

* Methods

This is the methods section.

* Results

The results include a table (Table 1).

| Group | Value |
|-------+-------|
| A     | 1.1   |
| B     | 2.2   |

* Discussion

This is the discussion section.

#+end_example

The plugin should take "this is a title of the page" from #+meta_title: and "this is a metadescription of the page" from #+meta_description:. If #+meta_title: does not exist, it should take the data from #+title: (in that case the web page title and the article title will be identical).

Then it should delete the #+meta_... lines completely and leave the others as they are (#+title: should be kept). The other properties will be applied by the converter for microdata markup.

Then it should place the title and meta description values into the title and meta description tags of the page.

<head>
....
<title>{{meta_title}}</title>
<meta name="description" content="{{meta_description}}" />
....
</head>
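
The three steps above (read the values with a fallback to #+title:, delete the #+meta_... lines, fill the two head tags) can be sketched roughly as follows. This is only an illustration in Python, not a soupault plugin: soupault plugins are written in Lua, and the names `extract_head_fields`, `render_head`, and `META_LINE` are hypothetical.

```python
import re
from html import escape

# Matches one "#+key: value" line anywhere in the text.
META_LINE = re.compile(r"^#\+(\w+):[ \t]*(.*)$", re.MULTILINE)


def extract_head_fields(org_text):
    """Return (meta_title, meta_description, cleaned_text)."""
    fields = {key.lower(): value.strip() for key, value in META_LINE.findall(org_text)}
    # Fall back to #+title: when #+meta_title: is absent.
    meta_title = fields.get("meta_title") or fields.get("title", "")
    meta_description = fields.get("meta_description", "")
    # Delete only the #+meta_... lines; #+title: and everything else stay.
    cleaned = re.sub(
        r"^#\+meta_\w+:.*\n?", "", org_text, flags=re.MULTILINE | re.IGNORECASE
    )
    return meta_title, meta_description, cleaned


def render_head(meta_title, meta_description):
    """Fill the two head tags, escaping the values for HTML."""
    return (
        "<title>" + escape(meta_title) + "</title>\n"
        '<meta name="description" content="' + escape(meta_description) + '" />'
    )
```

The cleaned text is what would be passed on to the converter, so #+title: and #+author: remain available for the microdata markup.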

This is a basic version of the plugin, since conversion from org-mode to HTML by Stencila is still in development. But it is clear how the plugin should be written; I don't think there will be much difference from the above.

dmbaturin commented 2 years ago

I haven't forgotten your request.

Please remind me: the title field should go to the page <title> in its <head>, but what exactly do you want to do with the other fields?

Ideally, I'd like to see examples of source pages in the Org format and hand-written mockups of output pages you want to produce from them.

MorphicResonance commented 2 years ago

As soupault requires a single input file for each potential page, the structure of the input org file should be similar:

  1. a block of meta tags and data for the web page, escaped from pandoc
  2. the body: paragraphs of text and other complex HTML blocks formatted as schema objects and escaped from pandoc. Escaping from pandoc is done by wrapping our data in these tags:
    #+BEGIN_EXPORT html
    our data, possibly HTML-formatted, escaped from pandoc.
    May include microdata with schema objects.
    Video, audio, and others for embedding into the article body of the page template.
    #+END_EXPORT

    So let's provide our data from the input .org file -----start of org file------

    
    #+BEGIN_EXPORT html
    <site-meta-data>
    #+title: post 1 title
    #+subtitle: Post 1 subtitle
    #+description: Post 1 description
    #+author: Billy
    #+date: 2021-11-03
    #+datepublished: 2021-06-02
    #+usertags: fish, animal
    #+summary: Post 1 summary
    #+id: 1-test1com
    </site-meta-data>
    #+END_EXPORT

Fish are aquatic, craniate, gill-bearing animals that lack limbs with digits. They form a sister group to the tunicates, together forming the olfactores. Included in this definition are the living hagfish, lampreys, and cartilaginous and bony fish as well as various extinct related groups. Around 99% of living fish species are ray-finned fish, belonging to the class Actinopterygii, with over 95% belonging to the teleost subgrouping. sentence.
** test heading 1
text 1
*** heading 2
text 2

Inermis indoctum vis in, has soleat complectitur te.

#+BEGIN_EXPORT html

    <div itemprop="video" itemscope itemtype="https://schema.org/VideoObject">
      <video controls poster="/video/big_buck_bunny.jpg">
        <source itemprop="contentUrl" type="video/mp4" src="/video/big_buck_bunny.mp4">
        <source itemprop="contentUrl" type="video/webm" src="/video/big_buck_bunny.webm">
        I’m sorry, your browser doesn’t support HTML5 video in MP4 with H.264 or WebM with VP8/VP9.
      </video>
      <p><small>Video copyright 2008, Blender Foundation / www.bigbuckbunny.org.</small></p>
      <meta itemprop="name" content="Video example">
      <meta itemprop="description" content="An example HTML5 video file.">
      <meta itemprop="duration" content="T60S">
      <meta itemprop="uploadDate" content="2018-09-21T10:44:26Z">
      <meta itemprop="thumbnailUrl" content="/video/big_buck_bunny.jpg">
    </div>

#+END_EXPORT

** conslusiom
bye

------end of input org file---------

So after processing by pandoc, org headings become HTML headings,
the escaped HTML stays as it was, and paragraphs
become HTML-formatted `<p>..</p>`.

To let soupault detect and extract the data, I included it in
special tags `<site-meta-data></site-meta-data>`. I would like soupault
to extract it as an array and insert the different pieces, according to
their purpose, into the meta tags of the page template and also into the body of the page.
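
As a rough illustration of "extract it as an array", the sketch below pulls the `<site-meta-data>` block out of a text and returns its entries as a dictionary. It is plain Python rather than a soupault plugin, it assumes the block reaches this step verbatim, and `extract_site_meta`, `BLOCK`, and `ENTRY` are hypothetical names.

```python
import re

# The wrapper element used in this thread to mark the metadata block.
BLOCK = re.compile(r"<site-meta-data>(.*?)</site-meta-data>", re.DOTALL)
# One "#+key: value" entry, possibly indented.
ENTRY = re.compile(r"^[ \t]*#\+(\w+):[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def extract_site_meta(text):
    """Return (metadata dict, text with the <site-meta-data> block removed)."""
    match = BLOCK.search(text)
    if match is None:
        return {}, text
    meta = {key.lower(): value for key, value in ENTRY.findall(match.group(1))}
    remaining = text[: match.start()] + text[match.end():]
    return meta, remaining
```

A template step could then pick only the keys it needs (for example `meta["title"]` and `meta["author"]`) and ignore the rest, such as `id` or `usertags`.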

Basically, the block between
`<site-meta-data></site-meta-data>` contains data that is inserted
into different places of the template, usually not into the article body,
because we write the body sequentially with different [embedded schema objects](https://github.com/philwareham/schema-microdata-examples/blob/main/blog.html).

So if we apply a template like this to the .org file as a web page

<!DOCTYPE html>

let soupault append some of our extracted data as meta tags for the web page. The title and meta description have jumped into the `<head>` section. Dates such as `#+date: 2021-11-03` and `#+datepublished: 2021-06-02` have jumped into schema itemprops as datePublished and dateModified. "Post 1 summary" from `#+summary:` has jumped into the "introduction" section, itemprop="text". Billy from `#+author:` has jumped into itemprop author->name.

```
post 1 title

post 1 title

Post 1 summary

Fish are aquatic, craniate, gill-bearing animals that lack limbs with digits. They form a sister group to the tunicates, together forming the olfactores. Included in this definition are the living hagfish, lampreys, and cartilaginous and bony fish as well as various extinct related groups. Around 99% of living fish species are ray-finned fish, belonging to the class Actinopterygii, with over 95% belonging to the teleost subgrouping. sentence.

test heading 1

text 1

heading 2

text 2

Inermis indoctum vis in, has soleat complectitur te.

Video copyright 2008, Blender Foundation / www.bigbuckbunny.org.

conslusiom

bye

...
```

So we are using the template to control the content style and inserting some of our own data into it. Note that not all the data from the input (id, usertags) was used in this template and this case, so the user should be able to define which named parts of the extracted array soupault inserts into the template, and where.

dmbaturin commented 2 years ago

Could you confirm or deny the following: an org-mode metadata entry will always start with #+, will always contain a string, and will always end with a newline? That is, will #\+(.*)\n be a safe regex for extracting metadata entries?
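
A quick way to check that regex against the kind of input shown earlier in this thread is sketched below in Python. The sample text and the stricter `STRICT` pattern are assumptions made for illustration only; they are not part of soupault or of any existing plugin.

```python
import re

# The regex proposed above, unchanged.
PROPOSED = re.compile(r"#\+(.*)\n")
# Hypothetical stricter variant: anchored at line start and requiring "key:".
STRICT = re.compile(r"^#\+(\w+):[ \t]*(.*)$", re.MULTILINE)

sample = (
    "#+title: post 1 title\n"
    "#+author: Billy\n"
    "#+BEGIN_EXPORT html\n"
    "<p>raw</p>\n"
    "#+END_EXPORT\n"
)

# Every line starting with "#+", including the export block delimiters.
proposed_hits = PROPOSED.findall(sample)
# Only the lines that look like "key: value" metadata entries.
strict_hits = STRICT.findall(sample)
```

On this sample the proposed pattern also captures `BEGIN_EXPORT html` and `END_EXPORT`, while the stricter variant returns only the `title` and `author` pairs. The proposed pattern also needs a trailing newline, so a metadata entry on the last line of a file without one would be missed.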

Since soupault 4.0.0 supports a pre-parse hook, it's now possible to reimplement various types of front matter with that hook. Since that hook works on the page source before it's parsed and before it's decided whether it will be indexed or not, it will also have to produce text.

Does something like this look good to you? I assume the plugin should always put the rendered HTML before the page body. Let me know what you think.

[hooks.pre-parse]
  file = "hooks/org-mode-metadata.lua"
  template = """
    <h1 id="post-title">{{title}}</h1>
    ...
  """
MorphicResonance commented 2 years ago

Yes, meta tags always start with #+ and end with a newline. Names are separated from values by a `:`. I don't see how it can be done with a pre-parse hook, since we need to extract the values for the meta tags, save them to some kind of global variables, delete those lines, and then send the values into the HTML tree. So the pre-parse hook works only for extracting/deleting the lines with meta tags. I see the variant with rendering as a unified version of pandoc's "in the middle" Lua filters. But it is just the same dance with fake tags that I wrote about a long time ago.
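
The steps listed above (extract the values, keep them, delete the lines, hand the values over to the HTML) can be sketched as a single text-to-text function. This is a minimal Python illustration of the logic only, not soupault's Lua hook API; `pre_parse`, `ENTRY`, and `PLACEHOLDER` are hypothetical names, and the `{{name}}` placeholder syntax is assumed from the template shown earlier in the thread.

```python
import re
from html import escape

# One "#+key: value" line, including its trailing newline if present.
ENTRY = re.compile(r"^#\+(\w+):[ \t]*(.*)\n?", re.MULTILINE)
# A "{{name}}" placeholder in the template.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def pre_parse(page_source, template):
    """Strip metadata lines and prepend the template rendered with their values."""
    # Extract and keep the values.
    fields = {key.lower(): value.strip() for key, value in ENTRY.findall(page_source)}
    # Delete the metadata lines from the page text.
    body = ENTRY.sub("", page_source)
    # Fill the template; unknown placeholders render as empty strings.
    rendered = PLACEHOLDER.sub(
        lambda m: escape(fields.get(m.group(1), "")), template
    )
    # Put the rendered HTML before the page body.
    return rendered + body
```

In this sketch the values travel inside the rendered HTML rather than in global variables, so a later step would still have to move them from those marker elements to their final places in the page.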