HTML.create_element handles void elements incorrectly

delan commented 3 months ago

Void elements only have a start tag; end tags must not be specified for void elements.
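
As a rough illustration of that rule, here is a minimal standalone Lua sketch of how a serializer could decide whether to emit an end tag. It is not soupault's or LambdaSoup's actual implementation. The function name serialize_empty_element is hypothetical, and the table lists the void elements from the current HTML Living Standard.

```lua
-- Void elements as listed in the HTML Living Standard.
local VOID = {
  area = true, base = true, br = true, col = true, embed = true,
  hr = true, img = true, input = true, link = true, meta = true,
  source = true, track = true, wbr = true
}

-- Hypothetical helper: serialize an element that has no attributes
-- and no children. Void elements get a start tag only.
function serialize_empty_element(name)
  if VOID[name] then
    return "<" .. name .. ">"
  end
  return "<" .. name .. "></" .. name .. ">"
end
```

With this sketch, "br" and "img" serialize to a bare start tag, while a non-void name such as "div" still gets a matching end tag.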

The following yields <!DOCTYPE html><img></img><br></br>. For the <img> this should have no negative effects, but for the <br> this parses as two <br> elements per #parsing-main-inbody of the HTML spec, which treats an end tag named br as a <br> start tag.

HTML.delete_content(page)
HTML.append_root(page, HTML.create_element("img"))
HTML.append_root(page, HTML.create_element("br"))

One workaround to get <!DOCTYPE html><img><br> is to use HTML.parse():

HTML.delete_content(page)
HTML.append_root(page, HTML.parse("<img>"))
HTML.append_root(page, HTML.parse("<br>"))

In some situations, the HTML.parse() call needs to be wrapped in HTML.select_one(), because HTML.parse() returns a document and some functions expect an element node:

-- [ERROR] Could not process page: Expected an HTML element node, but found a document
-- img = HTML.parse("<img>")
img = HTML.select_one(HTML.parse("<img>"), "*")
HTML.set_attribute(img, "src", "https://soupault.app/images/soupault_logo.svg")
HTML.append_root(page, img)
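
The workaround above could be wrapped in a small helper so that plugins do not repeat the parse-and-select step. This is only a sketch: create_void_element is a hypothetical name, not part of soupault's plugin API, and it assumes that HTML.parse() and HTML.select_one() behave as shown in the snippets above. It selects by tag name rather than "*" so that it still picks the right node if the parsed document contains wrapper elements.

```lua
-- Hypothetical helper, not part of soupault's plugin API.
-- Builds a void element by parsing its start tag and then
-- extracting the element node from the resulting document.
function create_void_element(name)
  local doc = HTML.parse("<" .. name .. ">")
  return HTML.select_one(doc, name)
end
```

A plugin could then call create_void_element("img"), set attributes on the result with HTML.set_attribute(), and pass it to HTML.append_root(page, ...) as in the example above.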

dmbaturin commented 2 months ago

I couldn't reproduce this issue with 4.10.

[dmbaturin@alcor ~/d/t/brtest]$ soupault --version
soupault 4.10.0

Copyright 2024 Daniil Baturin et al.
soupault is free software distributed under the MIT license.
Visit https://www.soupault.app for news and documentation.

Compiled with OCaml 4.14.2

[dmbaturin@alcor ~/d/t/brtest]$ cat soupault.toml 

# To learn about configuring soupault, visit https://www.soupault.app/reference-manual

[settings]
  # Soupault version that the config was written/generated for
  # Trying to process this config with an older version will result in an error message
  soupault_version = "4.10.0"

  # Stop on page processing errors?
  strict = true

  # Display progress?
  verbose = true

  # Display detailed debug output?
  debug = false

  # Where input files (pages and assets) are stored.
  site_dir = "site"

  # Where the output goes
  build_dir = "build"

  # Files inside the site/ directory can be treated as pages or static assets,
  # depending on the extension.
  #
  # Files with extensions from this list are considered pages and processed.
  # All other files are copied to build/ unchanged.
  #
  # Note that for formats other than HTML, you need to specify an external program
  # for converting them to HTML (see below).
  page_file_extensions = ["htm", "html", "md", "rst", "adoc"]

  # By default, soupault uses "clean URLs",
  # that is, $site_dir/page.html is converted to $build_dir/page/index.html
  # You can make it produce $build_dir/page.html instead by changing this option to false
  clean_urls = true

  # If you set clean_urls=false,
  # file names with ".html" and ".htm" extensions are left unchanged.
  keep_extensions = ["html", "htm"]

  # All other extensions (".md", ".rst"...) are replaced, by default with ".html"
  default_extension = "html"

  # Page files with these extensions are ignored.
  ignore_extensions = ["draft"]

  # Soupault can work as a website generator or an HTML processor.
  #
  # In the "website generator" mode, it considers files in site/ page bodies
  # and inserts them into the empty page template stored in templates/main.html
  #
  # Setting this option to false switches it to the "HTML processor" mode
  # when it considers every file in site/ a complete page and only runs it through widgets/plugins.
  generator_mode = true

  # Files that contain an <html> element are considered complete pages rather than page bodies,
  # even in the "website generator" mode.
  # This allows you to use a unique layout for some pages and still have them processed by widgets.
  complete_page_selector = "html"

  # Website generator mode requires a page template (an empty page to insert a page body into).
  # If you use "generator_mode = false", this file is not required.
  default_template_file = "templates/main.html"

  # Page content is inserted into a certain element of the page template.
  # This option is a CSS selector that is used for locating that element.
  # By default the content is inserted into the <body>
  default_content_selector = "body"

  # You can choose where exactly to insert the content in its parent element.
  # The default is append_child, but there are more, including prepend_child and replace_content
  default_content_action = "append_child"

  # If a page already has a document type declaration, keep the declaration
  keep_doctype = true

  # If a page does not have a document type declaration, force it to HTML5
  # With keep_doctype=false, soupault will replace existing declarations with it too
  doctype = "<!DOCTYPE html>"

  # Insert whitespace into HTML for better readability
  # When set to false, the original whitespace (if any) will be preserved as is
  pretty_print_html = true

  # Plugins can be either automatically discovered or loaded explicitly.
  # By default discovery is enabled and the place where soupault is looking is the plugins/ subdirectory
  # in your project.
  # E.g., a file at plugins/my-plugin.lua will be registered as a widget named "my-plugin".
  plugin_discovery = true
  plugin_dirs = ["plugins"]

  # Soupault can cache outputs of external programs
  # (page preprocessors and preprocess_element widget commands).
  # It's disabled by default but you can enable it and configure the cache directory name/path
  caching = false
  cache_dir = ".soupault-cache"

  # Soupault supports a variety of page source character encodings,
  # the default encoding is UTF-8
  page_character_encoding = "utf-8"

# It is possible to store pages in any format if you have a program
# that converts it to HTML and writes it to standard output.
# Example:
#[preprocessors]
#  md = "cmark --unsafe --smart"
#  adoc = "asciidoctor -o -"

# Pages can be further processed with "widgets"

# Takes the content of the first <h1> and inserts it into the <title>
[widgets.page-title]
  widget = "title"
  selector = "h1"
  default = "My Homepage"
  append = " &mdash; My Homepage"

  # Insert a <title> in a page if it doesn't have one already.
  # By default soupault assumes if it's missing, you don't want it.
  force = false

# Inserts a generator meta tag in the page <head>
# Just for demonstration, feel free to remove
[widgets.generator-meta]
  widget = "insert_html"
  html = '<meta name="generator" content="soupault">'
  selector = "head"

# <blink> elements are evil, delete them all
[widgets.no-blink]
  widget = "delete_element"
  selector = "blink"

  # By default this widget deletes all elements matching the selector,
  # but you can set this option to false to delete just the first one
  delete_all = true

[widgets.test]
  widget = "test"

[dmbaturin@alcor ~/d/t/brtest]$ cat plugins/test.lua 
HTML.delete_content(page)
HTML.append_root(page, HTML.create_element("img"))
HTML.append_root(page, HTML.create_element("br"))
[dmbaturin@alcor ~/d/t/brtest]$ soupault 
[INFO] Starting soupault 4.10.0 in website generator mode
[INFO] Loading plugins
[INFO] Loading widgets
[INFO] Loading hooks
[INFO] Starting website build
[INFO] Processing page site/index.html
[INFO] Using the default template for page site/index.html
[INFO] Processing widget generator-meta on page site/index.html
[INFO] Processing widget page-title on page site/index.html
[INFO] Processing widget test on page site/index.html
[INFO] Processing widget no-blink on page site/index.html
[INFO] Writing generated page to build/index.html

[dmbaturin@alcor ~/d/t/brtest]$ cat build/index.html 
<!DOCTYPE html>
<img><br>

dmbaturin commented 2 months ago

Hmm, one idea: does the original page have a doctype that allows void elements, like <!DOCTYPE html>? The doctype does affect the parsing and rendering mode selection in Markup.ml/LambdaSoup.

PataphysicalSociety / soupault

HTML.create_element handles void elements incorrectly #66