Meta/Header Data Strips

sgtcoder commented 7 months ago

This issue reports either a bug or a gap in my knowledge of DOMPurify.

Background & Context

The head/meta information gets stripped out. Here is an example snippet (yes, I was whitelisting pretty much everything to see whether I could stop the content from being stripped). I looked through the source code, researched the problem, and couldn't find anything that worked. Everything works fine without DOMPurify.

const WHITELISTED_ATTR = [
  "content",
  "datetime",
  "itemprop",
  "name",
  "property",
  "type",
  "id",
  "class",
];

const WHITELISTED_TAGS = [
  "iframe",
  "video",

  "time",
  "meta",
  "head",
  "title",
  "script", // application/ld+json

  "article",
  "span",
  "link",

  "annotation-xml",
  "audio",
  "colgroup",
  "desc",
  "foreignobject",
  //"head",
  //"iframe",
  "math",
  "mi",
  "mn",
  "mo",
  "ms",
  "mtext",
  "noembed",
  "noframes",
  "noscript",
  "plaintext",
  //"script",
  "style",
  "svg",
  "template",
  "thead",
  //"title",
  //"video",
  "xmp",

  "p",
  "#text",
];

const domPurifyOptions = {
  ADD_ATTR: WHITELISTED_ATTR,
  ADD_TAGS: WHITELISTED_TAGS,
  WHOLE_DOCUMENT: true,
  SANITIZE_DOM: false,
};

Bug

Input

The HTML passed to DOMPurify is the contents of this page: https://techcrunch.com/2024/01/22/germanys-instagrid-which-uses-software-to-supercharge-portable-batteries-raises-95m

Given output

The output given by DOMPurify (trimmed):

\n\n<div role=\"presentation\" style=\"height: 0; overflow: hidden;\">\n\t<div class=\"premium-content__logo\">\n\t<div class=\"logo\">\n\t\t<a href=\"/\">\n\t\t\t<svg style=\"display: block;\" version=\"1.1\" viewBox=\"0 0 60 30\" height=\"15px\" width=\"30px\">\n\t\t\t\t<title>TechCrunch</title>\n\n\t\t\t\t<defs>\n\t\t\t\t\t<linearGradient id=\"tc-gradient\" y2=\"0.571986607%\" x2=\"0.571986607%\" y1=\"100%\" x1=\"100%\">\n\t\t\t\t\t\t<stop offset=\"0%\" stop-color=\"#00D301\"></stop>\n\t\t\t\t\t\t<stop offset=\"50%\" stop-color=\"#36C275\"></stop>\n\t\t\t\t\t\t<stop offset=\"100%\" stop-color=\"#00A562\"></stop>\n\t\t\t\t\t</linearGradient>\n\t\t\t\t\t<mask height=\"29.5\" width=\"59.5\" y=\"0\" x=\"0\" maskUnits=\"userSpaceOnUse\"

Expected output

The head section with the author meta tags preserved, for example:

<meta property="article:published_time" content="2024-01-19T00:53+00:00">

Feature

I was hoping to preserve the meta tags so that Readability.js can parse the article's meta information when the sanitized HTML is passed to it.

cure53 commented 7 months ago

Have you tried adding the FORCE_BODY config option, by any chance?
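
For anyone wanting to try that, a minimal sketch of adding the flag to an existing config might look like the following. `withForceBody` is a hypothetical helper name, not part of DOMPurify, and the thread does not confirm whether this option actually preserves the head meta tags for the input above.

```javascript
// Hypothetical helper: returns a copy of a DOMPurify config object with
// FORCE_BODY enabled, leaving the original object untouched.
function withForceBody(options) {
  return { ...options, FORCE_BODY: true };
}

// Intended use (assumes a DOMPurify instance and the options object from
// the original post are in scope):
//   const clean = DOMPurify.sanitize(html, withForceBody(domPurifyOptions));
```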

sgtcoder commented 7 months ago

The solution was to use DOMPurify after parsing with Readability.js.
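
That order of operations could be sketched roughly as below. This is only an illustration under some assumptions: `extractThenSanitize` is a made-up helper name, and the `Readability` class and `DOMPurify` instance are passed in as arguments so the sketch does not assume a particular environment (browser or jsdom). It relies only on `new Readability(doc).parse()` and `DOMPurify.sanitize(html, config)`.

```javascript
// Sketch: run Readability on the untouched document first, so it can still
// read the <head> meta tags, then sanitize only the extracted article HTML.
function extractThenSanitize(doc, Readability, DOMPurify, sanitizeOptions = {}) {
  // Readability mutates the document it is given, so work on a clone.
  const article = new Readability(doc.cloneNode(true)).parse();

  // parse() returns null when no article could be extracted.
  if (article === null) {
    return null;
  }

  // Keep the metadata Readability collected; only the article body HTML
  // is passed through DOMPurify.
  return {
    ...article,
    content: DOMPurify.sanitize(article.content, sanitizeOptions),
  };
}
```

Note that only `content` is sanitized here; plain-text fields such as the title still need normal escaping wherever they are rendered.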
