adobe / helix-html-pipeline

A library for rendering the html response in Helix3.
https://www.hlx.live/
Apache License 2.0
13 stars 16 forks source link

Icons no longer working after #704 change in icon regex #708

Closed ahmed-musallam closed 2 weeks ago

ahmed-musallam commented 2 weeks ago

Description After the fix in https://github.com/adobe/helix-html-pipeline/pull/704 what used to be considered icons in our site, is no longer considered icons.

See example page here: https://main--exlm-prod--adobe-experience-league.hlx.page/en/docs/experience-manager-cloud-service/content/overview/introduction

This is the source markup:

https://experienceleague.adobe.com/api/action/convert/en/docs/experience-manager-cloud-service/content/overview/introduction

see that header is:

An Introduction to Adobe Experience Manager as a Cloud Service :headding-anchor:an-introduction-to-adobe-experience-manager-as-a-cloud-service

The issue seems to be the lack of space after the trailing :

But this used to work before the change and now it does not. It impacts a large set of pages because we use that icon to indicate anchors - this is a separate topic entirely but we did depend on how the pipeline generated icons. this would also fail with other icons if there is a preceding space but not a trailing one..

royfielding commented 2 weeks ago

FYI, the old regex (from 2023) was

const REGEXP_ICON = /(?<!(?:https?|urn)[^\s]*):(#?[a-z_-]+[a-z\d]*):/gi;

and the current one is

const REGEXP_ICON = /(?<!https?:\/\/[^\s]*):(#?[a-z_\d-]+):(?!\w)/gi;

which I think has multiple problems.

  1. The first part in both is allowing an optional angle bracket. Why?
  2. The second part in old is trying to exclude URIs using an expensive and incorrect glob.
  3. The second part in current is optionally excluding URIs that don't start with h.
  4. The third part is supposed to be matching the icon identifier if it starts with #.
  5. The last part of current allows a not-word character?
tripodsan commented 2 weeks ago

The first part in both is allowing an optional angle bracket. Why?

no, this is a negative lookbehind.

The second part in old is trying to exclude URIs using an expensive and incorrect glob.

no, just a cheap way of preventing url like strings

The second part in current is optionally excluding URIs that don't start with h.

same as above, preventing http(s):

The third part is supposed to be matching the icon identifier if it starts with #.

no. the # is optional. this was once and indicator for sprite sheets.

The last part of current allows a not-word character?

it's a negative look ahead. specifying that the pattern must not be followed by a word character. (this is what broke the site)

if you look at the tests, you see why the regexp is (and was) the way it is currently:

https://github.com/adobe/helix-html-pipeline/blob/main/test/fixtures/content/icons.md https://github.com/adobe/helix-html-pipeline/blob/main/test/fixtures/content/icons-ignored.md


anyways, since we reverted the change, we can close this issue for now.