Parser of html entities is munging too much input - returning errors when & is part of normal text

jhrcek commented 2 years ago

Hello, thank you for working on this package, our NeuroBlu analytics platform we're building at Holmusk is making heavy use of it and so far we didn't encounter any major issues. Only today we encountered an issue with the markdown parsing logic.

I recreated the issue based off of one of your examples in ellie: https://ellie-app.com/hcCTpkbfLM7a1

The issue is that whenever there's & character within some html block, the parser is always trying to parse it as HTML entity even in cases where it's just part of normal text. As a result we got an error like this in one of the places relying on markdown content (viz ellie above):

Problem at row 18 No entity named "& Resources - In-progress notification for downloads in Code Studio - Added new generic drugs to primary lookup table - Library updates and `filter_patient_by_visit_type` to support new `service_types` from XXX asdfa ;" found.

Do you think it would be possible to tweak the entity parser so it's more strict and doesn't consider normal text as part of entity?

jhrcek commented 2 years ago

Gentle ping @dillonkearns Could you please comment what you think about possible ways to fix this? I could invest some time this weekend to implementing a fix. I'd just like to know your opinion on acceptable ways to fix this before I invest time into implementing it. My plan would be to lookup some html entity specification and tweak the parser to be more picky about which characters to munge.

dillonkearns commented 2 years ago

Hello @jhrcek, I believe the problem here is that your ampersand is intended to be a literal ampersand rather than an escape character, so it should be & rather than & (since it's in the context of raw HTML syntax).

When you do Html.p [] [ Html.text "Milk & Honey" ] the issue doesn't come up because Elm will escape it automatically. But in this case, you have control over the raw string which includes some HTML so it needs to be HTML escaped.

jhrcek commented 2 years ago

Your suggestion sounded promising, so I tried it in ellie: https://ellie-app.com/hW7tKmSVHcga1

Unfortunately both of these ways end with parser error

"<div>Milk & Honey</div>" gives error: Problem at row 1 No entity named "& Honey;" found. "<div>Milk & Honey</div>" gives error: Ran into a oneOf with no possibilities!

The second error coming from here https://github.com/dillonkearns/elm-markdown/blob/a483add6556c469d6d0e6d4a191593a5000114fd/src/Markdown/Html.elm#L94

Can you think of another way to get & rendered on a webpage when it's part of inline html block, or do you think that parser will need a fix?

bburdette commented 2 years ago

@jhrcek the defaultHtmlRenderer implements html parsing like this.

oneOf [], means no html keywords are implemented by default. See here for more.

In elm-markdown, the html-ish tags are intended to be a sort of extension mechanism to markdown. So you might add a tag <myvideo> that would provide a special renderer for embedded videos.

If you're looking for something to handle the full set of standard html tags, that's not provided by this lib. But possibly you could turn something up by searching github?

jhrcek commented 2 years ago

@bburdette thank you. I should have read through the docs more thorougly before reporting this issue on behalf of a colleague. Using & in combination with defaultHtmlRenderer tweaked by a list renderers for supported html tags made my example work. I'm closing this issue as I managed to make our usecase work.

dillonkearns / elm-markdown

Parser of html entities is munging too much input - returning errors when & is part of normal text #118