RobertDober / earmark_parser

The Markdown to AST part of Earmark.
Apache License 2.0

Parse markdown within html tags #119

Closed LostKobrakai closed 2 months ago

LostKobrakai commented 2 years ago

https://babelmark.github.io/?text=%3Csmall%3ESome+%5BGoogle%5D(https%3A%2F%2Fwww.google.com%2F).%3C%2Fsmall%3E

Markdown within html tags doesn't seem to be parsed/converted.

gitneko commented 2 months ago

This feature is needed for collapsible content (<details>), which GitHub Flavored Markdown, for example, supports.

For example, the following is parsed as a single string and ends up as raw text in, for instance, ex_doc:

<details>
<summary>Click to expand</summary>

### Header

More markdown, for example `this`.
</details>

IEx:

iex(1)> string = "<details>\n<summary>Click to expand</summary>\n\n### Header\n\nMore markdown, for example `this`.\n</details>\n"
iex(2)> EarmarkParser.as_ast(string, gfm: true)
{:ok,
 [
   {"details", [],
    ["<summary>Click to expand</summary>\n\n### Header\n\nMore markdown, for example `this`."],
    %{verbatim: true}}
 ], []}
iex(3)> EarmarkParser.as_ast(string)
{:ok,
 [
   {"details", [],
    ["<summary>Click to expand</summary>\n\n### Header\n\nMore markdown, for example `this`."],
    %{verbatim: true}}
 ], []}

In the HTML rendered by ex_doc: [image]

What would need to be done to get this feature supported?

RobertDober commented 2 months ago

IIUC, as an intermediate solution we could also parse lines of the form

<(.*?)>(.*)</\1> as HTML. Would that be enough to fix this? I guess I can do this quickly. I am a little bit blocked in Germany (not far from you actually, Chiemsee, LOL) waiting to get into Austria because of Boris, but maybe within a week or so.

BTW, all my best wishes to the Elixir community in Poland!
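A minimal sketch of what that single-line pattern would and would not catch. The module `OneLineHtmlSketch` and its `split/1` function are hypothetical and not part of EarmarkParser; only the standard `Regex` module is used, and the pattern is anchored to a whole line as an assumption.

```elixir
# Hypothetical helper, not part of EarmarkParser: it only shows what the
# single-line pattern <(.*?)>(.*)</\1> matches when applied to one line.
defmodule OneLineHtmlSketch do
  # Returns {:ok, tag, inner} for a one-line element, :nomatch otherwise.
  def split(line) do
    # Assumption: the pattern is anchored to the start and end of the line.
    case Regex.run(~r{^<(.*?)>(.*)</\1>$}, line) do
      [_whole, tag, inner] -> {:ok, tag, inner}
      nil -> :nomatch
    end
  end
end

# The one-line example from the original report is split into tag and content.
small = OneLineHtmlSketch.split("<small>Some [Google](https://www.google.com/).</small>")
# small == {:ok, "small", "Some [Google](https://www.google.com/)."}

# An opening tag alone on its line, as in the <details> example, does not match.
details = OneLineHtmlSketch.split("<details>")
# details == :nomatch
```

The inner string would still have to be handed to the inline Markdown parser. Two limits of the pattern as written: the backreference requires the closing tag to repeat everything between the opening angle brackets, so an element with attributes does not match, and a per-line pattern does not cover an element that spans several lines.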

RobertDober commented 2 months ago

Sorry, the OP was from Augsburg. @gitneko does not reveal her location, which I respect of course.

gitneko commented 2 months ago

> IIUC, as an intermediate solution we could also parse lines of the form
>
> <(.*?)>(.*)</\1> as HTML. Would that be enough to fix this? I guess I can do this quickly. I am a little bit blocked in Germany (not far from you actually, Chiemsee, LOL) waiting to get into Austria because of Boris, but maybe within a week or so.
>
> BTW, all my best wishes to the Elixir community in Poland!

I don't really mind how it's implemented as long as we get to a good solution. :)

FWIW I think the proper way is to modify the parser and use an allowlist or denylist to control which HTML tags actually get their content parsed as Markdown (e.g. script and style shouldn't be processed) at this stage: https://github.com/RobertDober/earmark_parser/blob/c4f115ce38154993728e675628c3db3e2b617e83/lib/earmark_parser/parser.ex#L307-L332
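A rough sketch of the list-based idea, done as a pass over the returned AST rather than inside the parser, so it is not how EarmarkParser works and not the change proposed at the linked lines. The module `HtmlContentPolicySketch`, its functions, and the `parse_fun` argument are all hypothetical; the node shape `{tag, attrs, content, meta}` and the `verbatim: true` marker are taken from the IEx output earlier in this thread.

```elixir
# Hypothetical sketch, not EarmarkParser's API: decide per tag whether the
# content of a verbatim HTML node is handed back to a Markdown parser.
defmodule HtmlContentPolicySketch do
  # Assumed deny list; script and style are the tags named in the comment above.
  @raw_tags ["script", "style"]

  def parse_inner?(tag), do: tag not in @raw_tags

  # Walks the top-level nodes only. `parse_fun` is any function that turns a
  # string into a list of AST nodes.
  def reparse(ast, parse_fun) when is_list(ast) do
    Enum.map(ast, &reparse_node(&1, parse_fun))
  end

  # Verbatim nodes of allowed tags get their string content re-parsed.
  defp reparse_node({tag, attrs, content, %{verbatim: true} = meta}, parse_fun) do
    if parse_inner?(tag) do
      inner = content |> Enum.filter(&is_binary/1) |> Enum.join("\n")
      {tag, attrs, parse_fun.(inner), Map.delete(meta, :verbatim)}
    else
      {tag, attrs, content, meta}
    end
  end

  # Everything else (strings, non-verbatim nodes) is passed through unchanged.
  defp reparse_node(other, _parse_fun), do: other
end

# Stand-in for a real Markdown parser, so the sketch runs on its own.
stub = fn text -> [{"p", [], [text], %{}}] end

allowed = HtmlContentPolicySketch.reparse([{"details", [], ["### Header"], %{verbatim: true}}], stub)
# allowed == [{"details", [], [{"p", [], ["### Header"], %{}}], %{}}]

denied = HtmlContentPolicySketch.reparse([{"script", [], ["alert(1)"], %{verbatim: true}}], stub)
# denied == [{"script", [], ["alert(1)"], %{verbatim: true}}]
```

In real use `parse_fun` could wrap `EarmarkParser.as_ast/2` and pick the AST out of its result tuple. Nested HTML inside the content, such as the `<summary>` line in the example above, would still need separate handling, which is one reason doing this inside the parser may be the better place.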

RobertDober commented 2 months ago

Hmm, I guess I need to look at the code, as I do not really understand what you want if it is not parsing <(.*?)>(.*)</\1>.

Could you maybe make a PR with some failing test cases?

RobertDober commented 2 months ago

For now I understand the following:

  1. You need the markdown parsed inside HTML
  2. You need the <(.*?)>(.*)</\1> parsed correctly too, and that is probably implied by 1 anyway

I am not sure about the whitelisting, though.

RobertDober commented 2 months ago

OK, I have looked into it and I would prefer not to implement this before 1.5. You say this feature is needed, but for whom is it needed? Do we have this kind of HTML in ex_doc frequently?

RobertDober commented 2 months ago

Sorry, I cannot do this in the current version in any reasonable time.