elixir-makeup / makeup_html

HTML lexer for Makeup
MIT License
14 stars 5 forks source link

makeup_html produces different output when compared to Pygments #3

Open lkarthee opened 2 years ago

lkarthee commented 2 years ago

makeup_html produces different tokens when compared to Pygments.

Makeup styles recognised HTML tags with `k` (keyword) rather than with `nt` (name_tag).

<!-- produced by pygments site -->
<pre>
  <span></span>
  <span class="p">&lt;</span>
  <span class="nt">h1</span>
  <span class="na">alt</span>
  <span class="o">=</span>
  <span class="s">"blah"</span>
  <span class="p">&gt;</span>This is heading
  <span class="p">&lt;/</span>
  <span class="nt">h1</span>
  <span class="p">&gt;</span>
</pre>

<!-- produced by makeup_html, data-group-ids deleted-->
<pre class="highlight">
  <code>
    <span class="p">&lt;</span>
    <span class="k">h1</span> <!-- "k" instead of "nt" -->
    <span class="w"> </span>
    <span class="na">alt</span>
    <span class="o">=</span>
    <span class="s">&quot;blah&quot;</span>
    <span class="p">&gt;</span>
    <span class="s">This is a heading</span>
    <span class="p">&lt;/</span>
    <span class="k">h1</span> <!-- "k" instead of "nt" -->
    <span class="p">&gt;</span>
  </code>
</pre>

Makeup styles unrecognised HTML tags with `s` (string) rather than with `nt` (name_tag). It also styles unrecognised attributes with `s` (string) rather than with `na` (name_attribute).

<!-- produced by pygments site -->
<pre>
  <span></span>
  <span class="nt">&lt;.alert</span>
  <span class="na">primary=</span>
  <span class="s">"true"</span>
  <span class="nt">&gt;</span>This is heading 1
  <span class="nt">&lt;/.alert&gt;</span>
</pre>

<!-- produced by makeup_html , data-group-ids deleted -->
<pre class="highlight">
  <code>
    <span class="p">&lt;</span>
    <span class="s">.alert</span><!-- "s" instead of "nt" -->
    <span class="w"> </span>
    <span class="s">primary</span><!-- "s" instead of "na" -->
    <span class="o">=</span>
    <span class="s">&quot;true&quot;</span>
    <span class="p">&gt;</span>
    <span class="s">This is a heading</span>
    <span class="p">&lt;/</span>
    <span class="s">.alert</span><!-- "s" instead of "nt" -->
    <span class="p">&gt;</span>
  </code>
</pre>

Here are my queries after using makeup_html:

josevalim commented 2 years ago

Hi @lkarthee, feel free to send PRs for those changes. I think aligning with pygments can be positive. This is mostly a community project, so if it can be made more useful, contributions are welcome!

javiergarea commented 2 years ago

Hi @lkarthee, thank you so much for providing your feedback. ❤️

This project has been experimental, as it was developed with the single objective of implementing an HTML lexer for makeup. That's one of the reasons I decided to follow HTML5 syntax, as it seemed easier at first. As @josevalim said, any contribution is gladly welcome, and the three concerns you raised (i.e., aligning with Pygments, weakening the syntax, and deactivation of data-group-ids via flags) are quite interesting features for the project IMHO.

lkarthee commented 2 years ago

Thank you @josevalim and @javiergarea - I appreciate your effort in maintaining this project.

I am using a fork with changes I made in a project (a Bootstrap library for Phoenix components). I am fixing some bugs with attribute highlighting — I will send a PR once those changes are stable (maybe in a week).

For now I am using it with the following changes:

def not_keywords_stringify(tokens) do
    not_keywords_stringify(tokens, {0, []}, [])
  end

  def skip_whitespace(tokens, token) do
    queue = 
      Enum.reduce_while(tokens, [token], fn t, acc -> 
        case t do
          {:string , tup, list} ->
            {:halt, acc ++ [{:name_tag, tup, list}]}

          {:keyword, tup, list} ->
            {:halt, acc ++ [{:name_tag, tup, list}]}

          _ ->
            {:cont, acc ++ [t]}

        end
      end)
    {_, tokens} = Enum.split(tokens, length(queue) - 1)
    {queue, tokens}
  end

  def not_keywords_stringify(
    [{:punctuation, _, "<"} = token | tokens],
    {id, []} = queue_tuple,
    result) do
    {queue, tokens} = skip_whitespace(tokens, token)
    not_keywords_stringify(tokens, {id + 1, []}, result ++ queue)
  end

  def not_keywords_stringify(
    [{:punctuation, _, "<"} = token | tokens],
    {id, orig_queue} = queue_tuple,
    result) do
    {queue, tokens} = skip_whitespace(tokens, token)
    not_keywords_stringify(tokens, {id + 1, []}, result ++ orig_queue ++ queue)
  end

  def not_keywords_stringify(
    [{:punctuation, _, "</"} = token | tokens],
    {id, []} = queue_tuple,
    result) do
    {queue, tokens} = skip_whitespace(tokens, token)
    not_keywords_stringify(tokens, {id + 1, []}, result ++ queue)
  end

  def not_keywords_stringify(
    [{:punctuation, _, "</"} = token | tokens],
    {id, orig_queue} = queue_tuple,
    result) do
    {queue, tokens} = skip_whitespace(tokens, token)
    not_keywords_stringify(tokens, {id + 1, []}, result ++ orig_queue ++ queue)
  end

  def not_keywords_stringify(
    [{:punctuation, _, ">"} = token | tokens],
    {id, queue} = queue_tuple,
    result) do
    {queue, _} =
      Enum.reduce(queue, {[], nil},fn {type, mid, data} = curr, {acc, prev} -> 
        prev_type = 
          case prev do
            nil ->
              nil

            {prev_type, _, _} ->
              prev_type

          end

        cond do
          prev_type == nil and type == :string ->
            {acc ++ [{:name_attribute, mid, data}], curr}

          prev_type == :whitespace and type == :string ->
            {acc ++ [{:name_attribute, mid, data}], curr}

          true ->
            {acc ++ [curr], curr}

        end
      end)
    not_keywords_stringify(tokens, {id, []}, result ++ queue ++ [token])
  end

  def not_keywords_stringify([token | tokens] , {id, queue} = queue_tuple, result) do
    not_keywords_stringify(tokens, {id, queue ++ [token]}, result)
  end

  def not_keywords_stringify([], {_id, queue}, result),
    do: result ++ queue

  @impl Makeup.Lexer
  def postprocess(tokens, _opts \\ []) do
    tokens
    |> char_stringify()
    |> commentify()
    |> keyword_stringify()
    |> attributify()
    |> element_stringify()
    |> not_keywords_stringify()
  end