JuliaWeb / Gumbo.jl

Julia wrapper around Google's gumbo C library for parsing HTML

Gumbo strips away template tags #86

Closed: essenciary closed this issue 1 year ago

essenciary commented 4 years ago

Example:

julia> parsehtml("<template v-slot:avatar><q-icn name='moo' /></template>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML>
  <head></head>
  <body></body>
</HTML>

vs

julia> parsehtml("<templatee v-slot:avatar><q-icn name='moo' /></templatee>")
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML>
  <head></head>
  <body>
    <templatee v-slot:avatar="">
      <q-icn name="moo"></q-icn>
    </templatee>
  </body>
</HTML>

Gumbo: 0.8.0

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, icelake-client)

Thanks

essenciary commented 4 years ago

Can anybody comment on this please? It's a major pain: when building HTML frontends, the <template> tag is key. I've tried a million things to keep it in the DOM, but nothing seems to work. I'm not sure this is normal behaviour: it's true that browsers don't render <template> on the page, but it is still in the DOM.

Alternatively, maybe there could be an option/config to replace or remove :template from the TAGS collection. That would make the tag unknown, so the parser would just leave it in place.

Any advice would be appreciated. Thanks!

aviks commented 4 years ago

Can you test this with the C library, and see what the behaviour is there? In general, we're only a very thin layer around the C library. [Edit] Not needed any more. Results of my analysis are below.

aviks commented 4 years ago

This also looks relevant: https://github.com/google/gumbo-parser/blob/aa91b27b02c0c80c482e24348a457ed7c3c088e0/src/gumbo.h#L304

aviks commented 4 years ago

Ok, so having investigated this, what we need is something along the lines of #75, but for TemplateNodes.

In CGumbo.jl we need to define const TEMPLATE = Int32(6). Then pass an optional preserve_template argument through the parsing options, like in #75. Finally, in the load_node function in conversion.jl, check the node type and this argument to decide whether to load the node. The actual node type (according to the gumbo.h file linked above) should be CGumbo.Node{CGumbo.Element}.
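
The type check described above could look roughly like the sketch below. It is a minimal, self-contained illustration and not Gumbo.jl's actual code: it does not call libgumbo, the function name node_kind is hypothetical, and the preserve_whitespace keyword is only assumed here as an example of an analogous existing option. The constant values follow the order of the GumboNodeType enum in gumbo.h, with TEMPLATE as the value proposed in the comment.

```julia
# Hypothetical sketch of the node-type dispatch; not the real Gumbo.jl internals.

# Node-type tags, following the order of the GumboNodeType enum in gumbo.h.
const ELEMENT    = Int32(1)
const TEXT       = Int32(2)
const WHITESPACE = Int32(5)
const TEMPLATE   = Int32(6)  # the new constant proposed for CGumbo.jl

# Decide how a child node with tag `gntype` should be handled.
# Returns :element, :text, or :skip. `node_kind` is an illustrative name;
# the real check would live inside load_node in conversion.jl.
function node_kind(gntype::Int32;
                   preserve_whitespace::Bool=false,
                   preserve_template::Bool=false)
    if gntype == ELEMENT
        return :element
    elseif gntype == TEMPLATE
        # A template node has the same layout as an element node,
        # so it can be loaded as one when the caller asks for it.
        return preserve_template ? :element : :skip
    elseif gntype == TEXT
        return :text
    elseif gntype == WHITESPACE
        return preserve_whitespace ? :text : :skip
    else
        # Simplification: every other node type is dropped in this sketch.
        return :skip
    end
end

# Default behaviour drops the template; opting in keeps it as an element.
println(node_kind(TEMPLATE))
println(node_kind(TEMPLATE; preserve_template=true))
```

Defaulting preserve_template to false keeps the current behaviour for existing callers, so template nodes would only appear in the tree for code that opts in.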

essenciary commented 4 years ago

Oh, awesome findings, thank you @aviks !

clarkevans commented 4 years ago

Is this still an issue? I need to be using template tags.

aviks commented 4 years ago

Yes, I don't think we've got a PR for this, and I've not had the time to implement this myself. The strategy described in https://github.com/JuliaWeb/Gumbo.jl/issues/86#issuecomment-670465114 is what needs doing.

hhaensel commented 1 year ago

@aviks your hint to #75 was perfect. I just submitted a PR.