JuliaWeb / Gumbo.jl

Julia wrapper around Google's gumbo C library for parsing HTML

Information about node position in parent #73

Closed drrajeshtalluri closed 4 years ago

drrajeshtalluri commented 4 years ago

Hi, I was trying to use Gumbo.jl to manipulate the DOM by changing the tags and content of nodes, but I am confused by some of the behavior I see.

doc = parsehtml("""
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <span>this is a span 1</span>
        <div>
        <h1>this is a heading</h1>
        <span>this is a span 2</span>
        </div>
    </body>
</html>
""");

Suppose I want to replace the first <span> node with an <abc> node:

julia> elem = doc.root[2][1]
HTMLElement{:span}:
<span>
  this is a span 1
</span>

The following does not work; the document is unchanged afterwards:

julia> elem = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)
HTMLElement{:abc}:
<abc>
  this is a span 1
</abc>

julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
  <head>
    <title>
      Title
    </title>
  </head>
  <body>
    <span>
      this is a span 1
    </span>
    <div>
      <h1>
        this is a heading
      </h1>
      <span>
        this is a span 2
      </span>
    </div>
  </body>
</HTML>

For it to work, I have to assign the new node into the parent's children, which means I have to know the position of the node within its parent.

elem.parent.children[1] = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)

julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
  <head>
    <title>
      Title
    </title>
  </head>
  <body>
    <abc>
      this is a span 1
    </abc>
    <div>
      <h1>
        this is a heading
      </h1>
      <span>
        this is a span 2
      </span>
    </div>
  </body>
</HTML>
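
The hard-coded index in the assignment above can be looked up instead. The following is a minimal sketch, not part of Gumbo.jl: `replace_tag!` is a hypothetical helper name, and it assumes that node types are mutable and carry a `parent` field, as the constructor call above suggests. It finds the node among its siblings by identity, builds the replacement, and re-points the moved children at their new parent.

```julia
using Gumbo

# Hypothetical helper (not part of Gumbo.jl): replace `elem` in its parent
# with a new element that has tag `newtag` and the same children/attributes.
function replace_tag!(elem::HTMLElement, newtag::Symbol)
    parent = elem.parent
    parent isa HTMLElement || error("element has no parent element")
    # Locate elem among its siblings by identity (===), not by value.
    idx = findfirst(c -> c === elem, parent.children)
    idx === nothing && error("element not found among its parent's children")
    new = HTMLElement{newtag}(elem.children, parent, elem.attributes)
    # Assumption: child nodes have a mutable `parent` field.
    for child in new.children
        child.parent = new
    end
    parent.children[idx] = new
    return new
end
```

With the document above, `replace_tag!(doc.root[2][1], :abc)` would perform the same replacement without the caller needing to know the index.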

Is this behavior intended? I thought both the parent's children and the child node would point to the same location.
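
The difference between the two attempts can be reproduced without Gumbo. The sketch below uses a toy `ToyNode` type (a stand-in invented for illustration, not Gumbo's actual type) to show that assigning to a variable only rebinds the name, while assigning into the parent's vector changes the tree.

```julia
# Toy stand-in for a DOM node; not Gumbo's actual type.
mutable struct ToyNode
    tag::Symbol
    children::Vector{ToyNode}
end

body = ToyNode(:body, [ToyNode(:span, ToyNode[])])

# elem and body.children[1] initially refer to the same object.
elem = body.children[1]
@assert elem === body.children[1]

# Rebinding: the name elem now refers to a brand-new object.
# The vector slot still holds the original span.
elem = ToyNode(:abc, ToyNode[])
@assert body.children[1].tag == :span

# Mutating the vector is what actually changes the tree.
body.children[1] = elem
@assert body.children[1].tag == :abc
```

Because the tag appears as a type parameter (`HTMLElement{:span}`), a node with a different tag has to be a new object, so the replacement has to go through the parent's children in some form.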

Also, I see that the position within the parent is present as index_within_parent in the struct below. Would it be possible to expose this information for each node, in addition to parent, children, and attributes? With it, we could work around the issue above.

struct Node{T}
    gntype::Int32  # enum
    parent::Ptr{Node}
    index_within_parent::Csize_t
    parse_flags::Int32  # enum
    v::T
end
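
Until such a field is exposed, the position can be recovered on the Julia side during traversal. This is a sketch under assumptions: `each_with_index` is a hypothetical helper, not a Gumbo.jl function, and it reports 1-based Julia indices, whereas the C field is presumably zero-based.

```julia
using Gumbo

# Hypothetical helper (not part of Gumbo.jl): call f(parent, i, child) for
# every node below `node`, where i is the child's 1-based index in parent.
function each_with_index(f, node::HTMLElement)
    for (i, child) in enumerate(node.children)
        f(node, i, child)
        # Only elements have children to descend into.
        child isa HTMLElement && each_with_index(f, child)
    end
end

doc = parsehtml("<html><body><span>one</span><div><span>two</span></div></body></html>")

# Record the tag and position of every element in document order.
positions = Tuple{Symbol,Int}[]
each_with_index(doc.root) do parent, i, child
    child isa HTMLElement && push!(positions, (tag(child), i))
end
```

The callback receives the parent and the index together, so a replacement can be written as `parent.children[i] = ...` inside it.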

Please let me know your thoughts, or whether I am approaching this in entirely the wrong direction. Is there a more straightforward way to manipulate the nodes?

aviks commented 4 years ago

Hi @drrajeshtalluri, you'll probably have more luck getting answers to your questions on our Discourse or Slack channels. See here for more: https://julialang.org/community/

GitHub issues are useful when you are reporting a bug.