gawel / pyquery

A jquery-like library for python
http://pyquery.rtfd.org/

Entity bug #178

Open ArthaTi opened 6 years ago

ArthaTi commented 6 years ago

This is my third issue in a row, sorry if that's annoying. I understand it's not easy to make such a library, and I would help if I could.

Here's a simple script this issue occurs in:

from pyquery import PyQuery as pq

p = pq("<span>&lt;foo&gt;&lt;bar&gt;</span>")
print(p("span").html(), p("span").text())
p = pq("<span><b>&lt;foo&gt;</b>&lt;bar&gt;</span>")
print(p("span").html(), p("span").text())

Output:

<foo><bar> <foo><bar>
<b>&lt;foo&gt;</b>&lt;bar&gt; <foo> <bar>

while it should be

&lt;foo&gt;&lt;bar&gt; <foo><bar>
<b>&lt;foo&gt;</b>&lt;bar&gt; <foo><bar>

Basically, if there are entities at the beginning and end of the selected element, they get decoded even in the html() output, where they shouldn't be. What could be the reason? I tried to search for it, but I really don't understand the code. Thanks.

ArthaTi commented 6 years ago

Perhaps it's here:

if not children:
    return tag.text

tag.text is already decoded (its entities are resolved), while this function shouldn't return decoded text. Maybe just return it with the entities escaped again?

I would fix it, but I don't see any function in the project for encoding/decoding entities. How do I do it? I only know that there's one in the built-in html library.
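To make the idea concrete, here is a minimal sketch of re-escaping the text with the standard library's html.escape. The helper name inner_html_text_only is hypothetical and not part of pyquery, and returning an empty string for a missing text is an assumption of this sketch, not the library's behaviour.

```python
from html import escape


def inner_html_text_only(text):
    """Hypothetical helper (not pyquery API): escape the text of an
    element that has no child elements, for use as HTML output."""
    # An empty element has no text at all; this sketch assumes "" is wanted.
    if text is None:
        return ""
    # quote=False escapes only &, < and >, leaving quotes untouched,
    # which is enough for text content (as opposed to attribute values).
    return escape(text, quote=False)


# The decoded text that tag.text would hold for
# <span>&lt;foo&gt;&lt;bar&gt;</span>
decoded = "<foo><bar>"
print(inner_html_text_only(decoded))  # prints &lt;foo&gt;&lt;bar&gt;
```

Whether this is the right place to escape depends on how the rest of html() builds its result, so treat it as an illustration rather than a patch.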

gawel commented 6 years ago

I guess there's no such thing because lxml does that.
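A small sketch of what "the parser/serializer does that" means in practice. It uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that lxml, which exposes a similar API, behaves the same way here: the element's .text is decoded, and the escaping only comes back when the element is serialized.

```python
import xml.etree.ElementTree as ET

# Stand-in for lxml (assumption: lxml treats text the same way here).
el = ET.fromstring("<span>&lt;foo&gt;&lt;bar&gt;</span>")

# The parsed text is already decoded.
text = el.text
print(text)  # prints <foo><bar>

# Serializing the element escapes the text again.
serialized = ET.tostring(el, encoding="unicode")
print(serialized)  # prints <span>&lt;foo&gt;&lt;bar&gt;</span>
```

So returning .text directly skips the serializer, which would explain why the entities come out decoded for an element without children.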