gawel / pyquery

A jquery-like library for python
http://pyquery.rtfd.org/

.text() should exclude <script> tags #111

Closed: est closed this issue 8 years ago

est commented 9 years ago
>>> pyquery.PyQuery('<div>hello<script language="javascript">alert(0)</script>, world</div>').text()
'hello alert(0) , world'

Is this by design? If so, is there a way to exclude <script> tags when calling text()?

est commented 9 years ago

A workaround using lxml, in case anyone needs it:

' '.join(x.text for x in elem.xpath('.//*[not(self::script or self::style)]') if x.text)
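
One caveat with the one-liner above: it reads only each matched element's `.text`, so text that follows a child tag (which lxml stores in `.tail`) and the starting element's own text can be missed. A minimal sketch of an alternative, assuming the markup is parsed with `lxml.html`, selects text nodes directly and skips those inside `<script>` or `<style>`. The helper name `visible_text` is hypothetical and not part of lxml or pyquery.

```python
import lxml.html


def visible_text(html):
    """Return the text of an HTML fragment, skipping <script> and <style>.

    Hypothetical helper for illustration; not part of lxml or pyquery.
    """
    root = lxml.html.fromstring(html)
    # text() matches both .text and .tail strings; the predicate drops
    # any text node that has a <script> or <style> ancestor.
    parts = root.xpath('.//text()[not(ancestor::script or ancestor::style)]')
    return ''.join(parts)


print(visible_text(
    '<div>hello<script language="javascript">alert(0)</script>, world</div>'
))
# prints: hello, world
```

Joining with an empty string keeps the original spacing; use `' '.join(...)` instead if separate text nodes should always be split by a space.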
neumond commented 8 years ago

Why should it exclude scripts? What if I want script tag content? I guess it's better to remove scripts before getting text() in your case:

pq('script').remove()
pq.text()
est commented 8 years ago

@neumond well your reply made me speechless. Hope you enjoy the scripts and style declarations in your text.

Thanks for the solution though, works well.

neumond commented 8 years ago

You probably don't know that jQuery's text() method includes the contents of scripts and styles along with normal text.

As far as I know, jQuery does exclude scripts, but in a different case: when you assign innerHTML. That is intended to guarantee some degree of safety, since re-executing scripts by assigning innerHTML causes hard-to-catch bugs.

gawel commented 8 years ago

I agree that being able to retrieve scripts can be useful, for example to extract some JSON included in the HTML (who uses APIs?).