gawel / pyquery

A jquery-like library for python
http://pyquery.rtfd.org/

In version 1.4.0, newline characters are omitted when extracting text that contains newlines #181

Closed WindSoilder closed 6 years ago

WindSoilder commented 6 years ago

Hi, I found a difference in behavior between 1.4.0 and the previous version, 1.3.0. Take the following HTML as an example:

<code>import json
 j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print j['two']
</code>

and the following code, which prints the text content of the code element (I saved the HTML into a 'test2.html' file):

from pyquery import PyQuery as pq

with open('test2.html', 'rb') as f:
    c = f.read()
html = pq(c)
code = html('code')
print(code.eq(0).text())

In the new version 1.4.0, the output is collapsed onto a single line, without newlines (which did not happen in 1.3.0):

import json j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}') print j['two']

Can anyone tell me whether this is a new feature in version 1.4.0? Thanks very much :)

gawel commented 6 years ago

There's a major refactoring of .text() in 1.4 (that's why it's 1.4 and not 1.3x+1)

It looks like code is marked as an inline node: https://github.com/gawel/pyquery/blob/master/pyquery/text.py#L22

Maybe it should not be. You can try removing it.

Also, there are some tests that compare Firefox output to pyquery output. We could add a test to check whether there's a diff here.

WindSoilder commented 6 years ago

Thanks for the quick reply.

But sadly, removing it doesn't work :( Maybe I did not describe the issue clearly. What I mean is that the newlines in the element's text are omitted. In version 1.3.0, the output of the test code is:

import json
j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print j['two']

This retains the newline characters in the element's text. In version 1.4.0, the output is:

import json j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}') print j['two']

This omits the newlines in the text.

WindSoilder commented 6 years ago

Sorry for the disturbance, I have found a way to get the newlines in the text printed again :) I can use code.eq(0).text(squash_space=False)

gawel commented 6 years ago

Cool. I'm closing this then.