PuerkitoBio / goquery

A little like that j-thing, only in Go.
BSD 3-Clause "New" or "Revised" License

An edge case of DOM's Text() function #242

Closed sharmi closed 6 years ago

sharmi commented 6 years ago

Hi, thank you for this wonderful library. It is one of the building blocks of the go-colly crawler, which I use.

I have been running into a corner case of late. Consider this piece of HTML:

<span>The Item Name<p>Some Content about the item.</span>

Although there is no space between the two text items, they are rendered with proper spacing in the browser because of the p tag.

Unfortunately, the Text() function returns the two items merged without a space, like this: The Item NameSome Content about the item.

There are multiple cases where the spacing between pieces of text is not present in the markup itself but is rendered properly in the browser because of HTML tags or CSS.

Is it possible to have another function, TextWithSeparator(sep string), which takes a separator as input and appends that separator after each node.Data? The Text() function could then be rewritten as a call to TextWithSeparator with an empty string as input.

Text() = TextWithSeparator("")

I am a golang novice but I am willing to implement this if you agree. If there is a better way to handle this, I would like to know about it too.

Thank you

mna commented 6 years ago

Hello,

Thanks for the kind words! Regarding the Text() behaviour, you're right, and this is something that comes up quite a bit. However, I think there is no one right way to solve this in goquery (see my comment on this topic in this closed PR: https://github.com/PuerkitoBio/goquery/pull/239#issuecomment-372057754).

I think a neat way to solve this would be to implement a pretty-printer/html formatter package. I started working on something; it's too early to say if it will be finished and released, but basically it would take an *html.Node and format the tree based on a config (i.e. pretty-print HTML, minify HTML, print only the text with e.g. newlines for block elements and spaces for inline elements, etc.).

In the meantime, for a quick & simple solution that just inserts a space between text nodes, you can recursively process the Text-type *html.Nodes of the *goquery.Selection, writing the text to a bytes.Buffer (or a strings.Builder if you're on Go 1.10 or later) and adding a space after each write. Of course you may end up with multiple spaces if there was already space around the text, but if that's a problem you can trim the text prior to writing it.
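A minimal sketch of that quick approach follows. It is not code from this thread or from goquery. To keep it runnable with only the standard library, it assumes a small stand-in Node type that mirrors the Type, Data, FirstChild and NextSibling fields of html.Node from golang.org/x/net/html; with goquery, the same traversal would presumably run over the *html.Node values in the selection's Nodes slice. The name TextWithSeparator is borrowed from the proposal above and is hypothetical. This variant trims each text node and writes the separator between pieces rather than after each write, and the example tree is built by hand to approximate the span/p snippet.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are hypothetical, stdlib-only stand-ins for the
// html.NodeType and html.Node types from golang.org/x/net/html.
// Only the fields needed for the traversal are mirrored.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type                    NodeType
	Data                    string
	FirstChild, NextSibling *Node
}

// writeText walks the subtree rooted at n depth-first. Each text node is
// trimmed; non-empty results are written to sb with sep between them.
func writeText(sb *strings.Builder, n *Node, sep string) {
	if n.Type == TextNode {
		if s := strings.TrimSpace(n.Data); s != "" {
			if sb.Len() > 0 {
				sb.WriteString(sep)
			}
			sb.WriteString(s)
		}
		return
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		writeText(sb, c, sep)
	}
}

// TextWithSeparator is a hypothetical helper (not part of goquery) that
// joins the trimmed text of all given nodes and their descendants with sep.
func TextWithSeparator(nodes []*Node, sep string) string {
	var sb strings.Builder
	for _, n := range nodes {
		writeText(&sb, n, sep)
	}
	return sb.String()
}

func main() {
	// Hand-built tree approximating:
	// <span>The Item Name<p>Some Content about the item.</span>
	p := &Node{
		Type:       ElementNode,
		Data:       "p",
		FirstChild: &Node{Type: TextNode, Data: "Some Content about the item."},
	}
	span := &Node{
		Type:       ElementNode,
		Data:       "span",
		FirstChild: &Node{Type: TextNode, Data: "The Item Name", NextSibling: p},
	}

	fmt.Println(TextWithSeparator([]*Node{span}, " "))
	// prints "The Item Name Some Content about the item."
}
```

Because each text node is trimmed before it is written, calling this sketch with an empty separator is not exactly equivalent to Text(), which keeps the original whitespace; dropping the strings.TrimSpace call would bring it closer to the Text() = TextWithSeparator("") idea.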

Hope this helps, Martin

sharmi commented 6 years ago

Hi Martin,

I believe what you say makes sense. I am closing this request.

Meanwhile, I am just retrieving the inner HTML of the html.Node and passing it to the html2text library for a decent representation.