cheeriojs / cheerio

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
https://cheerio.js.org
MIT License

text() method merges text parts #3457

Open kvetoslavnovak opened 11 months ago

kvetoslavnovak commented 11 months ago

The text() method in jQuery keeps whitespace between text parts. Cheerio unfortunately merges the text parts from separate tags together, giving you gibberish words.

See the example at https://api.jquery.com/text/. I tested it myself, and jQuery's text() method keeps whitespace between the text parts of the tags. This is very useful.

Cheerio, on the other hand, runs the text parts together without any spaces, which is a pain.

E.g. from this

<TITLE>
<TI>
<P>PART ONE</P>
</TI>
<STI>
<P>
<HT TYPE="BOLD">GENERAL PROVISIONS</HT>
</P>
</STI>
</TITLE>

jQuery text() gives you the expected text PART ONE GENERAL PROVISIONS, but Cheerio gives you PART ONEGENERAL PROVISIONS.

The standard HTML innerText property behaves the same way as jQuery here.

kvetoslavnovak commented 11 months ago

OK. This seems to do the trick:

cheerio(elem).html().replaceAll(/<\/?[a-zA-Z0-9=" ]*>/g, ' ').replace(/\s\s+/g, ' ').trim()

But it would be better if Cheerio kept the whitespace, which the user could then trim manually as needed.
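A possible alternative to the regex is to walk the node tree and join the text nodes with a space. The following is only a minimal sketch, not Cheerio's API: `textWithSpaces` is a hypothetical helper, and it assumes the nodes have the domhandler-style shape (`type`, `data`, `children`) that Cheerio elements expose. The tree here is built by hand so the snippet runs without any dependency.

```javascript
// Hypothetical helper (not part of Cheerio): collect text from a
// domhandler-style node tree, joining separate text nodes with one space.
function textWithSpaces(node) {
  const parts = [];
  const stack = [node];
  while (stack.length > 0) {
    const current = stack.pop();
    if (current.type === 'text') {
      // Collapse internal whitespace and skip whitespace-only nodes.
      const cleaned = current.data.replace(/\s+/g, ' ').trim();
      if (cleaned) parts.push(cleaned);
    } else if (Array.isArray(current.children)) {
      // Push children in reverse so they are visited in document order.
      for (let i = current.children.length - 1; i >= 0; i--) {
        stack.push(current.children[i]);
      }
    }
  }
  return parts.join(' ');
}

// Hand-built stand-ins for parsed nodes (assumed shape, for illustration only).
const text = (data) => ({ type: 'text', data });
const tag = (name, children) => ({ type: 'tag', name, children });

const title = tag('TITLE', [
  tag('TI', [tag('P', [text('PART ONE')])]),
  tag('STI', [tag('P', [tag('HT', [text('GENERAL PROVISIONS')])])]),
]);

console.log(textWithSpaces(title)); // PART ONE GENERAL PROVISIONS
```

With a real document, the same helper would presumably be called on a selected element, for example `textWithSpaces($('TITLE')[0])`. Note that it also puts a space between inline fragments such as `foo<b>bar</b>`, so it is not a drop-in replacement for text().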

apsquared commented 10 months ago

Facing the same issue, and I'm hitting it via langchain CheerioWebLoad, so I can't use the approach above.

Properko commented 9 months ago

I believe this is related to https://github.com/cheeriojs/cheerio/issues/2841, and it's really unfortunate 😬 That regex can get heavy for huge HTML documents.