TeamHG-Memex / html-text

Extract text from HTML
MIT License

Blank lines created by <br> cannot be parsed correctly #23

Open luyuhuang opened 4 years ago

luyuhuang commented 4 years ago

Hi all,

When I try to convert the following HTML to plain text:

<div>aaa</div>
<br>
<div>bbb</div>

the output is

aaa
bbb

but I think there should be a blank line between aaa and bbb. Reading the code, I found that the blank line created by <br> is ignored because context.prev is already _NEWLINE (created by the previous <div>). Is there a way to solve this problem? Thank you very much.
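
To make the expected behaviour concrete, here is a minimal sketch of one way to get it. It is not html-text's code: it uses only the standard library's html.parser, and the names BLOCK_TAGS, TextExtractor and sketch_extract_text are hypothetical. Instead of remembering only whether the previous token was a newline, it counts pending newlines, so a <br> that follows a block boundary still adds one more line break. Whitespace between inline elements is not preserved, so this only illustrates the newline handling.

```python
from html.parser import HTMLParser

# Illustrative subset of block-level tags (assumption, not html-text's list).
BLOCK_TAGS = {"div", "p"}


class TextExtractor(HTMLParser):
    """Collects text, counting newlines owed before the next text chunk."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.pending_newlines = 0

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # Every <br> adds a line break, even right after a block boundary.
            self.pending_newlines += 1
        elif tag in BLOCK_TAGS:
            # A block boundary needs at least one line break, but adds no extra.
            self.pending_newlines = max(self.pending_newlines, 1)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.pending_newlines = max(self.pending_newlines, 1)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return  # whitespace-only text between tags is ignored
        if self.parts:
            # Leading newlines before the first text chunk are dropped.
            self.parts.append("\n" * self.pending_newlines)
        self.parts.append(text)
        self.pending_newlines = 0


def sketch_extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(repr(sketch_extract_text("<div>aaa</div>\n<br>\n<div>bbb</div>")))
# 'aaa\n\nbbb'
```

With this counting approach, two adjacent <div> elements without a <br> are still separated by a single newline only.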

lopuhin commented 4 years ago

I agree that an extra newline makes sense here 👍

printROSHN commented 1 year ago

Is this issue still open? Can I have a go at it?

lopuhin commented 1 year ago

@printROSHN yes, that would be great!