Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0

How to emulate 2018.1.9 parsing behavior? #292

Closed rico666 closed 4 years ago

rico666 commented 4 years ago

After making the big mistake of using html2text in one of my projects...

... I now face the following problem(s):

2019.8.11 removed the ability to retrieve text documents. OK, one can work around this by using wget, curl or whatever and giving html2text just the retrieved document.

Unfortunately, with the default settings, html2text 2018.1.9 and 2019.8.11 also differ in their parsing result significantly. 2018.1.9 prints out the URLs in the document as absolute, while 2019.8.11 prints them out as relative, etc.

My problem (and mistake) is that I relied on the output of 2018.1.9 for historic data as well, so there is now over a year of 2018.1.9-parsed text on my disk, and it looks like I'm not able to get 2019.8.11 to deliver the same (structurally) parsed text format.

Document retrieval aside, is there a set of compatibility params for 2019.8.11 to behave as 2018.1.9 did by default?

jdufresne commented 4 years ago

Unfortunately, with the default settings, html2text 2018.1.9 and 2019.8.11 also differ in their parsing result significantly. 2018.1.9 prints out the URLs in the document as absolute, while 2019.8.11 prints them out as relative, etc.

Do you know which commit this changed in? I don't see this mentioned in the changelog.

jdufresne commented 4 years ago

Can you provide the command line or minimal script to demonstrate the issue? If it produces output, can you be explicit as to what you expect? Thanks.
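One way to make such a report explicit is to save the text produced by each version and diff the two files. The following is only a minimal sketch using Python's standard `difflib`. The function name `diff_outputs`, the version labels, and the sample strings are illustrative assumptions, not actual html2text output.

```python
import difflib


def diff_outputs(old_text, new_text, old_label="2018.1.9", new_label="2019.8.11"):
    """Return a unified diff (as a list of lines) between two converted texts."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",  # input lines carry no newline, so headers should not either
        )
    )


# Hypothetical sample lines, made up for illustration only.
old_sample = "[Bitcoin](https://coinmarketcap.com/currencies/bitcoin/)\n"
new_sample = "[Bitcoin](/currencies/bitcoin/)\n"

for line in diff_outputs(old_sample, new_sample):
    print(line)
```

An empty result means the two outputs are identical line by line. Any `-`/`+` pairs show what the older version produced against what the newer one produces.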

rico666 commented 4 years ago

I used the html2text2 CLI script shipped with Arch Linux.

By default params I mean really the defaults, like

html2text2 https://coinmarketcap.com/all/views/all/ > cmc.txt

So for 2019.8.11 I replaced the fetch functionality with

wget --quiet https://coinmarketcap.com/all/views/all/ && html2text2 index.html > cmc.txt

but this delivers a completely different txt file.

I should mention that this is no longer an issue for me, as I wrote my own HTML::DOM parser (which I should have done in the first place). While I agree that updating the code base is important, I feel that deprecating or changing behavior in the process is a double-edged sword that can cost you users.
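For the relative-versus-absolute URL difference described above, one plausible explanation (an assumption, not confirmed in this thread) is that a locally saved index.html no longer carries the page's original URL, so relative links have nothing to be resolved against. A minimal sketch of a pre-processing step that rewrites `href` and `src` values to absolute URLs before the file is handed to the converter is shown below. The function name `absolutize_links` is hypothetical, and the regex is a deliberate simplification that only handles quoted attribute values rather than fully parsing HTML.

```python
import re
from urllib.parse import urljoin

# Matches quoted href/src attribute values such as href="/a/b" or src='img.png'.
# The lookbehind keeps attributes like data-href from being rewritten.
_LINK_ATTR = re.compile(
    r"""(?<![\w-])((?:href|src)\s*=\s*)(["'])(.*?)\2""",
    re.IGNORECASE | re.DOTALL,
)


def absolutize_links(html, base_url):
    """Resolve every quoted href/src value in `html` against `base_url`."""

    def _rewrite(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        # urljoin leaves already-absolute URLs unchanged.
        return f"{prefix}{quote}{urljoin(base_url, value)}{quote}"

    return _LINK_ATTR.sub(_rewrite, html)


# Illustrative input only, not taken from the real page.
sample = '<a href="/currencies/bitcoin/">Bitcoin</a>'
print(absolutize_links(sample, "https://coinmarketcap.com/all/views/all/"))
```

The rewritten HTML could then be saved and passed to the converter in place of the raw download. When using html2text as a Python library instead of the CLI, its `baseurl` argument appears to serve a similar purpose, though whether that reproduces the 2018.1.9 output exactly is untested here.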