aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0
2.63k stars 414 forks

Memory leaks #78

Open ghost opened 11 years ago

ghost commented 11 years ago

Hi guys,

I really couldn't figure out the exact cause of the problem, but the fact is that I'm using your code to process around 5,000 HTML documents and my RAM is filling up quickly. I'm 100% sure that it's your code, because when I replaced it with simple HTML tag removal the leak was gone.

Sorry for not being more informative, but I guess it's pretty easy to set up an experiment yourselves.

mcepl commented 10 years ago

How large are those documents? Is it the number of documents (which shouldn't make a difference if you are running one at a time) or their size?

ghost commented 10 years ago

Hi,

Once again, sorry for not being so useful, but this is the deal:

I had a set of around 4,000 EPUBs. I opened them, extracted all the HTML files, and used your software to transform them to text for later processing. I guess each EPUB had around 10 HTML files inside, so we are talking about 40,000 files, which I was processing in parallel on 6 processors. The processing was for indexing the documents: for each HTML file, I transformed it, ran through the text once, and then discarded it. You'll have to trust me that I was making extra sure to close all the files and remove everything after processing them (so I'm 100% sure the memory leak was not on my side of the code).

The deal is that with your software the whole thing was consuming my RAM (8GB), so I decided not to use your code and to just strip the HTML markup and process the text like that (with the downside that I was then indexing some CSS code, but that was not a big deal for my program's logic). After I did that, the memory leak disappeared and the whole run used something like 1GB of RAM.

Once again, trust me: I double- and triple-checked that the memory leak was not on my side of the code and that I was not calling your code improperly. My final and definite conclusion is that your code had a memory leak.

I think it's not hard for you to set up an experiment yourself for testing this. Get a bunch of HTML docs, transform them to text, and then remove them. By the way, I'm also sure it was not a problem with the garbage collector.

Cheers
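
For anyone who wants to try the batch scenario described above, here is a minimal sketch of that kind of pipeline. It is not the reporter's actual script: the function names, the word-count "indexing" step, and the in-memory sample archive are all hypothetical. It relies only on the fact that an EPUB is a ZIP archive, and it uses plain tag removal as a stand-in converter so it runs without html2text installed.

```python
import io
import re
import zipfile

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def iter_html_from_epub(epub):
    """Yield (name, html) for each HTML member of an EPUB (a ZIP archive).

    `epub` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if name.lower().endswith(HTML_SUFFIXES):
                # errors="replace" keeps the sketch robust to odd encodings.
                yield name, archive.read(name).decode("utf-8", errors="replace")


def strip_tags(html):
    """Crude stand-in converter: replace anything tag-like with a space."""
    return re.sub(r"<[^>]+>", " ", html)


def count_words(epub, convert=strip_tags):
    """Convert every HTML file in one EPUB and keep only a word count."""
    total = 0
    for _name, html in iter_html_from_epub(epub):
        text = convert(html)        # the converter under test goes here
        total += len(text.split())  # hypothetical "indexing"; text is then dropped
    return total


def make_sample_epub():
    """Build a tiny in-memory archive so the sketch runs without real EPUBs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("chapter1.xhtml", "<p>hello world</p>")
        archive.writestr("style.css", "p { color: red }")
        archive.writestr("chapter2.html", "<h1>one</h1><p>two three</p>")
    buf.seek(0)
    return buf


print(count_words(make_sample_epub()))  # prints 5
```

To exercise the library instead of the stand-in, pass `convert=html2text.html2text` (assuming the html2text module is importable) and watch the process's memory while looping over many real EPUB paths. The reporter ran six such workers in parallel; this sketch is single-process to keep it simple.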


mcepl commented 10 years ago

BTW, it is not "my code" ... that was just a random drive-by comment. Is the script you created (or at least a substantial part of it) available somewhere?

ghost commented 10 years ago

Oh man, sorry, I thought you were the code's developer.

And regarding the code, I can't publish it, sorry. Besides, after I decided to stop using this code my program changed a lot, so even if I gave it to you it would be hard to find the part where I used to use it. However, as I said before, it's really not hard to set up an experiment to check the memory leak (just do an infinite loop and parse the same document many times, and you'll see your RAM slowly get consumed).
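
The loop experiment suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `measure_growth`, `strip_tags`, and `SAMPLE` are hypothetical names, the loop is bounded rather than infinite, and the standard library's `tracemalloc` only sees memory allocated through Python's allocator, so it may not match what the OS reports as RAM usage.

```python
import gc
import re
import tracemalloc


def strip_tags(html):
    """Stand-in converter (plain tag removal), used only to keep this runnable."""
    return re.sub(r"<[^>]+>", " ", html)


def measure_growth(convert, html, iterations=1000):
    """Return the change in traced Python heap size (bytes) over the loop."""
    tracemalloc.start()
    try:
        convert(html)  # warm-up, so one-time caches are not counted as a leak
        gc.collect()
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            convert(html)  # parse the same document again; result is discarded
        gc.collect()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical sample document; any real HTML string would do.
SAMPLE = "<html><body>" + "<p>Some <b>bold</b> text.</p>" * 200 + "</body></html>"

growth = measure_growth(strip_tags, SAMPLE)
print(f"heap growth after loop: {growth} bytes")
```

To check the library itself, call `measure_growth(html2text.html2text, SAMPLE)` (assuming the html2text module is importable). A figure that keeps rising as `iterations` increases would point to objects being retained between calls; a roughly constant figure would suggest the growth comes from somewhere else.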


mcepl commented 10 years ago

Well, the problem with this project is that upstream is dead (quite literally in this case, unfortunately), so we are all waiting for the succession to be resolved. I am trying to salvage bits and pieces of further development in my own repo, but I don't feel like making any deep changes before the new maintainer arrives.

Alir3z4 commented 10 years ago

Related issue: https://github.com/Alir3z4/html2text/issues/13