Open ghost opened 11 years ago
How large are those documents? Is it the number of documents (which shouldn't make a difference if you are processing one at a time) or their size?
Hi,
Once again, sorry for not being more helpful, but here is the situation:
I had a set of around 4,000 epubs. I opened them, extracted all the HTML files, and used your software to convert them to text for later processing. Each epub had roughly 10 HTML files inside, so we are talking about 40,000 files that I was processing in parallel across 6 processors. The processing was for indexing the documents: for each HTML file I converted it, ran through the text once, and then discarded it. You'll have to trust me that I made extra sure to close all the files and release everything after processing them, so I'm 100% sure the memory leak was not on my side of the code.
The deal is that with your software the whole run was consuming my RAM (8 GB), so I decided to stop using your code and simply strip the HTML markup and process the text like that (with the downside that I was then also indexing some CSS, but that was not a big deal for my program's logic). After I did that, the memory leak disappeared and the whole run used something like 1 GB of RAM.
Once again, trust me: I double- and triple-checked that the memory leak was not on my side of the code, and that I was not calling your code improperly. My final and definite conclusion is that your code has a memory leak.
I think it's not hard for you to set up an experiment to test this: get a bunch of HTML docs, convert them to text, and then discard them. By the way, I'm also sure it was not a problem with the garbage collector.
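One way to rule out the garbage collector, along the lines the reporter describes, is to convert a batch of documents, discard each result, force a collection, and compare live object counts before and after. This is only a sketch of the methodology: `strip_tags` below is a made-up regex stand-in for the converter, not html2text itself.

```python
# Sketch: convert and discard many documents, then verify that a forced
# garbage collection leaves roughly the same number of live objects.
# strip_tags is a toy stand-in for the real converter.
import gc
import re

def strip_tags(html):
    """Naive tag removal, standing in for the HTML-to-text conversion."""
    return re.sub(r"<[^>]+>", "", html)

docs = ["<p>document %d</p>" % i for i in range(10_000)]
strip_tags(docs[0])  # warm up the re pattern cache before measuring

gc.collect()
before = len(gc.get_objects())

for doc in docs:
    text = strip_tags(doc)   # convert...
    del text                 # ...and discard, like the reporter's pipeline

gc.collect()
after = len(gc.get_objects())
print(f"live-object delta after collection: {after - before}")
```

If the converter leaks (e.g. by keeping references in module-level state), the delta keeps climbing with the number of documents instead of staying near zero.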
Cheers
2013/11/7 Matěj Cepl notifications@github.com
BTW, it is not "my code" ... that was just a random drive-by comment. Is the script you have created (or at least substantial part) available somewhere?
Oh man, sorry, I thought you were the code's developer.
And regarding the code, I'm sorry but I can't publish it. Besides, after I decided to stop using this code my program changed a lot, so even if I gave it to you it would be hard to find the part where I used it. However, as I said before, it's really not hard to set up an experiment to check for the memory leak: just run an infinite loop that parses the same document over and over, and you'll see your RAM slowly get consumed.
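The loop experiment suggested above can be sketched with `tracemalloc`, which tracks Python allocations between checkpoints. Since the reporter's script is not available, `leaky_convert` below is a deliberately leaky made-up stand-in (it caches every input forever) used only to show what a positive result looks like:

```python
# Sketch: parse the same document repeatedly and check whether traced
# memory keeps growing between two checkpoints. leaky_convert simulates
# a converter with a leak by keeping a reference to every input.
import tracemalloc

_cache = []  # simulated leak: grows on every call and is never cleared

def leaky_convert(html):
    _cache.append(html * 10)  # keeps a reference forever
    return html.replace("<p>", "").replace("</p>", "")

SAMPLE = "<p>Hello, world!</p>" * 100

tracemalloc.start()
for _ in range(1_000):
    leaky_convert(SAMPLE)
first, _ = tracemalloc.get_traced_memory()

for _ in range(1_000):
    leaky_convert(SAMPLE)
second, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# With a real leak, the second checkpoint is clearly higher than the first.
print(second > first)  # → True for this leaky stand-in
```

Running the same loop against the real converter (and a non-leaky baseline such as plain tag stripping) would show whether traced memory plateaus or climbs without bound.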
2013/11/8 Matěj Cepl notifications@github.com
Well, the problem with this project is that upstream is dead (quite literally in this case, unfortunately), so we are all waiting for the resolution of the succession. I am trying to salvage bits and pieces of further development in my own repo, but I don't feel like making any deep changes before the new maintainer arrives.
Related issue: https://github.com/Alir3z4/html2text/issues/13
Hi guys,
I really couldn't figure out the exact cause of the problem, but the fact is that I'm using your code to process around 5,000 HTML documents and my RAM fills up quickly. I'm 100% sure it's your code, because when I replaced it with simple HTML tag removal the leak was gone.
Sorry for not being more informative, but I guess it's pretty easy to set up an experiment yourselves.