flaxsearch / flaxcode

Automatically exported from code.google.com/p/flaxcode

htmltotext memory leak? #192

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Run this Python script with the attached HTML file as input:

import sys
import htmltotext

# Read the HTML file named on the command line and extract its text.
with open(sys.argv[1]) as f:
    html = f.read()
htmltotext.extract(html)

What is the expected output? What do you see instead?

Expect the script to terminate with no output. Instead, the script hangs -
'top' reveals the memory usage growing alarmingly. Eventually, the kernel
will kill the process.

What version of the product are you using? On what operating system?

htmltotext 0.7

Linux newcastle 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 07:18:46 EST 2008
x86_64 x86_64 x86_64 GNU/Linux

Please provide any additional information below.

The attached HTML is 3.2MB - there are rather a lot of links in it, but,
for example, the links text browser handles it fine.

Original issue reported on code.google.com by tom...@metahusky.net on 29 Jul 2008 at 3:28

GoogleCodeExporter commented 9 years ago
This sounds more like a bug in the parser than a memory leak. The test case
should be enough to reproduce and fix it, though - prod me if I don't get to
it within the next couple of days.

Original comment by boulton.rj@gmail.com on 29 Jul 2008 at 5:21

GoogleCodeExporter commented 9 years ago
It was a bug in the parser - the parser didn't know about empty tags, so it
was trying to make a huge list of "br" tags be the parent of each "a" tag.
This was using up vast quantities of memory.

I've fixed the parser to understand the list of standard HTML tags which are
empty, and not to use up lots of memory when parsing them. Invalid HTML could
still cause a large waste of memory, so it would still be good to improve the
parser to avoid this happening. However, the immediate problem is fixed (with
htmltotext release 0.7.2), so marking this issue as fixed.
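
For anyone wanting to see the effect in isolation, here is a rough sketch of
the failure mode described above. It is not the htmltotext parser: it uses
Python's standard html.parser module, and the names EMPTY_TAGS, DepthTracker
and max_depth are made up for the illustration. It only tracks how deep the
stack of open elements gets when "br" is, or is not, treated as an empty tag.

```python
from html.parser import HTMLParser

# HTML 4 elements that never have content or an end tag.
EMPTY_TAGS = frozenset([
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
])


class DepthTracker(HTMLParser):
    """Record the deepest stack of open elements seen while parsing."""

    def __init__(self, empty_tags=frozenset()):
        super().__init__()
        self.empty_tags = empty_tags
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.empty_tags:
            return  # an empty element never becomes a parent
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Close everything back to the matching open tag; ignore strays.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def max_depth(html, empty_tags=frozenset()):
    """Return the maximum open-element depth reached for this input."""
    tracker = DepthTracker(empty_tags)
    tracker.feed(html)
    tracker.close()
    return tracker.max_depth


# A page with many links separated by line breaks, as in the report.
sample = '<a href="#">link</a><br>' * 1000

naive = max_depth(sample)              # "br" is never closed, so it piles up
aware = max_depth(sample, EMPTY_TAGS)  # "br" is skipped as an empty element
```

Without the empty-tag list, every unclosed "br" stays on the stack, so each
later "a" sits under all of the preceding ones and the depth grows with the
number of links. With the list, the depth stays constant. A parser that keeps
a parent list per element would presumably pay for that depth again on every
new element, which fits the memory growth seen here.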

Original comment by boulton.rj@gmail.com on 29 Jul 2008 at 7:44