flaxsearch / flaxcode

Automatically exported from code.google.com/p/flaxcode

remote_filter ifilter_filter ram use #71

Status: Open

GoogleCodeExporter commented 9 years ago
Richard reports that on Windows the process that runs the ifilter uses
increasing amounts of memory over time.

The remote_filter code already terminates the filter process (and starts a
new one) if it takes too long to filter a document, so one workaround might
be to periodically kill the process and start a new one.
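The kill-and-respawn workaround described above can be sketched as a small wrapper that owns the filter subprocess. This is an illustrative sketch, not Flax's actual code: `FILTER_CMD` stands in for whatever command launches the real filter worker.

```python
import subprocess
import sys

# Stand-in for the real filter worker command (hypothetical).
FILTER_CMD = [sys.executable, "-c", "import time; time.sleep(3600)"]

class FilterProcess:
    """Owns the filter subprocess so it can be killed and respawned."""

    def __init__(self, cmd=FILTER_CMD):
        self.cmd = cmd
        self.proc = None

    def start(self):
        self.proc = subprocess.Popen(self.cmd)

    def restart(self):
        # Kill the (possibly hung or bloated) worker, reap it, then
        # start a fresh one with a clean address space.
        if self.proc is not None:
            self.proc.kill()
            self.proc.wait()
        self.start()
```

The same `restart()` can then serve both triggers: a per-document timeout and any periodic recycling policy layered on top.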

Ideally it would be good to understand why this is happening and, if
possible, fix it.

Original issue reported on code.google.com by paul.x.r...@googlemail.com on 31 Oct 2007 at 9:15

GoogleCodeExporter commented 9 years ago
I notice in the ifilter filter that an Init() method is called for each
ifilter. Is there a corresponding cleanup method which we should be calling,
perhaps?
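For context on the cleanup question: IFilters are COM objects, so there is no `Cleanup()` counterpart to `Init()`; instead each loaded filter is reference-counted and should get a matching `Release()` once filtering finishes. A context manager makes that discipline hard to forget. In this sketch, `load_ifilter` is a hypothetical stand-in for however the ifilter_filter code obtains the COM object:

```python
from contextlib import contextmanager

@contextmanager
def managed_ifilter(load_ifilter, path):
    # load_ifilter is assumed to bind the file and call IFilter::Init.
    flt = load_ifilter(path)
    try:
        yield flt
    finally:
        # Balance the COM reference count even if filtering raises.
        flt.Release()
```

If the current code never releases the filter objects, each document would pin its filter (and any buffers it holds) for the life of the process, which would match the observed steady growth.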

Original comment by boulton.rj@gmail.com on 31 Oct 2007 at 9:30

GoogleCodeExporter commented 9 years ago
This may have been the cause of the indexer breaking last night after around
20,000 files: subsequent attempts to index PDFs gave:

2007-10-31 01:45:59,418: ERROR: Filtering file:
D:\Flax-development\testfiles\www.opsi.gov.uk\acts\acts2006\related\ukpgatod_20060043_en.pdf
with filter: <indexserver.remote_filter.RemoteFilterRunner object at 0x00BA1B10>
raised exception (-2147467259, 'Unspecified error', None, None), skipping

Original comment by charliej...@gmail.com on 31 Oct 2007 at 11:16

GoogleCodeExporter commented 9 years ago
I've made a few experiments. I suspect that the memory problem is something
to do with the Adobe PDF IFilter. See
http://flaxcode.googlecode.com/svn/trunk/src/test/issue71.py

Original comment by paul.x.r...@googlemail.com on 31 Oct 2007 at 6:27

GoogleCodeExporter commented 9 years ago
Hmm. This page (admittedly from a competing product) says Adobe's latest
IFilter leaks:
http://markharrison.co.uk/blog/2007/05/foxit-pdf-ifilter-x64-and-32-bit.htm

Original comment by charliej...@gmail.com on 31 Oct 2007 at 9:56

GoogleCodeExporter commented 9 years ago
We need to do something about this for 1.0.

The simplest solution is to restart every N documents (where N is, say, 1000).
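The "restart every N documents" policy amounts to a counter that triggers the existing restart machinery. A minimal sketch, where `restart` is assumed to be whatever callable kills and respawns the filter subprocess:

```python
class RestartEveryN:
    """Recycle the filter subprocess after every n documents."""

    def __init__(self, restart, n=1000):
        self.restart = restart  # callable that kills/respawns the worker
        self.n = n
        self.count = 0

    def document_done(self):
        self.count += 1
        if self.count >= self.n:
            self.restart()
            self.count = 0
```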

A better solution might be to monitor the memory usage of the subprocess and
restart it if it gets above a certain value. I have some Python which can
help with this, and will experiment. However, I'm leaving this issue assigned
to Paul, who should implement the simple "restart every N docs" solution for
now; once that's done, please reassign this bug to me.
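The memory-monitoring variant can be sketched with the platform-specific probe injected, so the Windows-specific part (reading the subprocess's working-set size, e.g. via the win32 API or a library such as psutil) stays out of the core logic. Both `get_rss_bytes` and `restart` are hypothetical callables here:

```python
def check_memory(get_rss_bytes, restart, limit_bytes=200 * 1024 * 1024):
    """Restart the filter subprocess if its resident size exceeds the limit.

    Returns True if a restart was triggered, so callers can log it.
    """
    if get_rss_bytes() > limit_bytes:
        restart()
        return True
    return False
```

Calling this between documents (or on a timer) gives the same recovery as the per-document counter, but only pays the restart cost when the leak has actually accumulated.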

Original comment by boulton.rj@gmail.com on 1 Nov 2007 at 4:50

GoogleCodeExporter commented 9 years ago
The "restart every N docs" solution is done - reassigning to Richard.

Original comment by paul.x.r...@googlemail.com on 1 Nov 2007 at 5:15

GoogleCodeExporter commented 9 years ago
We're happy to leave the memory-monitoring part of this to 1.1 (though
there's a chance I'll get it done before that).

Original comment by boulton.rj@gmail.com on 1 Nov 2007 at 5:56

GoogleCodeExporter commented 9 years ago

Original comment by boulton.rj@gmail.com on 2 Nov 2007 at 12:59

GoogleCodeExporter commented 9 years ago

Original comment by charliej...@gmail.com on 19 Aug 2009 at 3:28