flaxsearch / flaxcode

Automatically exported from code.google.com/p/flaxcode

Replace existing IFilter for HTML and plain text files #142

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
The current system uses IFilters for everything; this is very slow, and for HTML and plain text files we can almost certainly do better (tests on a 2.5GHz P4 with 1GB RAM only index about 200 HTML files a minute). If it's simple to do, we should use an alternative filter for these filetypes.

Original issue reported on code.google.com by charliej...@gmail.com on 22 Nov 2007 at 5:15
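
For context on what an alternative filter might involve: an HTML filter only has to strip markup and scripts and hand back plain text. The sketch below is a minimal pure-Python illustration using the standard library's HTMLParser; it is not the htmltotext filter that the thread later benchmarks, which is a separate library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def html_to_text(markup):
    """Return the text content of an HTML string, whitespace-normalised."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Plain text files need no filtering beyond a read and a character-set decode, which is why routing them through an IFilter is pure overhead.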

GoogleCodeExporter commented 9 years ago
At one point we did have something on the options page to configure which is used for a given file type. Maybe we should reinstate that, at least to choose in simple cases like that one?

It's hard to compare different filters if you have to change the code to do so...

Original comment by paul.x.r...@googlemail.com on 22 Nov 2007 at 5:30
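
A per-filetype choice like the one described above could be as simple as a mapping from extension to filter callable. The sketch below is hypothetical; the names are illustrative, not Flax's actual internals, and html_to_text is the filter sketched earlier.

```python
import os

def ifilter_extract(raw):
    """Placeholder standing in for the existing IFilter-based path."""
    raise NotImplementedError

# Hypothetical registry; html_to_text is the filter sketched earlier.
FILTER_REGISTRY = {
    ".html": html_to_text,
    ".htm": html_to_text,
    ".txt": lambda raw: raw,  # plain text needs no stripping
}

def choose_filter(path, overrides=None):
    """Pick a filter callable for a file, honouring any user overrides."""
    table = dict(FILTER_REGISTRY)
    table.update(overrides or {})
    ext = os.path.splitext(path)[1].lower()
    return table.get(ext, ifilter_extract)
```

Wiring overrides up to the options page would give the side-by-side filter comparison asked for here without code changes.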

GoogleCodeExporter commented 9 years ago
The other thing that occurs to me is that we don't really know how much of the issue is to do with the remote nature of the filter and how much is to do with the actual filter. We could run (some) IFilters in-process, which would certainly speed things up. There is of course a worry about the filter bringing the whole process down, but as far as I know we have only really seen problems with the Adobe IFilter so far.

Original comment by paul.x.r...@googlemail.com on 22 Nov 2007 at 5:58
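
The trade-off being weighed here is crash isolation versus process overhead. The out-of-process pattern can be as simple as the sketch below, where run_filter.py is a hypothetical wrapper script that loads one filter, processes one file, and prints the text; if the filter crashes, only the child process dies.

```python
import subprocess
import sys

def filter_out_of_process(path):
    """Run a filter on one file in a child process, so a crashing
    filter (e.g. a bad IFilter) cannot take the indexer down."""
    proc = subprocess.Popen(
        [sys.executable, "run_filter.py", path],  # hypothetical wrapper
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("filter failed on %s: %s" % (path, err.decode()))
    return out.decode("utf-8", "replace")
```

As the timings later in the thread show, the remote and in-process runs came out within roughly ten minutes of each other over a 55,000-document corpus, so the isolation is not expensive here.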

GoogleCodeExporter commented 9 years ago
By all means we should do some profiling and tests - but we absolutely mustn't try to change the UI at this stage. Maybe after 1.0.

Original comment by boulton.rj@gmail.com on 22 Nov 2007 at 8:42

GoogleCodeExporter commented 9 years ago
Oh - also, I don't like the idea of bringing the IFilter stuff in-process at all; it may be safe for some IFilters, but it's definitely not for others, and we have no control over which IFilters are installed on the machines. (We can, and do, suggest changes in the installer, but we can't control things, and we can't really do anything about filters installed after Flax is installed, for now.)

It's a bit of a shame if Flax is slower than we'd like for some filetypes - it's a disaster if it's unstable and unable to reliably complete indexing runs.

Original comment by boulton.rj@gmail.com on 22 Nov 2007 at 8:44

GoogleCodeExporter commented 9 years ago
(That said, I'm not against testing with the ifilter stuff in-process to see 
how it
affects performance - I'm just against releasing anything which does that, at 
present.)

Original comment by boulton.rj@gmail.com on 22 Nov 2007 at 8:45

GoogleCodeExporter commented 9 years ago
I think for now it would be best if we just mod the code directly to enable a different plain text and HTML filter. If we can do this reasonably quickly I can run some comparative speed tests over the weekend...

Original comment by charliej...@gmail.com on 23 Nov 2007 at 9:56
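
A comparative speed test of the kind proposed needs little more than wall-clock timing of each candidate filter over the same corpus. A minimal harness might look like this (html_to_text is the filter sketched earlier; the corpus path is illustrative):

```python
import os
import time

def benchmark(filter_func, paths):
    """Time a filter callable over a corpus; return docs per minute."""
    start = time.time()
    for path in paths:
        with open(path, "rb") as f:
            filter_func(f.read().decode("utf-8", "replace"))
    elapsed = time.time() - start
    return len(paths) / (elapsed / 60.0)

corpus = [os.path.join(d, name)
          for d, _, names in os.walk("hmso-corpus")  # illustrative path
          for name in names if name.endswith(".html")]
print("%.0f docs/minute" % benchmark(html_to_text, corpus))
```

Note this times only the filtering step; the numbers reported below are for whole indexing runs, so they also include Xapian's indexing work.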

GoogleCodeExporter commented 9 years ago
Quick comparison on my machine using the HMSO data set with just the HTML file type selected:

- remote IFilter: 1 hour 14 mins
- in-process IFilter: 1 hour 23 mins
- in-process htmltotext_filter: 54 mins

Should also try the htmltotext thing out of process.

See also my comments about CPU usage in
http://code.google.com/p/flaxcode/issues/detail?id=144

Original comment by paul.x.r...@googlemail.com on 23 Nov 2007 at 11:42

GoogleCodeExporter commented 9 years ago
On my machine using build 0.9.0.830 I indexed the HMSO data (HTML only) - 3096 docs in 5 mins 23 seconds, which gives a docs/minute rate of 575 - a lot quicker than the 200 a minute we were getting previously.

Over the weekend I'll run a huge index over 700,000 HTML docs on my test box and see how long it takes.

C

Original comment by charliej...@gmail.com on 23 Nov 2007 at 1:22
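
(As a check on that rate: 5 mins 23 seconds is about 5.38 minutes, and 3096 / 5.38 ≈ 575 docs/minute, so the arithmetic holds.)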

GoogleCodeExporter commented 9 years ago
To complete the set - the htmltotext filter running as a remote filter was 55 minutes.

A few observations:

- For the 4 runs there were no errors or exceptions.
- Using htmltotext seems faster than the IFilter, but not by a huge amount.
- Running remotely or not doesn't seem to make a very big difference to the overall time.
- These weren't CPU times, but actual elapsed times - though not much else was going on with the machine at the time.
- This is a dual-core machine, but I think that all the processes were probably running on the same CPU (see issue 144).
- Watching Task Manager in the cases where the filters were running in separate processes suggests that most of the CPU is actually in the main process, not the filtering process, so if we're looking for speedups the filter may not really be the main thing to worry about - some profiling would be useful, I guess.

Original comment by paul.x.r...@googlemail.com on 23 Nov 2007 at 1:37
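
On the profiling suggestion: since most of the CPU appears to be in the main Python process, the standard library's cProfile would show where it goes. A minimal sketch, assuming a top-level entry point (index_collection is a hypothetical name, not Flax's actual API):

```python
import cProfile
import pstats

def index_collection():
    """Hypothetical stand-in for the main indexing entry point."""
    pass

cProfile.run("index_collection()", "indexing.prof")
stats = pstats.Stats("indexing.prof")
stats.sort_stats("cumulative").print_stats(20)  # show the 20 hottest calls
```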

GoogleCodeExporter commented 9 years ago
BTW - those timings were for about 55,000 documents. So we're a bit under 1000 docs/min.

Original comment by paul.x.r...@googlemail.com on 23 Nov 2007 at 2:35

GoogleCodeExporter commented 9 years ago
The code in svn now does what this issue asks for. Are we happy with the status quo, at least for 1.0, or does something more/different need to be done?

Original comment by paul.x.r...@googlemail.com on 28 Nov 2007 at 9:10

GoogleCodeExporter commented 9 years ago
That's fine for now.  Marking as fixed.

Original comment by boulton.rj@gmail.com on 28 Nov 2007 at 9:21

GoogleCodeExporter commented 9 years ago
With a bit of refactoring and pushing the filtering to separate processes running simultaneously I've done the same collection in ~35 minutes. It does appear that the bottleneck is the work Xapian does, which can't run on separate CPUs for a single collection at the moment. We could partition document collections into a number of Xapian databases. Since we can search across multiple databases it should be possible to make that mapping more or less transparent to users.

Another possibility is to generate separate databases and then merge them; I'm not sure whether that would be any faster than just making one database for all the collections in the first place.

Original comment by paul.x.r...@googlemail.com on 1 Dec 2007 at 12:07
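
The transparent-partitioning idea rests on Xapian's support for treating several databases as one at search time. A minimal sketch with the Python bindings (the partition paths and query are illustrative):

```python
import xapian

# Combine the per-partition databases into one virtual database.
db = xapian.Database("collection-part0")
db.add_database(xapian.Database("collection-part1"))
db.add_database(xapian.Database("collection-part2"))

# Parse and run a query against the combined database.
qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem("english"))
qp.set_database(db)
query = qp.parse_query("crown copyright")

enquire = xapian.Enquire(db)
enquire.set_query(query)
for match in enquire.get_mset(0, 10):
    print("%i: %s" % (match.rank + 1,
                      match.document.get_data().decode("utf-8", "replace")))
```

Each partition can then have its own writer process, which is what lets the indexing work spread across CPUs; merging the finished partitions afterwards (the second option mentioned) is also possible, for example with Xapian's xapian-compact tool.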