At one point we did have something on the options page to configure which filter is used for a given file type. Maybe we should reinstate that, at least to choose in simple cases like that one? It's hard to compare different filters if you have to change the code to do so...
Original comment by paul.x.r...@googlemail.com on 22 Nov 2007 at 5:30
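[Editor's note: a minimal sketch of the per-file-type filter selection being discussed. The mapping, filter names, and choose_filter helper are hypothetical illustrations, not Flax's actual option names.]

    import os

    # Hypothetical mapping from extension to filter backend; in Flax this
    # would presumably be driven by the options page rather than hard-coded.
    FILTER_FOR_EXTENSION = {
        ".html": "htmltotext",
        ".htm": "htmltotext",
        ".pdf": "ifilter",
        ".doc": "ifilter",
    }

    def choose_filter(filename, default="ifilter"):
        """Pick a filter backend based on the file's extension."""
        ext = os.path.splitext(filename)[1].lower()
        return FILTER_FOR_EXTENSION.get(ext, default)

Exposing a table like this through the options page would let filters be swapped and compared without code changes, which is the point raised above.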
The other thing that occurs to me is that we don't really know how much of the issue is down to the remote nature of the filter and how much is down to the actual filter. We could run (some) IFilters in process, which would certainly speed things up. There is of course a worry about a filter bringing the whole process down, but as far as I know we have only really seen problems with the Adobe IFilter so far.
Original comment by paul.x.r...@googlemail.com on 22 Nov 2007 at 5:58
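[Editor's note: for illustration, a minimal sketch of the in-process/remote trade-off in Python. filter_func and filter_cmd are hypothetical stand-ins, not Flax's actual filter interface.]

    import subprocess

    def run_filter_in_process(filter_func, path):
        # Fast: no process start-up or IPC cost, but a crashing
        # filter takes the whole indexer down with it.
        return filter_func(path)

    def run_filter_remote(filter_cmd, path, timeout=60):
        # Slower: pays process start-up and IPC overhead, but a
        # crash or hang is isolated to the child process.
        result = subprocess.run([filter_cmd, path], capture_output=True,
                                timeout=timeout, check=True)
        return result.stdout.decode("utf-8", errors="replace")

Timing the same document set through both paths would separate the process overhead from the filter's own cost, which is the unknown identified above.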
By all means we should do some profiling and tests - but we absolutely mustn't try to change the UI at this stage. Maybe after 1.0.
Original comment by boulton.rj@gmail.com on 22 Nov 2007 at 8:42
Oh - also, I don't like the idea of bringing the IFilter stuff in-process at all; it may be safe for some IFilters, but it's definitely not for others, and we have no control over which IFilters are installed on the machines. (We can, and do, suggest changes in the installer, but we can't control things, and we can't really do anything about filters installed after Flax is installed, for now.)
It's a bit of a shame if Flax is slower than we'd like for some file types - it's a disaster if it's unstable and unable to reliably complete indexing runs.
Original comment by boulton.rj@gmail.com on 22 Nov 2007 at 8:44
(That said, I'm not against testing with the IFilter stuff in-process to see how it affects performance - I'm just against releasing anything which does that, at present.)
Original comment by boulton.rj@gmail.com on 22 Nov 2007 at 8:45
I think for now it would be best if we just modify the code directly to enable a different plain text and HTML filter. If we can do this reasonably quickly I can run some comparative speed tests over the weekend...
Original comment by charliej...@gmail.com on 23 Nov 2007 at 9:56
Quick comparison on my machine using the HMSO data set with just the HTML file type selected:
- remote IFilter: 1 hour 14 minutes
- in-process IFilter: 1 hour 23 minutes
- in-process htmltotext_filter: 54 minutes
Should also try the htmltotext filter out of process.
See also my comments about CPU usage in http://code.google.com/p/flaxcode/issues/detail?id=144
Original comment by paul.x.r...@googlemail.com on 23 Nov 2007 at 11:42
On my machine, using build 0.9.0.830, I indexed the HMSO data (HTML only) - 3096 docs in 5 mins 23 seconds, which gives a rate of 575 docs/minute - a lot quicker than the 200 a minute we were getting previously.
Over the weekend I'll run a huge index over 700,000 HTML docs on my test box and see how long it takes.
C
Original comment by charliej...@gmail.com on 23 Nov 2007 at 1:22
To complete the set - the htmltotext filter running as a remote filter took 55 minutes.
A few observations:
- For the four runs there were no errors or exceptions.
- Using htmltotext seems faster than the IFilter, but not by a huge amount.
- Running remotely or not doesn't seem to make a very big difference to the overall time.
- This wasn't CPU time but actual elapsed time - though not much else was going on with the machine at the time.
- This is a dual-core machine, but I think that all the processes were probably running on the same CPU (see issue 144).
- Watching Task Manager in the cases where the filters were running in separate processes suggests that most of the CPU is actually in the main process, not the filtering process, so if we're looking for speedups the filter may not really be the main thing to worry about - some profiling would be useful, I guess (see the sketch below).
Original comment by paul.x.r...@googlemail.com on 23 Nov 2007 at 1:37
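[Editor's note: as a starting point for that profiling, a minimal sketch using Python's standard cProfile module. index_collection is a hypothetical stand-in for Flax's real indexing entry point.]

    import cProfile
    import pstats

    def index_collection():
        # Hypothetical stand-in: call the real indexing code here.
        pass

    profiler = cProfile.Profile()
    profiler.enable()
    index_collection()
    profiler.disable()

    # Print the 20 functions with the largest cumulative time, which
    # should show whether the main process or the filters dominate.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)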
BTW - those timings were for about 55,000 documents, so we're a bit under 1000 docs/min.
Original comment by paul.x.r...@googlemail.com on 23 Nov 2007 at 2:35
The code in svn now does what this issue asks for. Are we happy with the status quo, at least for 1.0, or does something more/different need to be done?
Original comment by paul.x.r...@googlemail.com on 28 Nov 2007 at 9:10
That's fine for now. Marking as fixed.
Original comment by boulton.rj@gmail.com on 28 Nov 2007 at 9:21
With a bit of refactoring and pushing the filtering into separate processes running simultaneously, I've done the same collection in ~35 minutes. It does appear that the bottleneck is the work Xapian does, which can't run on separate CPUs for a single collection at the moment. We could partition document collections into a number of Xapian databases; since we can search across multiple databases, it should be possible to make that more or less transparent to users.
Another possibility is to generate separate databases and then merge them; I'm not sure whether that would be any faster than just making one database for all the collections in the first place.
Original comment by paul.x.r...@googlemail.com on 1 Dec 2007 at 12:07
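[Editor's note: a minimal sketch of the transparent multi-database search being suggested, using the Xapian Python bindings. The partition paths and query term are hypothetical.]

    import xapian

    # Open each partition and stitch them together; to callers the
    # combined object behaves like a single database.
    combined = xapian.Database()
    for path in ("collection-part0", "collection-part1", "collection-part2"):
        combined.add_database(xapian.Database(path))

    enquire = xapian.Enquire(combined)
    enquire.set_query(xapian.Query("example"))
    for match in enquire.get_mset(0, 10):  # top 10 across all partitions
        print(match.docid, match.document.get_data())

For the merge option, Xapian ships a xapian-compact tool that can combine several databases into one on disk, though as noted above it is not obvious that would beat building a single database in the first place.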
Original issue reported on code.google.com by charliej...@gmail.com on 22 Nov 2007 at 5:15