flaxsearch / flaxcode

Automatically exported from code.google.com/p/flaxcode

Indexing error while indexing HTML #149

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Test run on my P4 test box, running Win2003 Server, Flax 0.9.0.840: after
indexing around 123,000 HTML documents, the following error repeated for all
subsequent files:

2007-11-23 21:26:33,279: indexing.remote: 1616: ERROR: Filtering file: D:\nclBHTML\14\34\f31.html with filter: <function html_filter at 0x00B88DB0> raised exception Expected another key with the same term name but found a different one, skipping:

Original issue reported on code.google.com by charliej...@gmail.com on 26 Nov 2007 at 9:55

GoogleCodeExporter commented 9 years ago
Also occasional errors of the form:

2007-11-24 23:48:58,575: indexing.remote: 1616: ERROR: Filtering file: d:\nclAHTML\04\23\f81.html with filter: <function html_filter at 0x00B88DB0> raised exception Couldn't read enough (EOF), skipping:

Original comment by charliej...@gmail.com on 26 Nov 2007 at 9:56

GoogleCodeExporter commented 9 years ago
This build used html_filter in process, I guess? Perhaps we're better off using
it out of process? The timings I ran suggest that it doesn't make much
difference performance-wise; see issue
http://code.google.com/p/flaxcode/issues/detail?id=142.

Original comment by paul.x.r...@googlemail.com on 26 Nov 2007 at 10:11

GoogleCodeExporter commented 9 years ago
OK, that would no doubt be safer. The first report appears to be a Xapian
problem; I'll run xapian-check (the new, working patched version) on the db and
see what it says.

Original comment by charliej...@gmail.com on 26 Nov 2007 at 11:24

GoogleCodeExporter commented 9 years ago
These are both errors from flint, i.e. from the indexing code in Xapian, not
from the filter. In fact, they're database corruption errors. Which is the
_first_ error to appear in the log file: the "Couldn't read enough" or the
"Expected another key" error? I want to know which is the cause and which the
effect, basically.
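
One way to answer the ordering question is to scan the log and record the earliest entry for each distinct exception message. The following is only a sketch: the regular expression assumes each entry sits on a single line in the layout quoted in this issue, and `first_occurrences` and `sample` are hypothetical names, not part of Flax.

```python
import re
from datetime import datetime

# Assumed layout of one log entry, inferred from the lines quoted in this issue.
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}): "
    r"indexing\.remote: \d+: ERROR: Filtering file: (?P<path>.+?) "
    r"with filter: <.*?> raised exception (?P<msg>.+), skipping:"
)


def first_occurrences(lines):
    """Return {message: (timestamp, path)} for the earliest entry of each message."""
    first = {}
    for line in lines:
        match = LOG_RE.match(line.strip())
        if not match:
            continue  # not an error entry in the assumed layout
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        msg = match.group("msg")
        if msg not in first or ts < first[msg][0]:
            first[msg] = (ts, match.group("path"))
    return first


# Two entries quoted in this issue, joined back onto single lines.
sample = [
    (r"2007-11-24 18:43:52,825: indexing.remote: 1616: ERROR: Filtering file: "
     r"d:\nclAHTML\04\10\f29.html with filter: <function html_filter at 0x00B88DB0> "
     r"raised exception Expected another key with the same term name but found a "
     r"different one, skipping:"),
    (r"2007-11-24 18:48:00,653: indexing.remote: 1616: ERROR: Filtering file: "
     r"d:\nclAHTML\04\10\f51.html with filter: <function html_filter at 0x00B88DB0> "
     r"raised exception Couldn't read enough (EOF), skipping:"),
]

# List the distinct messages, earliest first.
for msg, (ts, path) in sorted(first_occurrences(sample).items(),
                              key=lambda item: item[1][0]):
    print(ts, path, msg)
```

Sorting by the recorded timestamp gives the order in which each kind of error first appeared, which is the cause-versus-effect information asked for above.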

Original comment by boulton.rj@gmail.com on 26 Nov 2007 at 11:40

GoogleCodeExporter commented 9 years ago
The first database checks out OK with xapian-check (123,000 docs). The second
(31,000 docs) reports an error in the postlist:

postlist:
baseA blocksize=8K items=692030 lastblock=106332 revision=68 levels=2 root=9
B-tree error 40
xapian-check: btree error

Original comment by charliej...@gmail.com on 26 Nov 2007 at 11:45

GoogleCodeExporter commented 9 years ago
The 'first' is nclBHTML and the first error logged for this db is:

2007-11-23 21:26:33,279: indexing.remote: 1616: ERROR: Filtering file: D:\nclBHTML\14\34\f31.html with filter: <function html_filter at 0x00B88DB0> raised exception Expected another key with the same term name but found a different one, skipping:

About 900 files later the error changes to:

2007-11-24 01:07:05,591: indexing.remote: 1616: ERROR: Filtering file: D:\nclBHTML\14\52\f81.html with filter: <function html_filter at 0x00B88DB0> raised exception Data ran out unexpectedly when reading posting list., skipping:

The 'second' database is NclAHTML. The first error reported for this is:

2007-11-24 18:43:52,825: indexing.remote: 1616: ERROR: Filtering file: d:\nclAHTML\04\10\f29.html with filter: <function html_filter at 0x00B88DB0> raised exception Expected another key with the same term name but found a different one, skipping:

About 30 files later we get:

2007-11-24 18:48:00,653: indexing.remote: 1616: ERROR: Filtering file: d:\nclAHTML\04\10\f51.html with filter: <function html_filter at 0x00B88DB0> raised exception Couldn't read enough (EOF), skipping:

and from then on this occurs relatively often, in between recurrences of the
first error.

The fact that the second db fails a lot quicker than the first makes me wonder
if disk space is still an issue; there's plenty on the database folder, but I
wonder if Windows is running out of swap space. I'll check if this is likely.

Original comment by charliej...@gmail.com on 26 Nov 2007 at 11:50

GoogleCodeExporter commented 9 years ago
Isn't swap space fixed according to the size of pagefile.sys? If so, it might
be that it's running out of virtual memory, but that wouldn't be because disk
space is running low?

Original comment by paul.x.r...@googlemail.com on 26 Nov 2007 at 11:58

GoogleCodeExporter commented 9 years ago
It looks like swap space isn't an issue; there's a good 3GB spare on C:\ and
Windows is not set up to try to use any more than this. I've set it going again
on the db that fails faster, so we can perhaps catch it in the act.

Original comment by charliej...@gmail.com on 26 Nov 2007 at 12:01

GoogleCodeExporter commented 9 years ago
Repeating the test on 'NclA' doesn't repeat the error at 31278 documents, so it
isn't happening consistently at the same place.

Original comment by charliej...@gmail.com on 26 Nov 2007 at 3:20

GoogleCodeExporter commented 9 years ago
Managed to reproduce again after indexing 18,000 or so documents:

2007-11-26 16:51:00,905: indexing.remote: 1996: ERROR: Filtering file: d:\nclAHTML\02\81\f41.html with filter: <function html_filter at 0x00B88DB0> raised exception Expected another key with the same term name but found a different one, skipping:

Original comment by charliej...@gmail.com on 26 Nov 2007 at 5:03

GoogleCodeExporter commented 9 years ago
Charlie mentioned that this might be due to hardware problems. Marking as
Clarify; need more information.

Original comment by boulton.rj@gmail.com on 28 Nov 2007 at 8:46

GoogleCodeExporter commented 9 years ago
What exactly is this dataset? I could try repeating on my machine if it's
available for download somewhere.

If it's a hardware error then I guess we can probably forget about it. Although
possibly we should identify errors from xapian that indicate some kind of
database corruption and either give up the indexing attempt, or delete the
database and restart?
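
A minimal sketch of the "give up on corruption" idea, under stated assumptions: it recognises corruption by matching the three messages reported in this issue, whereas real code would more likely test the exception type raised by the Xapian bindings. `CORRUPTION_MARKERS`, `IndexingAborted`, `looks_like_corruption`, `index_files` and the `index_one` callable are all hypothetical names, not Flax or Xapian APIs.

```python
# Messages seen in this issue that appear to indicate a damaged database.
CORRUPTION_MARKERS = (
    "Expected another key with the same term name",
    "Couldn't read enough (EOF)",
    "Data ran out unexpectedly when reading posting list",
)


class IndexingAborted(Exception):
    """Hypothetical exception: stops a run once corruption is suspected."""


def looks_like_corruption(exc):
    """Return True if the exception text contains a known corruption message."""
    text = str(exc)
    return any(marker in text for marker in CORRUPTION_MARKERS)


def index_files(paths, index_one):
    """Index each path with index_one (a hypothetical callable).

    Ordinary per-file failures are skipped, as the current log output suggests;
    a suspected corruption error aborts the whole run instead of being logged
    again for every remaining file. Returns the list of skipped paths.
    """
    skipped = []
    for path in paths:
        try:
            index_one(path)
        except Exception as exc:
            if looks_like_corruption(exc):
                raise IndexingAborted(
                    f"suspected database corruption at {path}: {exc}"
                ) from exc
            skipped.append(path)
    return skipped
```

The caller that catches `IndexingAborted` could then decide between giving up and deleting the database to restart; that choice is left open here, as it is in the comment above.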

Original comment by paul.x.r...@googlemail.com on 1 Dec 2007 at 12:23

GoogleCodeExporter commented 9 years ago
I've run numerous indexes on Mongoose and not seen any of these errors; I'm
suspecting it's a disk or memory issue. I'll try and test the test box.

Original comment by charliej...@gmail.com on 5 Dec 2007 at 11:27

GoogleCodeExporter commented 9 years ago
The consensus seems to be that this was probably due to hardware errors on one
particular machine. I'm therefore closing it. Please reopen if it can be
reproduced on other hardware. I'm not sure what the right resolution is. I'll
use WontFix.

Original comment by paul.x.r...@googlemail.com on 7 Dec 2007 at 9:09