Cutezjz / galagosearch

Automatically exported from code.google.com/p/galagosearch
BSD 3-Clause "New" or "Revised" License

'galago make-corpus' dies if no <DOCNO> found at the beginning of a line. #32

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
When building a corpus from TRECTEXT formatted documents, if no <DOCNO> is 
found, the process dies with a NullPointerException.

TrecTextParser.waitFor can return null, and this condition needs to be handled 
in TrecTextParser.parseDocNumber.

This behavior was encountered when parsing a TRECTEXT file in which some 
lines within the <DOC>...</DOC> tags had leading whitespace.

Original issue reported on code.google.com by jel...@gmail.com on 23 Jul 2010 at 5:42

GoogleCodeExporter commented 8 years ago

The TRECTEXT format requires <DOCNO> to be at the beginning of a line.

Even so, the parser now handles the null return value: it returns a null 
document to the UniversalParser and ignores the remainder of the file.
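
For readers following along, here is a minimal sketch of the kind of null guard described above. It is not the actual galagosearch source: the class name TrecTextSketch, the helper docNumberOf, and the method bodies are assumptions made for illustration. Only the names waitFor and parseDocNumber come from this thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the parser; not the real TrecTextParser.
class TrecTextSketch {
    private final BufferedReader reader;

    TrecTextSketch(BufferedReader reader) {
        this.reader = reader;
    }

    // Returns the first line that starts with the tag, or null at end of input.
    // An indented "  <DOCNO>" line does not match, which is how null arises.
    String waitFor(String tag) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(tag)) {
                return line;
            }
        }
        return null;
    }

    // Returns the document identifier, or null if no <DOCNO> line was found.
    String parseDocNumber() throws IOException {
        String line = waitFor("<DOCNO>");
        if (line == null) {
            // The guard: without it, line.indexOf below throws NullPointerException.
            return null;
        }
        int start = "<DOCNO>".length();
        int end = line.indexOf("</DOCNO>");
        if (end < 0) {
            end = line.length();
        }
        return line.substring(start, end).trim();
    }

    // Convenience helper (assumed, for the demo): parse a docno out of a string.
    static String docNumberOf(String text) {
        try {
            return new TrecTextSketch(new BufferedReader(new StringReader(text)))
                    .parseDocNumber();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(docNumberOf("<DOC>\n<DOCNO> AP-001 </DOCNO>\n</DOC>\n"));
        System.out.println(docNumberOf("<DOC>\n  <DOCNO> AP-001 </DOCNO>\n</DOC>\n"));
    }
}
```

In this sketch the first call yields "AP-001", while the second yields null because the <DOCNO> line is indented. A caller can then treat the null as "no document" instead of crashing.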

Original comment by sjh...@gmail.com on 21 Jun 2011 at 3:48