CIIR / Proteus

Million Book Project

Handling Zero Length Documents #102

Closed mzarozinski closed 7 years ago

mzarozinski commented 8 years ago

Many books have their first few pages completely blank (e.g., https://archive.org/stream/terrestrialmagn00survgoog#page/n0/mode/2up).

When the pages of such a book are indexed with a field such as publication date and we search only on the date field, we often get "no results found". This happens because the top K results all have a length of zero, and RankedDocumentModel skips them.

Should we index zero-length documents at all? If so, we'll need to ignore them during scoring.

Note that some zero-length documents may fall in the middle of a book and contain an image that may be valuable to the user.
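
To make the failure mode concrete, here is a minimal Python sketch. It is not Galago or Proteus code: the `Doc` type, the two functions, the page names, and the scores are all hypothetical, and every page is assumed to get the same score for a field-only query so that ties keep page order. It contrasts truncating to the top K before dropping zero-length pages with dropping them before truncation.

```python
from typing import List, NamedTuple


class Doc(NamedTuple):
    """Hypothetical stand-in for an indexed page (not a Galago class)."""
    name: str
    length: int    # number of indexed terms on the page
    score: float   # made-up retrieval score


def top_k_then_filter(docs: List[Doc], k: int) -> List[Doc]:
    """Truncate to the top k first, then drop zero-length pages.

    If all k survivors are blank, the result list is empty.
    """
    ranked = sorted(docs, key=lambda d: d.score, reverse=True)[:k]
    return [d for d in ranked if d.length > 0]


def filter_then_top_k(docs: List[Doc], k: int) -> List[Doc]:
    """Drop zero-length pages during scoring, then truncate to the top k."""
    candidates = [d for d in docs if d.length > 0]
    return sorted(candidates, key=lambda d: d.score, reverse=True)[:k]


# Two blank leading pages followed by two pages with text.
# Equal scores are an assumption; Python's sort is stable, so ties keep page order.
pages = [
    Doc("book_page0", 0, 1.0),
    Doc("book_page1", 0, 1.0),
    Doc("book_page2", 120, 1.0),
    Doc("book_page3", 300, 1.0),
]

print(top_k_then_filter(pages, 2))
print([d.name for d in filter_then_top_k(pages, 2)])
```

With K = 2, the first function returns an empty list because both top-ranked pages are blank, while the second returns the two pages that contain text. The zero-length pages stay in the collection either way, so a blank page that holds an image can still be reached by other means.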

mzarozinski commented 7 years ago

This was fixed in Galago with commit: https://sourceforge.net/p/lemur/galago/ci/ac81b3debbeb5d127d52a0b07f7cf47bd89dd7c5/