CIIR / Proteus

Million Book Project

Off by one error between OCR and page image #117

Open mzarozinski opened 8 years ago

mzarozinski commented 8 years ago

See document cu31924020438929

The actual book has, in order: page 32, a full-page image, a blank page, and page 33. In Proteus, book page number 33 is associated with the text for page 34.

https://archive.org/stream/cu31924020438929#page/n55/mode/2up
http://laguna.cs.umass.edu:2333/view.html?kind=ia-pages&action=view&id=cu31924020438929_55

mzarozinski commented 7 years ago

This appears to be an issue with the transformation from rawtei to toktei. In the DjVu and rawtei, "page" 56 (really an image offset) contains "o X". The actual page (https://archive.org/stream/cu31924020438929#page/n56/mode/1up) is an illustration, and the next page is blank.

The Phokas program pulled the contents of that "page" into the previous page, which removed offset 56. Offset 57 then follows and is (correctly) blank.

The page index was built using toktei (the list of documents for that index is on sydney at /mnt/nfs/work3/michaelz/data/caribbean-via-grep.list). Proteus expects to see "o X" for offset 56 (the illustration), but that page does not exist in the toktei, resulting in the off-by-one error.
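To make the shift concrete, here is a minimal sketch of the effect described above. It is not Phokas code: `merge_short_pages` and its `min_tokens` threshold are hypothetical stand-ins for whatever rule actually folds the "o X" page into its neighbor, and the page texts are placeholders rather than the real OCR.

```python
FIRST_OFFSET = 55  # image offset of the first page in this toy excerpt


def merge_short_pages(pages, min_tokens=3):
    """Hypothetical merge rule: fold any non-empty page with fewer than
    min_tokens tokens into the previous page. Empty pages are kept."""
    merged = []
    for text in pages:
        tokens = text.split()
        if merged and 0 < len(tokens) < min_tokens:
            merged[-1] = (merged[-1] + " " + text).strip()
        else:
            merged.append(text)
    return merged


def text_at(pages, offset):
    """Look up a page by image offset, the way an offset-keyed index would."""
    return pages[offset - FIRST_OFFSET]


# Offsets 55..59: page 32, illustration ("o X"), blank, page 33, page 34.
raw = ["text of page 32", "o X", "", "text of page 33", "text of page 34"]
tok = merge_short_pages(raw)

print(len(raw), len(tok))   # the merged list is one page shorter
print(text_at(raw, 58))     # text for book page 33
print(text_at(tok, 58))     # text for book page 34: shifted by one
```

Every offset after the merged page points one page too far ahead, which matches the symptom reported for this document.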

Attached are the rawtei and toktei files. Search for "" in the rawtei, and in the toktei to see the issue.

Ultimately the solution is to either fix Phokas or build the index using rawtei files. My experience has been that building from the rawtei files is the best way to proceed.
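Whichever fix is chosen, a quick check for other affected documents could compare page counts between each rawtei/toktei pair. The sketch below assumes pages are delimited by one TEI `<pb>` element each; that is an assumption about these files, not something confirmed in this thread, so adjust the pattern to the actual markup. The helper names are hypothetical.

```python
import gzip
import re

# Assumption: each page starts with a TEI <pb> (page break) element.
PB_TAG = re.compile(r"<pb\b")


def count_page_breaks(tei_text):
    """Count page-break elements in a TEI document given as a string."""
    return len(PB_TAG.findall(tei_text))


def read_gz(path):
    """Read a gzip-compressed text file such as a .rawtei.gz attachment."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()


def pages_lost(raw_text, tok_text):
    """Number of pages present in the rawtei but missing from the toktei."""
    return count_page_breaks(raw_text) - count_page_breaks(tok_text)
```

For example, `pages_lost(read_gz("cu31924020438929.rawtei.gz"), read_gz("cu31924020438929.toktei.gz"))` returning a value above zero would flag the document as a candidate for the same offset shift.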

cu31924020438929.rawtei.gz cu31924020438929.toktei.gz