Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
GNU General Public License v3.0

Empty predictions with the default model #182

Closed by Witiko 4 years ago

Witiko commented 4 years ago

Consider the following image of the 2018 JLCL Calamari paper abstract:

[image: abstract]

Running calamari-predict with the pretrained antiqua model on this image produces an empty prediction:

$ curl https://user-images.githubusercontent.com/603082/88580938-b3272380-d04c-11ea-9c6f-cd677fb5cb0d.png > abstract.png
$ git clone https://github.com/Calamari-OCR/calamari_models
$ mkvirtualenv -p `which python3` calamari
$ workon calamari
(calamari) $ pip install calamari_ocr
(calamari) $ calamari-predict --checkpoint calamari_models/antiqua_modern/4.ckpt abstract.png --output_dir .
...
Prediction: 100%|███████████████| 1/1 [00:02<00:00,  2.50s/it]
Prediction of 1 models took 2.5678038597106934s
Average sentence confidence: 0.00%
All files written
$ wc -l abstract.pred.txt
0 abstract.pred.txt

Is this expected behavior, or is there some mistake on my part? If not, could this be because the default models were trained on an unrepresentative dataset?

amitdo commented 4 years ago

See issue #175.

amitdo commented 4 years ago

Prediction of a page

Currently only OCR on lines is supported. Modules to segment pages into lines will be available soon. In the meantime you should use the scripts provided by OCRopus.

@ChWick, about the 'soon' part, it was stated in commit e3e6099a7045, in April 2018. Maybe you should remove it.
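
For anyone landing here with the same empty prediction: since only line images are supported, a page such as the abstract above has to be cut into single text lines before it is passed to calamari-predict. The following is a minimal sketch of the idea using a horizontal projection profile. It is not what the OCRopus scripts do (their segmentation is considerably more robust), and it assumes a binarized, deskewed, single-column page given as rows of 0 (background) and 1 (ink). The names `find_line_bands` and `crop_lines` are hypothetical and not part of Calamari or OCRopus.

```python
from typing import List, Sequence, Tuple


def find_line_bands(
    page: Sequence[Sequence[int]], min_ink: int = 1, min_height: int = 2
) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    `page` is a binarized image as rows of 0 (background) and 1 (ink).
    Bands shorter than `min_height` rows are treated as specks and dropped.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in page]

    bands = []
    top = None  # start row of the band currently being scanned, if any
    for y, ink in enumerate(profile):
        if ink >= min_ink and top is None:
            top = y  # entering a run of inked rows
        elif ink < min_ink and top is not None:
            if y - top >= min_height:
                bands.append((top, y))
            top = None  # back in the gap between lines
    # Close a band that runs all the way to the bottom edge.
    if top is not None and len(profile) - top >= min_height:
        bands.append((top, len(profile)))
    return bands


def crop_lines(
    page: Sequence[Sequence[int]], bands: Sequence[Tuple[int, int]]
) -> List[List[List[int]]]:
    """Cut the page into one sub-image (list of rows) per band."""
    return [[list(row) for row in page[top:bottom]] for top, bottom in bands]
```

Each cropped band would then be saved as its own image and passed to calamari-predict in place of the full page. On real scans, touching lines, skew, and multi-column layouts break this simple approach, which is why the OCRopus scripts remain the safer choice.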