Ochre is a toolbox for OCR post-correction. Please note that this software is experimental and very much a work in progress!
Ochre contains ready-to-use data processing workflows (based on CWL). The software also allows you to create your own (OCR post-correction related) workflows. Examples of how to create these can be found in the notebooks directory (to be able to use those, make sure you have Jupyter Notebooks installed). This directory also contains notebooks that show how results can be analyzed and visualized.
git clone git@github.com:KBNLresearch/ochre.git
cd ochre
pip install -r requirements.txt
python setup.py develop
cwltool|cwl-runner path/to/workflow.cwl <inputs>
(if you run the command without inputs, the tool will tell you which inputs are required and how to specify them). For more information on running CWL workflows, have a look at the nlppln documentation. This is especially relevant for Windows users.
The software needs the data in the following formats:
{
"ocr": ["E", "x", "a", "m", "p", "", "c"],
"gs": ["E", "x", "a", "m", "p", "l", "e"]
}
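As a rough illustration of how such an aligned record can be read, the sketch below loads the example and lists the positions where the OCR text and the gold standard differ. It assumes, based only on the example above, that an empty string marks a position where one side has no character; the helper names `to_text` and `mismatches` are hypothetical and not part of ochre.

```python
import json

# Aligned record in the format shown above: two equal-length character lists.
# Assumption: an empty string marks a gap (no character on that side).
aligned = json.loads("""
{
    "ocr": ["E", "x", "a", "m", "p", "", "c"],
    "gs": ["E", "x", "a", "m", "p", "l", "e"]
}
""")


def to_text(chars):
    """Join aligned characters back into plain text; gaps vanish when joined."""
    return "".join(chars)


def mismatches(record):
    """Return (position, ocr_char, gs_char) for every position that differs."""
    return [
        (i, ocr_char, gs_char)
        for i, (ocr_char, gs_char) in enumerate(zip(record["ocr"], record["gs"]))
        if ocr_char != gs_char
    ]


ocr_text = to_text(aligned["ocr"])
gs_text = to_text(aligned["gs"])
diffs = mismatches(aligned)

print(ocr_text)  # Exampc
print(gs_text)   # Example
print(diffs)     # [(5, '', 'l'), (6, 'c', 'e')]
```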
Corresponding files in these directories should have the same name (or at least the same prefix), for example:
├── gs
│ ├── 1.txt
│ ├── 2.txt
│ └── 3.txt
├── ocr
│ ├── 1.txt
│ ├── 2.txt
│ └── 3.txt
└── aligned
  ├── 1.json
  ├── 2.json
  └── 3.json
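Before running a workflow, it can help to check that every file has a counterpart in the other directories. The sketch below is one possible check, not part of ochre: the function name `unmatched_files` is hypothetical, and it only handles the simple case where corresponding files share the same name apart from the extension (not the looser "same prefix" case).

```python
import tempfile
from pathlib import Path


def unmatched_files(data_dir, subdirs=("gs", "ocr", "aligned")):
    """Map each sub-directory to the file stems it lacks compared to the others."""
    stems = {
        sub: {p.stem for p in (Path(data_dir) / sub).iterdir() if p.is_file()}
        for sub in subdirs
    }
    all_stems = set().union(*stems.values())
    return {
        sub: sorted(all_stems - found)
        for sub, found in stems.items()
        if all_stems - found
    }


# Demo on a temporary copy of the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub, ext in (("gs", ".txt"), ("ocr", ".txt"), ("aligned", ".json")):
        (root / sub).mkdir()
        for name in ("1", "2", "3"):
            (root / sub / (name + ext)).write_text("")

    complete = unmatched_files(root)    # nothing missing
    (root / "ocr" / "3.txt").unlink()   # simulate a missing OCR file
    incomplete = unmatched_files(root)

print(complete)    # {}
print(incomplete)  # {'ocr': ['3']}
```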
To create data in these formats, CWL workflows are available.
First run a preprocess workflow to create the gs and ocr directories containing the expected files.
Next run an align workflow to create the aligned directory.
vudnc-preprocess-pack.cwl (can be run as stand-alone; associated notebook vudnc-preprocess-workflow.ipynb)
icdar2017st-extract-data-all.cwl (cannot be run as stand-alone; regenerate with notebook ICDAR2017_shared_task_workflows.ipynb)
To create the alignments, run one of:
align-dir-pack.cwl to align all files in the gs and ocr directories
align-test-files-pack.cwl to align the test files in a data division
These workflows can be run as stand-alone; associated notebook align-workflow.ipynb.
First, you need to divide the data into a train, validation and test set:
python -m ochre.create_data_division /path/to/aligned
The result of this command is a JSON file containing lists of file names, for example:
{
"train": ["1.json", "2.json", "3.json", "4.json", "5.json", ...],
"test": ["6.json", ...],
"val": ["7.json", ...]
}
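A division like the one above could be produced along the following lines. This is only a sketch and not the implementation of ochre.create_data_division: the function name `make_division`, the 10% validation and test fractions, and the fixed random seed are all assumptions made for illustration.

```python
import json
import random


def make_division(file_names, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle file names and split them into train, test and val lists."""
    names = sorted(file_names)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_test = int(len(names) * test_fraction)
    n_val = int(len(names) * val_fraction)
    return {
        "train": names[n_test + n_val:],
        "test": names[:n_test],
        "val": names[n_test:n_test + n_val],
    }


# Hypothetical file names, mirroring the example above.
files = ["{}.json".format(i) for i in range(1, 11)]
division = make_division(files)
print(json.dumps(division, indent=2))
```

Each file name ends up in exactly one of the three lists, so the test files are never seen during training.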
lstm_synched.py
If you trained a model, you can use it to correct OCR text using the lstm_synced_correct_ocr command:
python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file
or
cwltool /path/to/ochre/cwl/lstm_synced_correct_ocr.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --txt /path/to/ocr/text/file
The command creates a text file containing the corrected text.
To generate corrected text for the test files of a dataset, do:
cwltool /path/to/ochre/cwl/post_correct_test_files.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --datadivision /path/to/data/division --in_dir /path/to/directory/with/ocr/text/files
To run it for a directory of text files, use:
cwltool /path/to/ochre/cwl/post_correct_dir.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --in_dir /path/to/directory/with/ocr/text/files
(these CWL workflows can be run as stand-alone; associated notebook post_correction_workflows.ipynb)
To calculate performance of the OCR (post-correction), the external tool ocrevalUAtion is used. More information about this tool can be found on the website and wiki.
Two workflows are available for calculating performance. The first calculates performance for all files in a directory. To use it, type:
cwltool /path/to/ochre/cwl/ocrevaluation-performance-wf-pack.cwl#main --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]
The second calculates performance for all files in the test set:
cwltool /path/to/ochre/cwl/ocrevaluation-performance-test-files-wf-pack.cwl --datadivision /path/to/datadivision.json --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]
Both of these workflows are stand-alone (packed). The corresponding Jupyter notebook is ocr-evaluation-workflow.ipynb.
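To give a feel for the kind of measure involved, the sketch below computes a character error rate as the edit distance between the OCR text and the gold standard, divided by the length of the gold standard. This is an illustration under that assumed definition, not ocrevalUAtion's implementation; the exact definitions used by the tool may differ, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(gs, ocr):
    """Edit distance between gold standard and OCR text, per gold standard character."""
    if not gs:
        raise ValueError("gold standard text must not be empty")
    return edit_distance(gs, ocr) / len(gs)


cer = character_error_rate("Example", "Exampc")
print(round(cer, 3))  # 0.286 (2 edits over 7 characters)
```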
To use the ocrevalUAtion tool in your workflows, you have to add it to the WorkflowGenerator's steps library:
wf.load(step_file='https://raw.githubusercontent.com/nlppln/ocrevaluation-docker/master/ocrevaluation.cwl')
Different types of OCR errors exist, e.g., structural vs. random mistakes. OCR post-correction methods may be suitable for fixing different types of errors. Therefore, it is useful to gain insight into what types of OCR errors occur. We chose to approach this problem on the word level. In order to be able to compare OCR errors on the word level, words in the OCR text and gold standard text need to be mapped. CWL workflows are available to do this. To create word mappings for the test files of a dataset, use:
cwltool /path/to/ochre/cwl/word-mapping-test-files.cwl --data_div /path/to/datadivision --gs_dir /path/to/directory/containing/the/gold/standard/texts --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv
To create word mappings for two directories of files, do:
cwltool /path/to/ochre/cwl/word-mapping-wf.cwl --gs_dir /path/to/directory/containing/the/gold/standard/texts/ --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv
(These workflows can be regenerated using the notebook word-mapping-workflow.ipynb.)
The result is a CSV file containing mapped words. The first column contains a word id, the second column contains the gold standard text of the word, and the third column contains its OCR text:
,gs,ocr
0,Hello,Hcllo
1,World,World
2,!,.
This CSV file can be used to analyze the errors. See notebooks/categorize errors based on word mappings.ipynb for an example.
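As a minimal starting point, the word mapping file can be read with Python's standard csv module. The sketch below uses the sample shown above as inline data and lists the words where the OCR text differs from the gold standard; the helper name `read_word_mapping` is hypothetical, and the error categorization itself is left out.

```python
import csv
import io

# Inline copy of the sample word mapping shown above.
sample = """,gs,ocr
0,Hello,Hcllo
1,World,World
2,!,.
"""


def read_word_mapping(handle):
    """Return a list of (gold standard word, OCR word) pairs."""
    reader = csv.DictReader(handle)
    return [(row["gs"], row["ocr"]) for row in reader]


pairs = read_word_mapping(io.StringIO(sample))
errors = [(gs, ocr) for gs, ocr in pairs if gs != ocr]
word_accuracy = 1 - len(errors) / len(pairs)

print(errors)                   # [('Hello', 'Hcllo'), ('!', '.')]
print(round(word_accuracy, 2))  # 0.33
```

To read a real file, pass `open("name-of-the-output-file.csv", newline="", encoding="utf-8")` instead of the `io.StringIO` object.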
We use heuristics to categorize the following types of errors (ochre/ocrerrors.py):
Jupyter notebook
Copyright (c) 2017-2018, Koninklijke Bibliotheek, Netherlands eScience Center
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.