Closed jvdzwaan closed 5 years ago
Set up a framework/pipeline for validating TICCL (plus extra metadata) given different configurations of input data and parameters. This serves as input for the performance measurements.
Define configurations:
- What is the effect of configuration X on performance in OCR post-correction tasks?
- How does configuration X perform on lexical assessment tasks ("Is this a word?")?
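The two evaluation tasks above could be scored roughly along these lines. This is a minimal sketch: the function names and the token-aligned data format are assumptions for illustration, not the actual TICCLAT pipeline API.

```python
# Sketch of scoring the two evaluation tasks. All names and the
# token-aligned input format are hypothetical, not the real pipeline.

def correction_accuracy(corrected_tokens, gold_tokens):
    """OCR post-correction: fraction of tokens matching the gold
    standard after correction (assumes token-aligned lists)."""
    assert len(corrected_tokens) == len(gold_tokens)
    correct = sum(c == g for c, g in zip(corrected_tokens, gold_tokens))
    return correct / len(gold_tokens)

def lexical_assessment(predicted_is_word, gold_is_word):
    """Lexical assessment ("Is this a word?"): precision and recall
    over boolean word/non-word judgements."""
    tp = sum(p and g for p, g in zip(predicted_is_word, gold_is_word))
    fp = sum(p and not g for p, g in zip(predicted_is_word, gold_is_word))
    fn = sum(g and not p for p, g in zip(predicted_is_word, gold_is_word))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, `correction_accuracy(["the", "cat"], ["the", "dog"])` gives 0.5.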
Once we have the data and the validation pipeline, we can run the experiments!
https://github.com/TICCLAT/explore/blob/master/notebooks/folia2lexicon.ipynb
The results for the OCR evaluation part of the sprint can be found at this address: https://github.com/TICCLAT/evaluation_pipeline/tree/master/evaluation
@jvdzwaan were all the goals achieved? Can we consider it done and close this issue?
It's a work in progress and we'll continue the work in the project. The issue can be closed.
The TICCLAT project is about extending TICCL, software that does OCR post-correction, spelling correction, and/or word normalization based on the word forms it sees in a corpus. In this project we want to run a number of experiments to evaluate the performance of different configurations of TICCL. The sprint will be about setting up the pipeline/infrastructure to run these experiments. We will focus on the task of OCR post-correction, for which we have a data set available. Hopefully, we'll be able to run the baseline experiment by the end of the sprint.
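The experiment setup described here boils down to sweeping a parameter grid and scoring each configuration on the same corpus. A minimal sketch of that loop, where `run_ticcl`, `evaluate`, and the parameter names are placeholders (the real experiments would invoke the TICCL tools with their actual options):

```python
# Sketch of the planned experiment loop: evaluate every TICCL
# configuration on the same input and rank the results.
# run_ticcl and evaluate are hypothetical stand-ins, not real APIs.
from itertools import product

def run_experiments(run_ticcl, evaluate, param_grid, corpus, gold):
    """run_ticcl(corpus, **params) -> corrected text;
    evaluate(corrected, gold) -> score. Both are placeholders."""
    results = []
    keys = sorted(param_grid)
    # Cartesian product over all parameter values in the grid.
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        corrected = run_ticcl(corpus, **params)
        results.append((params, evaluate(corrected, gold)))
    # Best-scoring configuration first.
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The baseline experiment would then be one entry in `param_grid`, and adding a configuration means adding a value to the grid rather than writing a new script.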
Together with @egpbos