Closed jvdzwaan closed 5 years ago
Set up a framework/pipeline for validating TICCL (plus extra metadata) given different configurations of input data and parameters. This serves as input for the performance measurements.
Define configurations:
- What is the effect of configuration X on performance in OCR post-correction tasks?
- How does configuration X perform on lexical assessment tasks ("Is this a word?")?
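The two evaluation tasks above could be scored roughly along these lines. This is a minimal sketch: the function names and the token-aligned data format are assumptions for illustration, not the actual TICCLAT pipeline API.

```python
# Sketch of scoring the two evaluation tasks. All names and the
# token-aligned input format are hypothetical, not the real pipeline.

def correction_accuracy(corrected_tokens, gold_tokens):
    """OCR post-correction: fraction of tokens matching the gold
    standard after correction (assumes token-aligned lists)."""
    assert len(corrected_tokens) == len(gold_tokens)
    correct = sum(c == g for c, g in zip(corrected_tokens, gold_tokens))
    return correct / len(gold_tokens)

def lexical_assessment(predicted_is_word, gold_is_word):
    """Lexical assessment ("Is this a word?"): precision and recall
    over boolean word/non-word judgements."""
    tp = sum(p and g for p, g in zip(predicted_is_word, gold_is_word))
    fp = sum(p and not g for p, g in zip(predicted_is_word, gold_is_word))
    fn = sum(g and not p for p, g in zip(predicted_is_word, gold_is_word))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, `correction_accuracy(["the", "cat"], ["the", "dog"])` gives 0.5.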
Once we have the data and the validation pipeline, we can run the experiments!
https://github.com/TICCLAT/explore/blob/master/notebooks/folia2lexicon.ipynb
The results for the OCR evaluation part of the sprint can be found at this address: https://github.com/TICCLAT/evaluation_pipeline/tree/master/evaluation
@jvdzwaan were all the goals achieved? Can we consider it done and close this issue?
It's a work in progress and we'll continue the work in the project. The issue can be closed.
The TICCLAT project is about extending TICCL, software that does OCR post-correction, spelling correction, and/or word normalization based on the word forms it sees in a corpus. In this project we want to run a number of experiments to evaluate the performance of different configurations of TICCL. The sprint will be about setting up the pipeline/infrastructure to run these experiments. We will focus on the task of OCR post-correction, for which we have a data set available. Hopefully, we'll be able to run the baseline experiment by the end of the sprint.
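The experiment setup described here boils down to sweeping a parameter grid and scoring each configuration on the same corpus. A minimal sketch of that loop, where `run_ticcl`, `evaluate`, and the parameter names are placeholders (the real experiments would invoke the TICCL tools with their actual options):

```python
# Sketch of the planned experiment loop: evaluate every TICCL
# configuration on the same input and rank the results.
# run_ticcl and evaluate are hypothetical stand-ins, not real APIs.
from itertools import product

def run_experiments(run_ticcl, evaluate, param_grid, corpus, gold):
    """run_ticcl(corpus, **params) -> corrected text;
    evaluate(corrected, gold) -> score. Both are placeholders."""
    results = []
    keys = sorted(param_grid)
    # Cartesian product over all parameter values in the grid.
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        corrected = run_ticcl(corpus, **params)
        results.append((params, evaluate(corrected, gold)))
    # Best-scoring configuration first.
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The baseline experiment would then be one entry in `param_grid`, and adding a configuration means adding a value to the grid rather than writing a new script.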
Together with @egpbos