NLeSC / TEAM2018

This is the repo for the 2018 TEAM sprint

TICCLAT #49

Closed: jvdzwaan closed this issue 5 years ago

jvdzwaan commented 6 years ago

The TICCLAT project is about extending TICCL, software that performs OCR post-correction, spelling correction, and/or word normalization based on the word forms it sees in the corpus. In this project we want to run a number of experiments to evaluate the performance of different TICCL configurations. The sprint will be about setting up the pipeline and infrastructure to run these experiments. We will focus on the task of OCR post-correction, for which we have a data set available. Hopefully, we'll be able to run the baseline experiment by the end of the sprint.

Together with @egpbos

egpbos commented 5 years ago

Main goal

Set up a framework/pipeline for validating TICCL (+ extra metadata) given different configurations of input data and parameters.

Data preparation

Define research questions:

Input for performance measurements.

Performance assessment pipeline

Measure performance

Once we have the data and the validation pipeline, we can run the experiments!
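To make "measure performance" a bit more concrete, here is a minimal sketch of one way OCR post-correction output could be scored at the word level. It is an illustration only, not the evaluation that was used in the sprint and not part of TICCL: the function name `score_corrections`, the `Scores` type, and the (OCR word, corrected word, gold word) triple format are all assumptions, and the words are assumed to be aligned already.

```python
from typing import Iterable, NamedTuple, Tuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_corrections(triples: Iterable[Tuple[str, str, str]]) -> Scores:
    """Score word-level corrections (hypothetical helper, not a TICCL API).

    Each triple is (ocr_word, corrected_word, gold_word), already aligned.
    """
    tp = fp = fn = 0
    for ocr, corrected, gold in triples:
        if ocr != gold:
            # The OCR word was wrong: did the system fix it?
            if corrected == gold:
                tp += 1  # error repaired
            else:
                fn += 1  # error missed or replaced by another wrong form
        elif corrected != gold:
            fp += 1  # a correct word was changed into a wrong one

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(precision, recall, f1)


# Made-up example data, for illustration only.
example = [
    ("tbe", "the", "the"),   # fixed
    ("cat", "cat", "cat"),   # untouched, already correct
    ("0n", "0n", "on"),      # missed
    ("mat", "rnat", "mat"),  # correct word broken by the system
]
scores = score_corrections(example)
print(scores)
```

Running the same function over the output of each TICCL configuration would give comparable numbers per experiment, assuming the alignment with the gold standard is handled upstream.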

Database setup: from TICCL to TICCLAT

https://github.com/TICCLAT/explore/blob/master/notebooks/folia2lexicon.ipynb

eriktks commented 5 years ago

The results for the OCR evaluation part of the sprint can be found at: https://github.com/TICCLAT/evaluation_pipeline/tree/master/evaluation

romulogoncalves commented 5 years ago

@jvdzwaan were all the goals achieved? Can we consider it done and close this issue?

jvdzwaan commented 5 years ago

It's a work in progress, and we'll continue the work within the project. The issue can be closed.