LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

running piccl to correct words in a simple wordlist #58

Closed Irishx closed 4 years ago

Irishx commented 4 years ago

This is mostly a remark on how you can use ticcl.nf to correct a lexicon list of words. PICCL is intended for spelling correction at document level. However, it can also be applied to a wordlist and gives reasonable results.

The input is a list of words, one word per line, and you run ticcl.nf --inputtype text. The official output of ticcl.nf is an XML file, .ticcl.folia.xml, but ticcl.nf also produces several intermediate files. For a quick look at the corrections without the superfluous XML, you can use the information in the tsv.clean.ldcalc.ranked file, which is a list of orig-word, edit-distance-corrected-word, technicalcode-word1, technicalcode-word-2, certainty of algorithm.

martinreynaert commented 4 years ago

Thank you Irishx for the above elucidations!

The column contents of the output as described above are, however, not quite correct. I explain below.

PICCL is a workflow system (based on Nextflow components such as ticcl.nf). TICCL is a (digitized) text correction and normalization system consisting of several modules.

Let us take a look at some actual TICCL system output. What I present next is not TICCL-rank output, but output from the subsequent module, TICCL-chain.

This output is very similar to TICCL-rank's, except for the last column. TICCL-chain takes TICCL-rank's output as its input.

Both outputs are in fact '#' or hash-separated columns, 7 in all.

I present an extract from a *chained (the file extension for TICCL-chain output) file. This was based on a corpus of Dutch National Archives' Notarial Deeds from the 'Golden Century', Haarlem region. Handwritten Text Recognition courtesy of Transkribus.

We present six HTR (or, possibly, regional diachronic) variants corrected by TICCL to 'schilderijtjes', i.e. small paintings:

schildenijtjes#1#schilderijtjes#100000057#596286601#1#C
schildereitjes#1#schilderijtjes#100000057#15434340889#2#C
schildereytjes#2#schilderijtjes#100000057#1630347719#2#C
schildergties#1#schilderijtjes#100000057#35629811471#3#C
schildergtjes#8#schilderijtjes#100000057#23296010607#2#C
schilderij_tjes#1#schilderijtjes#100000057#11040808032#1#C

Column 1: word variant
Column 2: observed corpus frequency (corpus here was 100K pages of HTR)
Column 3: best-first ranked TICCL correction candidate (CC)
Column 4: TICCL 'artificial' frequency (here: 100,000,000) augmented with the observed corpus frequency (57)
Column 5: Anagram value (AV) difference between the variant and its CC. Denotes a particular character confusion between variant and CC.
Column 6: Levenshtein Distance (LD)
Column 7: C for 'chained'
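As a rough illustration of the column layout above, one such line could be read into named fields as follows. This is a minimal sketch and not part of TICCL: the names TicclRecord, parse_record and observed_candidate_frequency are made up for this example, and the constant of 100,000,000 is the artificial frequency mentioned for this particular corpus.

```python
from typing import NamedTuple

# Assumption: the artificial frequency used for this corpus (100,000,000, see Column 4).
ARTIFICIAL_FREQUENCY = 100_000_000


class TicclRecord(NamedTuple):
    """Hypothetical container for one '#'-separated line (names are illustrative)."""
    variant: str              # Column 1
    variant_frequency: int    # Column 2
    candidate: str            # Column 3
    candidate_frequency: int  # Column 4, as stored in the file
    av_difference: int        # Column 5
    levenshtein: int          # Column 6
    last_column: str          # Column 7 ('C' in *chained files)


def parse_record(line: str) -> TicclRecord:
    """Split one line into its seven columns and convert the numeric ones."""
    fields = line.rstrip("\n").split("#")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    variant, freq, candidate, cand_freq, av, ld, last = fields
    return TicclRecord(variant, int(freq), candidate, int(cand_freq), int(av), int(ld), last)


def observed_candidate_frequency(record: TicclRecord) -> int:
    """Remove the artificial frequency from Column 4 when it appears to be included."""
    if record.candidate_frequency >= ARTIFICIAL_FREQUENCY:
        return record.candidate_frequency - ARTIFICIAL_FREQUENCY
    return record.candidate_frequency


record = parse_record("schildergtjes#8#schilderijtjes#100000057#23296010607#2#C")
print(record.variant, "->", record.candidate, observed_candidate_frequency(record))
```

For the sample line this separates the 57 observed occurrences of 'schilderijtjes' from the artificial 100,000,000.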

Note: underscores in either Column 1 or 3 denote spaces: in the HTR of these Notarial Deeds the last example given above is a bigram, i.e. a split word. We effected word bigram (and trigram) correction on this corpus.

Main differences with TICCL-rank output:
a/ TICCL is usually set to work with an LD limit of two edits. TICCL-rank cannot have higher values in Column 6 than the actual limit that was set. TICCL-chain collects variants and can go way higher, here: 3.
b/ TICCL-rank in Column 7 gives a kind of confidence measure derived from TICCL's ranking features used in TICCL-LDcalc and TICCL-rank. This is lost during chaining since the original word pairs are often discarded.
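To make point a/ concrete, the sketch below recomputes the edit distance for the six pairs from the extract and lists those beyond a limit of two. It uses a textbook Levenshtein implementation written for this example; it is not TICCL's own code, and LD_LIMIT and the variable names are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# The six *chained lines quoted in the extract.
CHAINED_LINES = [
    "schildenijtjes#1#schilderijtjes#100000057#596286601#1#C",
    "schildereitjes#1#schilderijtjes#100000057#15434340889#2#C",
    "schildereytjes#2#schilderijtjes#100000057#1630347719#2#C",
    "schildergties#1#schilderijtjes#100000057#35629811471#3#C",
    "schildergtjes#8#schilderijtjes#100000057#23296010607#2#C",
    "schilderij_tjes#1#schilderijtjes#100000057#11040808032#1#C",
]

LD_LIMIT = 2  # assumption: the usual limit of two edits mentioned above

beyond_limit = []
recomputed = []
for line in CHAINED_LINES:
    fields = line.split("#")
    variant, candidate, reported_ld = fields[0], fields[2], int(fields[5])
    recomputed.append((levenshtein(variant, candidate), reported_ld))
    if reported_ld > LD_LIMIT:
        beyond_limit.append(variant)

print(beyond_limit)
```

Only 'schildergties' exceeds the limit here, which is the kind of pair that chaining can produce but TICCL-rank alone, with a limit of two, would not.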

Purpose of some columns:

Column 4: The artificial frequency is (or can be) assigned by TICCL to word forms of which one is certain or confident they are (or were at some point in time) 'correct' or 'canonical' (whatever your definition of both). In this work we assigned it to all word forms and names we had gathered for Dutch for which we were confident they had at some time, by some instance, been 'humanly-attested'. For more about this, see our work on 'TICCLAT'.

Column 5: TICCL is based on what we call 'numerical anagram hashing', more prosaically: counting with words. Each combination of a bag of characters ultimately is expressed as a single large numerical value. The numerical difference between the anagram values of two words denotes a specific difference in particular characters between them, what we call a 'character confusion'. In the second-to-last example above the character confusion is that the HTR recognized the character bigram 'ij' as a single 'g'. The observed corpus frequency for the word variant is quite high: 8. So, one might wonder whether this happened a lot in this particular digitization batch. To find out, one might 'grep', i.e. search for, all occurrences of the AV '23296010607' in the full *chained list in order to obtain the stats on this. (The answer is that this substitution occurred quite a lot in this corpus and that the top three (at least) have elevated corpus frequencies. Extract:

bladzgde#67#bladzijde#100004393#23296010607#2#C (CC = page)
vrgwaring#39#vrijwaring#100002135#23296010607#2#C (CC = exemption)
kwgting#15#kwijting#100001360#23296010607#2#C (CC = acquittance)
)
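The idea behind Column 5 can be sketched in a few lines. Note the assumptions: the per-character value used here (the code point raised to the fifth power) is chosen only for illustration, so the resulting numbers will not match the AVs in the extract, and toy_anagram_value and records_with_av are hypothetical helper names, not TICCL functions. What the sketch shows is the property described above: with an additive hash over a bag of characters, every 'g' for 'ij' confusion yields the same difference, and that shared value can then be used to collect all affected pairs, much like the 'grep' suggested above.

```python
def toy_anagram_value(word: str) -> int:
    """Toy additive hash over a bag of characters (illustrative values only)."""
    return sum(ord(char) ** 5 for char in word)


# Four variant/candidate pairs from the extracts above, all 'g' for 'ij'.
PAIRS = [
    ("bladzgde", "bladzijde"),
    ("vrgwaring", "vrijwaring"),
    ("kwgting", "kwijting"),
    ("schildergtjes", "schilderijtjes"),
]
# Characters shared by both words cancel out, so one difference value remains.
toy_differences = {toy_anagram_value(cc) - toy_anagram_value(v) for v, cc in PAIRS}

# Lines quoted in the thread (glosses removed), plus one with a different AV.
CHAINED_LINES = [
    "schildergtjes#8#schilderijtjes#100000057#23296010607#2#C",
    "schildenijtjes#1#schilderijtjes#100000057#596286601#1#C",
    "bladzgde#67#bladzijde#100004393#23296010607#2#C",
    "vrgwaring#39#vrijwaring#100002135#23296010607#2#C",
    "kwgting#15#kwijting#100001360#23296010607#2#C",
]


def records_with_av(lines, av):
    """Return (variant, frequency, candidate) for lines whose Column 5 equals av,
    most frequent variant first."""
    hits = []
    for line in lines:
        fields = line.strip().split("#")
        if len(fields) == 7 and fields[4] == av:
            hits.append((fields[0], int(fields[1]), fields[2]))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


hits = records_with_av(CHAINED_LINES, "23296010607")
for variant, frequency, candidate in hits:
    print(variant, frequency, candidate)
```

Comparing Column 5 exactly, rather than searching the whole line for the digits, avoids accidental matches in other columns.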

More info on TICCL's modules is to be found on https://github.com/LanguageMachines/ticcltools, as well as a diagrammatic overview of their interactions.