LanguageMachines / ticcltools

Tools for TICCL
GNU General Public License v3.0
14 stars 3 forks source link

Enable chaining of Ranked variants. #6

Closed kosloot closed 6 years ago

kosloot commented 6 years ago

We want to chain LD variants. care should be taken to not overstretch this.

@martinreynaert will provide more details.

martinreynaert commented 6 years ago

Chaining is something that TICCL has lacked all along. This has meant that for far too long we have not reaped a sizeable proportion of the actual results of our correction work. The here proposed new module (or enhancement of an existing module) is set to remedy that situation.

At this point in time in TICCL's development I can think of no greater possible recall booster than this. [Reminder to self: Cf. paper Cucerzan and Brill]

The idea behind chaining is simple and its implementation in TICCL should be straightforward.

TICCL searches for and links word variants within an imposed Levenshtein distance, usually set at 2. Due to the OCR-process, variants for a correct word form often fall outside this limit. In so far as TICCL does not restrict correction candidates (CCs) to the list of validated word forms, a variant outside the LD-limit may very will be linked up to another word variant within the LD-limit in TICCL's currently final output file *ranked.

So we may have the variant 'Anisterdam' which in the ranked-file should be correctly linked up to the correct name: 'Amsterdam'. We may next have 'Aiiisterdam', which is outside of the LD-limit to the correct form, but may well be linked in *ranked to the variant 'Anisterdam'.

For chaining we therefore start off with the ranked-file. For now this is limited to the list of unigram corrections only. This list should then be sorted descendingly on the frequencies of the CCs in column 4. We will then find at the top of the list, the very highest frequency word types in the corpus. It is also these for which we will have found the most variants.

For chaining to work, we cannot bother with more than the best-first ranked CC in the ranked-list. So this will be applied only to lists containing only the single best-first ranked CCs.

Starting with the top most CC we descend the list and for each variant reported, we see if this variant figures as the CC for yet other variants. We then forget the 'middle man' and report the new variants to be variants for the CC we started with. An example is in order, it seems.

We start off with 'Amsterdam' as the CC. We here limit the output to the top 5. The implementation of course needs to do do this exhaustively.

reynaert@red:/reddata/POLMASH$ grep '#Amsterdam#' PM-SGD.FoliaLangStats.20161116.wordfreqlist.nld.tsv.clean.NoColums1-2-9-11-12.rank.NoUnderscores.ranked |head -n 5 A(msterdam#1#Amsterdam#100133684#1#0.996678 A-nsterdam#1#Amsterdam#100133684#2#0.969005 A.msterdam#14#Amsterdam#100133684#1#0.977556 Ainsterdam#26#Amsterdam#100133684#2#0.940299 Ainsterdam#26#Amsterdam#100133684#2#0.940299

For demo purposes here, we next take variant 'Ainsterdam' with LD 2 as the next CC to search for:

reynaert@red:/reddata/POLMASH$ grep '#Ainsterdam#' PM-SGD.FoliaLangStats.20161116.wordfreqlist.nld.tsv.clean.NoColums1-2-9-11-12.rank.NoUnderscores.ranked |head -n 5 A'nsterdam#2#Ainsterdam#26#1#0.844262 A'nsterdamJ#1#Ainsterdam#26#2#0.5 A.nisterdam#1#Ainsterdam#26#2#0.856115 A.nsterdam#1#Ainsterdam#26#1#0.938144 A.nuterdam#1#Ainsterdam#26#2#0.746667

We again retrieve a sizeable list of variants. We pick the last to continue.

reynaert@red:/reddata/POLMASH$ grep '#A.nuterdam#' PM-SGD.FoliaLangStats.20161116.wordfreqlist.nld.tsv.clean.NoColums1-2-9-11-12.rank.NoUnderscores.ranked |head -n 5 Anulerdam#1#A.nuterdam#1#2#0.867403

We now only retrieve 1 more variant. And we continue.

reynaert@red:/reddata/POLMASH$ grep '#Anulerdam#' PM-SGD.FoliaLangStats.20161116.wordfreqlist.nld.tsv.clean.NoColums1-2-9-11-12.rank.NoUnderscores.ranked |head -n 5 Am3lerdam#1#Anulerdam#1#2#0.890244

We retrieve one more variant and continue:

reynaert@red:/reddata/POLMASH$ grep '#Am3lerdam#' PM-SGD.FoliaLangStats.20161116.wordfreqlist.nld.tsv.clean.NoColums1-2-9-11-12.rank.NoUnderscores.ranked |head -n 5 reynaert@red:/reddata/POLMASH$

This stops the run for this single 'trail'.

The implementation of course needs to follow all trails represented by all the variants retrieved. In our prototype we collected these in a hash, in order to remove the inevitable duplicates.

Each trail of necessity peters out. There is no other stopping mechanism required.

So, all the variants retrieved for all these variants of 'Amsterdam' will be rewritten and reported to be variants for this 'top' CC, here: 'Amsterdam'.

In doing this, we will lose the 'confidence scores'. These may be replaced by a capital 'C', meaning: 'Chained'.

It is advisable to recalculate the LDs as reported in column 5. This will allow to quickly check in the rewritten output for the highest LDs gathered by this method. Of course, mishaps in this may happen and they will be most obvious at these highest LDs. In the SGD correction we have seen pairs reported correctly with an LD of 8, however.

We show the top LDs retrieved for 'Amsterdam' from the Political Mashup version of the SGD by our prototype:

reynaert@red:/reddata/POLMASH$ grep '#Amsterdam#' KEPT.50.LINKtestFRQ37.sortedK4.rewritten.CountTokensCorrected6.PostManual.sortedk4.NoDoubleUnderscores.sortedk2.CatManualCorrectedHead.Stowns.3.withK.lst.ranked |sort -t '#' -k 5 -gr |head -n 5 Ainstenlain#1#Amsterdam#500133773#6#R Atn>terdain#1#Amsterdam#500133773#5#R Atnsteidain#1#Amsterdam#500133773#5#R Arastordara#1#Amsterdam#500133773#5#R Anisteidani#2#Amsterdam#500133773#5#R

In all 'Amsterdam' yielded 461 variants.

reynaert@red:/reddata/POLMASH$ grep '#Amsterdam#' KEPT.50.LINKtestFRQ37.sortedK4.rewritten.CountTokensCorrected6.PostManual.sortedk4.NoDoubleUnderscores.sortedk2.CatManualCorrectedHead.Stowns.3.withK.lst.ranked |sort -t '#' -k 5 -gr |wc

461 461 16947

These also show a Zipf-like distribution in their frequencies (column 2):

reynaert@red:/reddata/POLMASH$ grep '#Amsterdam#' KEPT.50.LINKtestFRQ37.sortedK4.rewritten.CountTokensCorrected6.PostManual.sortedk4.NoDoubleUnderscores.sortedk2.CatManualCorrectedHead.Stowns.3.withK.lst.ranked |sort -t '#' -k 2 -gr |head -n 10 Am_sterdam#1303#Amsterdam#500133773#1#R Amsterda#932#Amsterdam#100000050#1#R Amsterda_m#930#Amsterdam#500133773#1#R Arasterdam#330#Amsterdam#500133773#2#R Oesterdam#87#Amsterdam#500133773#2#R Amstordam#87#Amsterdam#500133773#1#R Amsteidam#53#Amsterdam#500133773#1#R Anisterdam#31#Amsterdam#500133773#2#R Amsterdani#31#Amsterdam#500133773#2#R Ainsterdam#26#Amsterdam#500133773#2#R

One name eventually leads to thousands of OCR post-corrections...

martinreynaert commented 6 years ago

It would be good if for testing this, we could both have access to the same version of your DPO35 test suite, Kobus. On /reddata I suggest. That would include your alphabet and character confusion lists and lexicon.

The first thing required is a single-best only * ranked-list.

martinreynaert commented 6 years ago

Above I wrote: 'the ranked-file. For now this is limited to the list of unigram corrections only. '

This is correct. However, this does not mean that there may not be bigrams (or even trigrams) corrected to unigrams in the list, as is shown by these examples (repeated from above):

Am_sterdam#1303#Amsterdam#500133773#1#R Amsterda#932#Amsterdam#100000050#1#R Amsterda_m#930#Amsterdam#500133773#1#R

martinreynaert commented 6 years ago

So, I have this chained DPO35 best-first ranked here:

reynaert@red:/reddata/DPO35$ sort -t '#' -k 3 /reddata/DPO35/DPO35.chained.KEPT.lst | more

The book having far less highly frequent correct word forms and hence variants than a large corpus such as SGD, the picture is less clear. There are some nice clusters, though. I above sorted the output list on the CC column: #3.

Conftantinopole#5#Constantinopole#100000000#1#R Conflantinopole#2#Constantinopole#100000000#2#R ConJiantinopole#1#Constantinopole#100000000#2#R Conjlantinopole#2#Constantinopole#100000000#2#R Conflantinopele#1#Constantinopole#100000000#3#R Conflan■tinopole#1#Constantinopole#100000000#3#R Conjlantinofole#1#Constantinopole#100000000#3#R Conjlantinopele#1#Constantinopole#100000000#3#R

Note that the actual CC does not appear in the input...

reynaert@red:/reddata/DPO35$ sort -t '#' -k 3 -gr /reddata/DPO35/DPO35.chained.KEPT.lst |grep '#inqui' Jnquifltie#1#inquisitie#100000000#3#R Inquijitie#1#inquisitie#100000000#1#R Inquifitie#3#inquisitie#100000000#1#R

reynaert@red:/reddata/DPO35$ sort -t '#' -k 3 -gr /reddata/DPO35/DPO35.chained.KEPT.lst |grep '#protestanten#' Ptoteflanten#1#protestanten#100000000#3#R Protejlanten#3#protestanten#100000000#2#R Proteflanten#3#protestanten#100000000#2#R

Every LD 3 correction is definitely one that was not in my recall results in my LREC-2014 paper!

Definitely more gain than loss in the top ten highest LD 'corrections' too:

reynaert@red:/reddata/DPO35$ sort -t '#' -k 5 -gr /reddata/DPO35/DPO35.chained.KEPT.lst |head Nemeifche#1#ROMEINSCHE#100000001#4#R Karihageren#1#katharen#100000000#4#R Jmjleidam#1#AMSTERDAM#100000001#4#R Gefchiedciiis#1#geschiedenis#100000000#4#R voörnaamfle#1#voornaamste#100000000#3#R voomaamfte#1#voornaamste#100000000#3#R Terfijche#1#Persische#100000000#3#R Ptoteflanten#1#protestanten#100000000#3#R ovetleedea#1#overleefden#100000000#3#R opmerkdyk#1#opmerkelik#100000000#3#R

kosloot commented 6 years ago

Ok... Voordat ik hier (veel) tijd in steek: Als er een goed, snel werkend script is om die chaining te doen, want is dan de noodzaak om hier een nieuw programma voor te maken? Chaining inbouwen in TICCL-rank is geen goed idee. (de resultaten van ranking komen asynchroon uit allerlei threads en worden niet gecached/hashed op 1-ste of 2-de woord. bijvoorbeeld) Een los script is dan toch prima?

martinreynaert commented 6 years ago

Zowat de helft van de resultaten in mijn LREC2014 paper en dus ook nog van mijn LREC2016 paper waren nog het resultaat van Perl scripts van mij. Dat heeft je/ons toen niet belet die alsnog in professionele C++ code om te bouwen. Nu zou je voorstellen alsnog een Perl module in de verder volledige C++ pipeline in te lassen?

Los daarvan denk ik dat het mogelijk moet zijn ook hier meerdere threads in te zetten om verschillende 'chains' te volgen. Dat moet winst opleveren.

Tenslotte is het niet dat nu resultaten van ranking niet op het eind van de rit gehashed worden, dat dat geen optie zou zijn richting dan verdere 'reranking' of chaining.

kosloot commented 6 years ago

goede argumenten VOOR zijn idd meer threads kunnen gebruiken of snelheidswinst. (indexer in Perl was "vrij traag") Daarmee is niet gezegd dat voor simpele klusjes geen andere oplossing dan C++ zou kunnen zijn. Maar de conclusie is dat het sneller moet. Dan ga ik aan de slag. Ik ben er uiteraard al over aan het nadenken :)

dat hashen slaat op de huidige TICCL-rank, die kan zo snel zijn omdat ie parallel werkt, en de onafhankelijke threads resultaten wegschrijven naar een file. Als je de resultaten gaat opslaan in een hash voor verdere bewerking, dan maakt dat de threads meer afhankelijk, dus trager. EN kost een bak geheugen. Dat laatste is natuurlijk wel uitstel van executie als je de rank file weer moet inlezen en hashen om te chainen. Dus misschien is het op den duur toch wel beter om rank dan maar veel groter te laten groeien en alleen op grote servers te kunen draaien. We zullen zien.

martinreynaert commented 6 years ago

Hi Ko,

That's the spirit! Thanks!

Het is sowieso goed om de huidige output van rank sowieso ook te bewaren.

Martin

kosloot commented 6 years ago

Ik heb een eerste werkende versie nu: Even vergelijken met output van @martinreynaert

Martin: Conftantinopole#5#Constantinopole#100000000#1#R Conflantinopole#2#Constantinopole#100000000#2#R ConJiantinopole#1#Constantinopole#100000000#2#R Conjlantinopole#2#Constantinopole#100000000#2#R Conflantinopele#1#Constantinopole#100000000#3#R Conflan■tinopole#1#Constantinopole#100000000#3#R Conjlantinofole#1#Constantinopole#100000000#3#R Conjlantinopele#1#Constantinopole#100000000#3#R

Ik: Conftantinopole#5#Constantinopole#100000000#1#C Conflantinopole#2#Constantinopole#100000000#2#C ConJiantinopole#1#Constantinopole#100000000#2#C Conjlantinopole#2#Constantinopole#100000000#2#C Conflantinopele#1#Constantinopole#100000000#3#C Conflan■tinopole#1#Constantinopole#100000000#3#C Conjlantinofole#1#Constantinopole#100000000#3#C Conjlantinopele#1#Constantinopole#100000000#3#C Konftantinopole#1#Constantinopole#100000000#2#C

Bijna hetzelfde, wel een extra entry voor Konftantinopole en C ipv R

Martin: Jnquifltie#1#inquisitie#100000000#3#R Inquijitie#1#inquisitie#100000000#1#R Inquifitie#3#inquisitie#100000000#1#R

Ik: Jnquifltie#1#inquisitie#100000000#3#C Inquijitie#1#inquisitie#100000000#2#C Inquifitie#3#inquisitie#100000000#2#C

Verschil zit heer in de LD, maar ik denk dat die van mij goed is...

Martin: Ptoteflanten#1#protestanten#100000000#3#R Protejlanten#3#protestanten#100000000#2#R Proteflanten#3#protestanten#100000000#2#R

Ik: Ptoteflanten#1#protestanten#100000000#4#C Protejlanten#3#protestanten#100000000#3#C Proteflanten#3#protestanten#100000000#3#C

Ook hier lijkt me de LD van @martinreynaert martin twijfelachtig.

kosloot commented 6 years ago

LD verschillen komen door niet meenemen van capitalisatie. Dat moet een optie worden voor chaining.

martinreynaert commented 6 years ago

Ko, er gaat iets fout met --caseless:

reynaert@red:/reddata/TICCLAT/CHAIN$ [1]+ Exit 1 nohup /exp/sloot/usr/local/bin/TICCL-chain -t 60 --caseless /reddata/TICCLAT/CHAIN/KEPT.50.sortedK4.out.ranked -o KEPT.50.sortedK4.out.t60.caseless.ranked.chained > /reddata/TICCLAT/CHAIN/chain.t60.caseless.stdout 2> /reddata/TICCLAT/CHAIN/chain.t60.caseless.stderr reynaert@red:/reddata/TICCLAT/CHAIN$ reynaert@red:/reddata/TICCLAT/CHAIN$ cat /reddata/TICCLAT/CHAIN/chain.t60.caseless.stderr nohup: ignoring input option-error: option 'caseless' may not have a value. (found /reddata/TICCLAT/CHAIN/KEPT.50.sortedK4.out.ranked) usage: /exp/sloot/usr/local/bin/TICCL-chain -t --threads Number of threads to run on. If 'threads' has the value "max", the number of threads is set to a reasonable value. (which can be set with OMP_NUM_TREADS environment variable.) --caseless Calculate the Levensthein (or edit) distance ignoring case.

martinreynaert commented 6 years ago

Verslagje draaien TICCL-chain op grote ranked file (2,7M items):

reynaert@red:/reddata/TICCLAT/CHAIN$ nohup /exp/sloot/usr/local/bin/TICCL-chain -t 60 etc. [22:26] !omdat_het_kan [22:26] * f00f omdat_het_kant, ofzo. [22:27] Kobus_home, had met 1 thread gerund van [22:27] -rw-r--r-- 1 reynaert reynaert 51270194 Mar 15 17:34 chain.stderr [22:27] -rw-r--r-- 1 reynaert reynaert 91 Mar 15 19:17 chain.stdout [22:28] * Flo has quit (Client exited) [22:28] 103 minuten dus [22:31] Nu aan het afzwakken al, maar kwam tot een load average net boven de 30 en zo'n 2400% CPU. [22:40] -rw-r--r-- 1 reynaert reynaert 51270194 Mar 19 22:25 chain.stderr [22:40] -rw-r--r-- 1 reynaert reynaert 92 Mar 19 22:33 chain.stdout [22:41] Met -t 60 op Red dus nu in 8 minuten, Kobus_home [22:42] Dat is 12,88 keer sneller ;0)
kosloot commented 6 years ago

@martinreynaert Als je een fout ziet (denkt te zien) dan graag nieuw issue maken. Dit is veel te verwarrend.

MAAR: hier doe jij het fout, je runt: nohup /exp/sloot/usr/local/bin/TICCL-chain -t 60 --caseless /reddata/TICCLAT/CHAIN/KEPT.50.sortedK4.out.ranked -o KEPT.50.sortedK4.out.t60.caseless.ranked.chained

De inputfile staat hier ergens halverwege het commando. Die moet ALTIJD achteraan. Nu denkt het programma dat het een parameter bij caseless is, en dat zegt de melding ook:

option-error: option 'caseless' may not have a value. (found /reddata/TICCLAT/CHAIN/KEPT.50.sortedK4.out.ranked)

Graag de error message lezen, voor klagen!

kosloot commented 6 years ago

voor nu neem ik aan dat chaining werkt