You got me confused: the issue mentions TICCL-rank, but the text seems to suggest the problem is already in TICCL-LDcalc??
Anyway: to analyse this, I need a MINIMAL working example of the input files for LDcalc:
so a SMALL index file, hash file and clean file, preferably with just about 10 words or so demonstrating the problem. Could you please provide me with those?
OK. I attach a tar.gz containing the clean, anahash, corpusfoci and ldcalc files, as well as a file TICCL.commandlinesTESTSAMPLE.20210105.txt which contains the command lines used.
TICCL.TestSample.LDcalcRestrictionUnderscoreHyphen.20210105.tar.gz
Note that I did not use the corpusfoci file here. That file is meant to reduce the workload; without it, TICCL-indexer works exhaustively, gathering all the possible character confusion word pairs present. But seeing there is so little data here, all these modules run in just seconds. These files should amply illustrate the problem; the ldcalc file in fact unavoidably gives some more examples of the same filtering than the ones I listed above. Thanks! Looking forward to the result!
I have now also run TICCL-rank on this.
Command line: reynaert@violet:/reddata/NATAR/TESTSAMPLE$ /exp/sloot/usr/local/bin/TICCL-rank -t 1 --alph /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.lc.chars --charconf /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.ld2.charconfus -o /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK --debugfile /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANKDEBUG --artifrq 0 --clip 1 --skipcols=1,10,11,13 /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.ldcalc >/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.20210105.stdout 2>/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.20210105.stderr
Output:
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ cat NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.ranked
Hon_derd#1#honderd#110122179#11040808032#1#0.697674
hon_derd#22#honderd#110122179#11040808032#1#0.697674
Ge-arresteerd#1#gearresteerd#110002708#35723051649#1#0.860759
ge-arresteerd#173#gearresteerd#110002708#35723051649#1#0.738095
ge-arresteerde#5#gearresteerde#110000650#35723051649#1#0.932961
hon_derd_twaalf#1#honderdtwaalf#108765437#22081616064#2#1
Aan_hon_derd#2#van_honderd#98768224#871099262#2#0.52
der_hon_derd#1#de_honderd#98767370#23803623657#2#0.918367
en_hon_derd#1#Een_honderd#98766569#551932711#2#0.945833
even_hon_derd#1#Een_honderd#98766569#36978232633#2#0.694444
Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#0.938776
is_hon_derd#1#in_honderd#98766169#14260518557#2#1
Van_hon_derd#1#Aan_honderd#98766151#22952715326#2#0.52
van_hon_derd#1#Aan_honderd#98766151#22952715326#2#0.52
Het_hon_derd#2#met_honderd#98766097#11993905243#2#1
voor_hon_derd#1#door_honderd#98765638#19354815801#2#1
It should be obvious that 'voor' to 'door' and 'van' to 'Aan' confusions are counterproductive.
MRE
This is not immediately pertinent to the actual issue involved here, but it illustrates the consequences of what goes wrong due to the current filtering in TICCL-LDcalc.
I have now also run TICCL-chainclean (also with -v and -v -v), which was interesting, although I do not really understand what happens.
Command line: reynaert@violet:/reddata/NATAR/TESTSAMPLE$ /exp/sloot/usr/local/bin/TICCL-chainclean -v --lexicon /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean --artifrq 100000000 --low=6 -o /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAIN.chained >/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.20210105.stdout 2>/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.20210105.stderr
The result is that it retains only 5 of the 16 lines in the chained file. (Actually, TICCL-chain could not 'chain' any of the 16 lines in the ranked file.) The other 11 lines are written to a *deleted file.
Output:
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ cat /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN
Hon_derd#1#honderd#110122179#11040808032#1#C
Ge-arresteerd#1#gearresteerd#110002708#35723051649#1#C
ge-arresteerd#173#gearresteerd#110002708#35723051649#1#C
ge-arresteerde#5#gearresteerde#110000650#35723051649#1#C
hon_derd_twaalf#1#honderdtwaalf#108765437#22081616064#2#C
reynaert@violet:/reddata/NATAR/TESTSAMPLE$
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ cat /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.deleted
hon_derd#22#honderd#110122179#11040808032#1#D
Aan_hon_derd#2#van_honderd#98768224#871099262#2#D
der_hon_derd#1#de_honderd#98767370#23803623657#2#D
en_hon_derd#1#Een_honderd#98766569#551932711#2#D
even_hon_derd#1#Een_honderd#98766569#36978232633#2#D
Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#D
is_hon_derd#1#in_honderd#98766169#14260518557#2#D
Van_hon_derd#1#Aan_honderd#98766151#22952715326#2#D
van_hon_derd#1#Aan_honderd#98766151#22952715326#2#D
Het_hon_derd#2#met_honderd#98766097#11993905243#2#D
voor_hon_derd#1#door_honderd#98765638#19354815801#2#D
I am still trying to figure out what it actually tries to do on the basis of the *stderr. I definitely do not agree that the pair hon_derd#22#honderd#110122179#11040808032#1#D should be deleted.
I attach the stderr file for the sake of completeness. I added the extension txt to be able to actually upload it here... NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.20210105.vv.stderr.txt
So, in hopes of seeing a causal relation between unigram and bi/trigram retrieval of a pair differing only in an underscore or a hyphen, I extracted from the test sample index the line for the single-underscore confusion and the line for the single-hyphen confusion, into separate new index files. I ran these through TICCL-LDcalc with and without the values for the unigrams, i.e. with and without the anahash lines '110751596624~Honderd#honderd' and '163501191104~Gearresteerd#gearresteerd'. When these are present, the unigram pair is retrieved; the corresponding bi/trigrams are not. When they are absent, the unigram pair is necessarily not retrieved; the corresponding bi/trigrams are not retrieved either. I conclude there is no causal relation between the two (e.g. a possible filter on bi/trigrams not being retrieved after the corresponding unigram has been validated and retrieved).
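For reference, pulling out such single-confusion index lines can be done with something as simple as the sketch below. It assumes each TICCL-indexer line starts with the character-confusion value followed by '#' (as the 'examine 11040808032#...' debug lines further down suggest); the script name and its arguments are purely illustrative.

```python
# extract_confusion.py -- illustration only, not part of TICCL itself.
# Assumes each TICCL-indexer line starts with "<confusion value>#...",
# as the 'examine 11040808032#...' debug lines below suggest.
import sys

def extract_confusions(index_path, wanted_values, out_path):
    """Copy only the index lines whose confusion value is in wanted_values."""
    wanted = set(wanted_values)
    with open(index_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.split("#", 1)[0] in wanted:
                dst.write(line)

if __name__ == "__main__":
    # usage: python extract_confusion.py <index> <out> <confusion value> ...
    # e.g. 11040808032 for the single-underscore confusion in this test sample
    extract_confusions(sys.argv[1], sys.argv[3:], sys.argv[2])
```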
I have tried with 'follow='. Definitely interesting! But I do not understand the last lines, the 'ignoring' ones. I paste the lot here.
reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$ [1]+ Done nohup /exp/sloot/usr/local/bin/TICCL-LDcalc -v -v --follow=Te_Hon_derd --threads 1 --LD 2 --low=6 --high=50 --index /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index --hash /reddata/NATAR/TESTSAMPLE/ZIP/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.anahash --clean /reddata/NATAR/TESTSAMPLE/ZIP/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.NumSortDes.clean --alph /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.lc.chars --artifrq 98765432 -o /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC > /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC.20210107.stdout 2> /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC.20210107.stderr
reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$
reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$ cat /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC.20210107.stderr
nohup: ignoring input
skip hash for te (not in lexicon)
skip hash for Ten (not in lexicon)
skip hash for ten (not in lexicon)
skip hash for Ter (not in lexicon)
skip hash for ter (not in lexicon)
skip hash for Tes (not in lexicon)
skip hash for Tel (not in lexicon)
skip hash for Tent (not in lexicon)
skip hash for tent (not in lexicon)
skip hash for Test (not in lexicon)
skip hash for teng (not in lexicon)
skip hash for Tepe (not in lexicon)
skip hash for Teun (not in lexicon)
skip hash for tenen (not in lexicon)
skip hash for tenne (not in lexicon)
skip hash for Terne (not in lexicon)
skip hash for Terre (not in lexicon)
skip hash for Tente (not in lexicon)
skip hash for tente (not in lexicon)
skip hash for tense (not in lexicon)
skip hash for Terra (not in lexicon)
skip hash for Teers (not in lexicon)
skip hash for tende (not in lexicon)
skip hash for Tegen (not in lexicon)
skip hash for Tetro (not in lexicon)
skip hash for Testa (not in lexicon)
skip hash for Teris (not in lexicon)
skip hash for Teken (not in lexicon)
skip hash for tenke (not in lexicon)
skip hash for tenue (not in lexicon)
skip hash for Telle (not in lexicon)
skip hash for Tegel (not in lexicon)
skip hash for Teems (not in lexicon)
skip hash for Teije (not in lexicon)
skip hash for Teijn (not in lexicon)
skip hash for Tevel (not in lexicon)
skip hash for Temme (not in lexicon)
skip hash for Tewes (not in lexicon)
skip hash for Temps (not in lexicon)
skip hash for Teijl (not in lexicon)
skip hash for Tewis (not in lexicon)
skip hash for Texel (not in lexicon)
examine 11040808032#134510866391,145551674423,146103607134,146767401175,151008562231,151112832692,151871924973,162009968294,162586402935,163279361716,163771377856,167423850592,169644978743,172160207307,173627210967,173921875588,175737858361,176635584510,183289486567,187260460871,192702844882,206562440468,232544001280
extract parts from 134510866391,145551674423,146103607134,146767401175,151008562231,151112832692,151871924973,162009968294,162586402935,163279361716,163771377856,167423850592,169644978743,172160207307,173627210967,173921875588,175737858361,176635584510,183289486567,187260460871,192702844882,206562440468,232544001280
analyze
ngram candidates: Te_Hon_derd AND Te_Honderd
after reduction, candidates: [Hon,derd] AND [Honderd]
FOUND 1-2-3 Hon_derd Honderd
ngram candidate: 'Hon_derd~Honderd' in n-grams pair: Te_Hon_derd # Te_Honderd
stored: Hon_derd~Honderd and forget about Te_Hon_derd~Te_Honderd
analyze
ngram candidates: Te_Hon_derd AND Te_honderd
after reduction, candidates: [Hon,derd] AND [honderd]
FOUND 1-2-3 Hon_derd honderd
ngram candidate: 'Hon_derd~honderd' in n-grams pair: Te_Hon_derd # Te_honderd
stored: Hon_derd~honderd and forget about Te_Hon_derd~Te_honderd
analyze
ngram candidates: Te_Hon_derd AND te_Honderd
after reduction, candidates: [Hon,derd] AND [Honderd]
FOUND 1-2-3 Hon_derd Honderd
ngram candidate: 'Hon_derd~Honderd' in n-grams pair: Te_Hon_derd # te_Honderd
stored: Hon_derd~Honderd and forget about Te_Hon_derd~te_Honderd
analyze
ngram candidates: Te_Hon_derd AND te_honderd
after reduction, candidates: [Hon,derd] AND [honderd]
FOUND 1-2-3 Hon_derd honderd
ngram candidate: 'Hon_derd~honderd' in n-grams pair: Te_Hon_derd # te_honderd
stored: Hon_derd~honderd and forget about Te_Hon_derd~te_honderd
ignoring Hon_derd~Honderd
ignoring Hon_derd~honderd
ignoring hon_derd~Honderd
ignoring hon_derd~honderd
reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$
Basically what I see, when comparing the stderr files of the runs with and without the unigram 'honderd' listed in the index, is that the minimal solution ('hon_derd' is to be corrected as 'honderd') appears in the ldcalc output list when the unigram is present and does not appear when it is not.
However, both runs have basically done the same work and come to that same conclusion. That is like 'saying A'. And that should be said. [An aside for now: the actual count of bigrams saying that 'hon_derd' should be 'honderd' would be a valuable ranking feature. I am not clear at this very point in time whether we use this or not.]
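As an illustration of that aside only: assuming the -v -v stderr format shown above ('stored: Hon_derd~Honderd and forget about Te_Hon_derd~Te_Honderd'), such per-pair n-gram support counts could be gathered with a small sketch like this (not an existing TICCL module; the script name is just illustrative):

```python
# count_support.py -- illustrative sketch, not a TICCL module.
# Counts, per stored unigram pair, how many bi/trigram contexts led
# TICCL-LDcalc (run with -v -v) to store it, assuming stderr lines of the
# form "stored: Hon_derd~Honderd and forget about Te_Hon_derd~Te_Honderd".
import re
import sys
from collections import Counter

STORED = re.compile(r"stored:\s+(\S+~\S+)\s+and forget about\s+(\S+~\S+)")

def support_counts(stderr_path):
    counts = Counter()
    with open(stderr_path, encoding="utf-8") as f:
        for line in f:
            for m in STORED.finditer(line):
                counts[m.group(1)] += 1  # one more n-gram vouching for this pair
    return counts

if __name__ == "__main__":
    for pair, n in support_counts(sys.argv[1]).most_common():
        print(f"{pair}\t{n}")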
What we do not say so far is 'B', namely: if we say that 'hon_derd' should be corrected as 'honderd' on the basis of the evidence provided by so many bigrams containing these word forms, we should also take the next step and say that those bigrams containing 'hon_derd' cannot also have to be corrected as something else. This something else is identified on the basis of other character confusions and probably without fail represents a more complicated 'solution', e.g. often entailing differences in not just one part of the bigram pair but in both parts.
Saying 'B' would then mean not to 'forget' the actual bigrams evaluated and resolved, but to write these to a list of 'solved' bigrams. For reasons I will explain later, this list would be produced by TICCL-LDcalc for later use (most likely by TICCL-rank), not used within TICCL-LDcalc itself.
So, this modifies the request I made earlier, which was not to forget about the 'validated' bigrams but to output them. That would be an option, but the right way now seems to me to make proper use of the good work actually already done by TICCL-LDcalc and to cash in on it further.
That would mean, further down the line (i.e. in the next step), filtering away all the spurious 'solutions' for each bi/trigram brought forward by the system. In the full run on the 2.5 million pages of National Archives 'Ysberg' data, for the single trigram 'te_hon_derd' alone - regardless of capitalization - this already amounts to 121 spurious solutions.
In so doing, I am confident we thoroughly narrow the search space and impose a highly valuable restriction on the total amount of work still to be done. I also think this will largely remove the need for our current module TICCL-chainclean, which in fact is meant to try and solve the very many problems created by not saying 'B' earlier on in the pipeline.
I will now detail why I ask for a separate list of 'solved' underscore/hyphen bi/trigrams.
The huge NA Ysberg corpus results in TICCL-indexer producing a huge index amounting to 371G. TICCL-LDcalc needs to keep everything in memory and my servers are limited to 256GB of RAM.
I found a good solution to be to split the index file on the basis of its lines, each of which represents a single character confusion. In fact, TICCL-indexer now also produces a *ConfStats file which, for each character confusion, details how many ngram pairs in the corpus were found to display that particular confusion. These pair counts follow a power law, and I found it viable to split the index file according to each power, resulting in e.g. one list containing the character confusions with hundreds of thousands of pairs, the next with tens of thousands, the following with just thousands, etc.
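A rough sketch of the kind of split I mean, under the assumption that the *ConfStats file holds one '<confusion value> <pair count>' record per line and that each index line starts with the confusion value followed by '#' (both are assumptions about the exact formats, for illustration only):

```python
# split_index.py -- sketch of splitting a TICCL-indexer index by the order
# of magnitude of the number of pairs per character confusion.
# Format assumptions (for illustration, not verified against the tools):
# the *ConfStats file has "<confusion value> <pair count>" per line, and
# each index line starts with "<confusion value>#".
import math
import sys

def load_magnitudes(confstats_path):
    """Map each confusion value to the order of magnitude of its pair count."""
    magnitude = {}
    with open(confstats_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[1].isdigit():
                count = int(fields[1])
                magnitude[fields[0]] = int(math.log10(count)) if count > 0 else 0
    return magnitude

def split_index(index_path, confstats_path, prefix):
    magnitude = load_magnitudes(confstats_path)
    handles = {}
    with open(index_path, encoding="utf-8") as src:
        for line in src:
            mag = magnitude.get(line.split("#", 1)[0], 0)
            if mag not in handles:  # one output file per power of ten
                handles[mag] = open(f"{prefix}.1e{mag}.index", "w", encoding="utf-8")
            handles[mag].write(line)
    for handle in handles.values():
        handle.close()

if __name__ == "__main__":
    # usage: python split_index.py <index> <confstats> <output prefix>
    split_index(sys.argv[1], sys.argv[2], sys.argv[3])
```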
TICCL-LDcalc can then be run that many times, each time on a manageable subset of the index file. This is viable because each character confusion proposes just its own CCs, i.e. they are all independent of each other. After running on all the subsets, the different output files can then be concatenated and fed to TICCL-rank as one single large file.
It is prior to this step that, by means of a simple filtering script, this list could be rid of all the bi/trigrams already solved by LDcalc, or, if TICCL-rank could be modified to that end, at the time the ldcalc list is read in by TICCL-rank.
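The 'simple filtering script' I have in mind could look roughly like the sketch below. It assumes a hypothetical 'solved' list as a modified TICCL-LDcalc might emit it (one solved bi/trigram variant, or 'variant~CC' pair, per line) and that the first '#'-separated field of an ldcalc record is the variant, as in 'Te_Hon_derd#1#ten_honderd#...':

```python
# filter_solved.py -- sketch of the post-LDcalc filtering step proposed above.
# Assumes (1) a hypothetical 'solved' list, as a modified TICCL-LDcalc might
# emit it, with one solved bi/trigram variant or 'variant~CC' pair per line,
# and (2) that the first '#'-separated field of an ldcalc record is the
# variant, as in "Te_Hon_derd#1#ten_honderd#...".
import sys

def load_solved(path):
    """Read the solved bi/trigram variants; keep only the variant of a pair."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().split("~", 1)[0] for line in f if line.strip()}

def filter_ldcalc(ldcalc_path, solved_path, out_path):
    solved = load_solved(solved_path)
    kept = dropped = 0
    with open(ldcalc_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.split("#", 1)[0] in solved:
                dropped += 1  # bi/trigram already resolved via its split unigram
            else:
                dst.write(line)
                kept += 1
    print(f"kept {kept} records, dropped {dropped} already-solved ones",
          file=sys.stderr)

if __name__ == "__main__":
    # usage: python filter_solved.py <ldcalc file> <solved list> <output>
    filter_ldcalc(sys.argv[1], sys.argv[2], sys.argv[3])
```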
It seems to me we have misguidedly imposed a restriction on TICCL-LDcalc not to return higher ngram pairs where the variant and Correction Candidate (CC) differ only in a single (?) underscore (= space) or hyphen. I suppose I at some point expected this restriction to lighten TICCL's overall workload. The result is that the later modules cannot converge on the best fitting resolution of the split word, due to the contradiction between the unigram solution and those offered by the bi- or possibly trigrams. Ultimately, FoLiA-correct fails to find the right bi- and trigrams to correct.
Example LD-calc output:
We do not get the CC: 'is_honderd'.
This results in the bi/trigram correction never getting the most plausible resolution for split words, while still getting hundreds of less plausible Correction Candidates (CCs). That leads to suboptimal ranking of the CCs and to chaos further on in the pipeline, especially in TICCL-chainclean, which on the current very large test on about 2.3 million pages of HTRed text fails to make progress even after days.
We observe the same to be true for hyphens in ngram corrections. See the section 'Hyphens:' below.
This restriction is possibly implemented as simply as: for the confusion values for underscore or hyphen, do not return word pairs where the CC would be a bi- or trigram, i.e. only unigrams are allowed as CC. (This will probably not fully cover it...)
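In pseudocode, I imagine the filter to be something along these lines (purely my guess at the logic, not the actual TICCL-LDcalc code):

```python
# Guessed logic only -- not the actual TICCL-LDcalc source.
def keep_pair(is_separator_only_confusion: bool, correction_candidate: str) -> bool:
    """Suspected filter: when the character confusion is nothing but an added
    underscore (space) or hyphen, only accept a unigram Correction Candidate."""
    if is_separator_only_confusion:
        # a CC containing '_' or '-' is itself a bi/trigram or hyphenated form
        if "_" in correction_candidate or "-" in correction_candidate:
            return False  # this is what would drop e.g. is_hon_derd ~ is_honderd
    return True

# Under this guess, Hon_derd ~ honderd survives (unigram CC) while the
# bigram CC is_honderd is filtered out -- matching what we observe.
assert keep_pair(True, "honderd") is True
assert keep_pair(True, "is_honderd") is False
```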
However implemented, I would now like to see the restriction removed.
The story, more in full, for both underscores and hyphens:
Underscores:
TICCL-rank currently correctly returns e.g. the unigram pair:
'Grep' on the ranked list:
The bigram, i.e. the split unigram, is correctly resolved. We also get two trigrams containing the bigram.
The CC for the first trigram 'Te_Hon_derd' is 'nice' in light of the fact that we currently prefer what we now regard as the archaic form with 'ten' in Dutch. However, the more plausible form for these diachronic texts would have 'te', which has higher corpus frequencies (you need to subtract the artifrq '98765432' to get at the actual corpus frequencies; e.g. a listed value of 98766330 corresponds to an actual corpus frequency of 98766330 - 98765432 = 898):
For the second trigram 'Hon_derd_halve' we see that the actual bigram containing just 'halve' is not returned by TICCL-LDcalc here:
After TICCL-rank this results in:
But on the higher ngram level, and allowing for more character confusions than only an extra space (represented here as an underscore):
This results in chaos down the line: TICCL-chain and especially TICCL-chainclean fail to further resolve these contradictory results.
Hyphens:
We see the same happening with hyphens.
Our current corpus frequency list has the following bigrams:
versus:
Here too, TICCL-LDcalc does not return the most plausible CC:
We hope this can be remedied shortly! Thanks! MRE