LanguageMachines / ticcltools

Tools for TICCL
GNU General Public License v3.0

TICCL-LDcalc output of frequency draw word pairs #42

Open martinreynaert opened 4 years ago

martinreynaert commented 4 years ago

In TICCL-LDcalc it may happen that the frequencies of words in a retrieved pair are the same.

In the case of such a draw, it is actually more likely (for diverse reasons) that the word form having the larger anagram value is the 'variant' and the one having the lower one the 'correction candidate'. Please output these accordingly.
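
The requested rule can be sketched as follows. This is a minimal illustration in Python, not the TICCL-LDcalc code: the function name `orient_pair` and the tuple layout are hypothetical, the "less frequent word is the variant" branch is an assumption based on the swapping behaviour described in this thread, and the two anagram values are the ones reported further down.

```python
def orient_pair(a, b):
    """Return (variant, correction_candidate) for two (word, freq, anagram_value) tuples.

    Assumed rule: the less frequent word is the variant; on a frequency
    draw, the word with the larger anagram value is the variant.
    """
    (_, freq_a, av_a), (_, freq_b, av_b) = a, b
    if freq_a != freq_b:
        return (a, b) if freq_a < freq_b else (b, a)
    # Frequency draw: fall back on the anagram values.
    return (a, b) if av_a > av_b else (b, a)


variant, candidate = orient_pair(
    ("vryfsteen", 1, 162731772280),
    ("vryffsteen", 1, 190884829123),
)
print(variant[0], "->", candidate[0])
```

With equal frequencies the word with the larger anagram value ends up on the variant side, which is the ordering asked for here.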

Thank you!

kosloot commented 4 years ago

I checked the code and it is not as easy as I thought. At the moment the anagram value is only available for one of the two words, so it would have to be added for the other word too. This makes me wonder whether there is a mistake in the current code, as we do swap the variant and the CC when the frequency is smaller. Shouldn't we swap the hashes too then? (The anagram hash value for the variant is stored in the LDcalc output file as field 7; after swapping it is in fact the hash for the CC.)

martinreynaert commented 4 years ago

Hi Ko,

The value in field 7 is in fact the numerical difference between the Anagram Values of the pair. It is a value from the character confusion list produced on the basis of the alphabet by TICCL-lexstat and stands for a difference (usually) of just two characters, at most. TICCL-indexer(NT) attaches to these character confusion values the lower value of any word pair (in fact: set of word anagrams) identified.

So TICCL-LDcalc reads in these character confusion values (column 1 in TICCL-indexer output) and for each of them picks the attached values (which are the lower ones) to retrieve from the anahash the set of word anagrams, i.e. the word(s), associated to this value and pairs them to the other set of word(s) also retrieved from the anahash. This retrieval is done on the basis of the sum of the character confusion value with the associated (lower) word anagram value. So the result of this addition, i.e. sum, gives the value for the higher AV.

At least at the start of LDcalc, you therefore have both the lower and higher values at hand.
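
A rough sketch of that lookup, assuming the anahash and the index are held as plain dictionaries (a hypothetical in-memory layout, not the actual TICCL data structures); the function name `candidate_pairs` is made up for illustration, and the numbers are taken from the example further down in this thread:

```python
# Toy stand-ins for the TICCL-anahash and TICCL-indexer outputs
# (hypothetical layout; values taken from this thread).
anahash = {
    162731772280: ["vryfsteen"],
    190884829123: ["vryffsteen"],
}
index = {
    28153056843: [162731772280],  # confusion value -> attached lower anagram values
}


def candidate_pairs(index, anahash):
    """Yield (low_av, low_words, high_av, high_words) per confusion value."""
    for confusion, low_avs in index.items():
        for low_av in low_avs:
            high_av = low_av + confusion  # the sum gives the higher AV
            if low_av in anahash and high_av in anahash:
                yield low_av, anahash[low_av], high_av, anahash[high_av]


pairs = list(candidate_pairs(index, anahash))
for low_av, low_words, high_av, high_words in pairs:
    print(low_av, low_words, high_av, high_words)
```

Both anagram values are known at the moment the pair is formed, so nothing extra would need to be looked up to compare them.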

LDcalc next proceeds to look at the associated words' frequencies etc.

Hope this helps!

Martin

kosloot commented 4 years ago

I tried to implement this and installed the fix on maize and violet. I see no differences in the results, though. So either I made a mistake or my test set is inadequate. @martinreynaert, would you please check it? And if it doesn't work, please provide me with an example of an entry that should be 'reversed'.

martinreynaert commented 4 years ago

Hi Ko,

Thank you!

I see no difference on maize between the LDcalc-output of the previous version and the current one either:

reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ diff GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.ldcalc GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.LDCALC.ldcalc
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$

And I re-ran the same on violet: no difference with the new maize output there either:

reynaert@violet:~$ diff /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.ldcalc /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.ldcalc
reynaert@violet:~$

So something did not work as expected.

To try and help solve this, I extracted the hapaxes from the corpus frequency list and ran TICCL-LDcalc on those only. That artificially produces nothing but draws.
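
For reference, extracting the hapaxes could look roughly like this. It is only a sketch: it assumes the frequency list has one `word<TAB>frequency` entry per line, which is an assumption about the file format, and both the helper name `hapaxes` and the sample entries are purely illustrative.

```python
def hapaxes(lines):
    """Return the words whose frequency is exactly 1."""
    result = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so that the word itself is kept intact.
        word, freq = line.rsplit("\t", 1)
        if int(freq) == 1:
            result.append(word)
    return result


# Illustrative sample only, not real corpus counts.
sample = ["vryffsteen\t1", "zeestucken\t6", "walvisbenen\t1"]
print(hapaxes(sample))
```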

This here is the command line:

reynaert@violet:~$ nohup /exp/sloot/usr/local/bin/TICCL-LDcalc --threads 124 --LD 2 --low=5 --high=35 --index /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.INDEX.index --hash=/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash --clean /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.UNK.UnigramHapaxesOnly.clean --alph=/reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.lc.chars --artifrq 98765432 -o /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly > /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.20200531.VIOLET.UnigramHapaxesOnly.stdout 2> /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.20200531.VIOLET.UnigramHapaxesOnly.stderr &

We select 4 examples from the tail of the output:

reynaert@violet:~$ grep 'vryffsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
vryfsteen~1~1~vryffsteen~1~1~28153056843~1~9~0~1~1~0~0

reynaert@violet:~$ grep 'walvisbeenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
walvisbenen~1~1~walvisbeenen~1~1~11592740743~1~11~0~1~1~0~0

reynaert@violet:~$ grep 'wereltskaert' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
wereltkaart~1~1~wereltskaert~1~1~12953462985~2~10~0~1~1~0~0

reynaert@violet:~$ grep '^zeestucxke' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.VIOLET.UnigramHapaxesOnly.ldcalc
zeestucxken~1~1~zeestucxkien~1~1~14693280768~1~11~0~1~1~0~0

Their AVs:

reynaert@violet:~$ grep 'vryfsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
162731772280~vryfsteen
reynaert@violet:~$ grep 'vryffsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
190884829123~vryffsteen

reynaert@violet:~$ grep 'walvisbenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
187591370533~walvisbenen
reynaert@violet:~$ grep 'walvisbeenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
199184111276~walvisbeenen

reynaert@violet:~$ grep 'wereltkaart' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
168947636901~wereltkaart
reynaert@violet:~$ grep 'wereltskaert' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
181901099886~wereltskaert

reynaert@violet:~$ grep 'zeestucxken' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
204409956510~Zeestucxken#zeestucxken
reynaert@violet:~$ grep 'zeestucxkien' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.OrderOfMagnitude.ANAHASH.anahash
219103237278~zeestucxkien

The word forms nearer to the modern canonical form (if there is or would be such) are consistently the lower AV forms.
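
The numbers above can be cross-checked with a few lines of Python. The sketch below only re-uses the lines pasted in this comment; it assumes that `~` separates the fields, that `#` separates word forms sharing one anagram value in the anahash, and that field 7 of the LDcalc output is the character confusion value.

```python
ldcalc_lines = [
    "vryfsteen~1~1~vryffsteen~1~1~28153056843~1~9~0~1~1~0~0",
    "walvisbenen~1~1~walvisbeenen~1~1~11592740743~1~11~0~1~1~0~0",
    "wereltkaart~1~1~wereltskaert~1~1~12953462985~2~10~0~1~1~0~0",
    "zeestucxken~1~1~zeestucxkien~1~1~14693280768~1~11~0~1~1~0~0",
]
anahash_lines = [
    "162731772280~vryfsteen",
    "190884829123~vryffsteen",
    "187591370533~walvisbenen",
    "199184111276~walvisbeenen",
    "168947636901~wereltkaart",
    "181901099886~wereltskaert",
    "204409956510~Zeestucxken#zeestucxken",
    "219103237278~zeestucxkien",
]

# Map every word form to its anagram value.
av_of = {}
for line in anahash_lines:
    value, words = line.split("~", 1)
    for word in words.split("#"):
        av_of[word] = int(value)

# For each pair, compare the AV difference with field 7.
checks = []
for line in ldcalc_lines:
    fields = line.split("~")
    first, second, confusion = fields[0], fields[3], int(fields[6])
    checks.append((first, second, av_of[second] - av_of[first] == confusion))

for first, second, ok in checks:
    print(first, second, ok)
```

In all four pairs the difference between the two anagram values equals field 7, and the word currently printed first is the one with the lower anagram value.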

I would very much like to see the output reversed!

Thanks!

Martin

kosloot commented 4 years ago

So I made a small change ON MAIZE ONLY! Now the results ARE reversed. But I wonder whether this works out well in general for all word pairs. Maybe we need to look a bit deeper into the code. But at least you can test this.

Happy testing

martinreynaert commented 4 years ago

Many thanks, Ko!

Sure I will test this!!!

Starting it up right now ;0)

M.

martinreynaert commented 4 years ago

Yes! They're all reversed now :0)

reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'walvisbeenen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
walvisbeenen~1~1~walrusbeenen~98765432~98765432~6864473168~2~10~1~1~1~0~0
walvisbeenen~1~1~walvis-beenen~98765432~98765432~35723051649~1~12~1~1~1~0~0
walvisbeenen~1~1~walvis_beene~98765433~98765433~1125720992~2~10~1~1~0~0~0
walvisbeenen~1~1~walvis_beenen~98765433~98765433~11040808032~1~12~1~1~1~0~1
walvisbeenen~1~1~walvisbeen~100000004~100000004~23759269767~2~10~1~1~1~0~2
walvisbeenen~1~1~walvisbeene~2~2~12166529024~1~11~0~1~0~0~0
walvisbeenen~1~1~walvisbeerden~98765432~98765432~18219703433~2~11~1~1~1~0~0
walvisbeenen~1~1~walvisbenen~1~1~11592740743~1~11~0~1~1~0~4
walvisbeenen~1~1~walvisbiene~2~2~9065988999~2~10~0~1~0~0~0
walvisbeenen~1~1~walvischbeenen~98765432~98765432~47760777568~2~12~1~1~1~0~0
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'vryffsteen' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
vryff_steen~1~1~vryffsteen~1~1~11040808032~1~10~0~1~1~0~0
vryffsteen~1~1~graffsteen~98765432~98765432~25245524877~2~8~1~0~1~0~0
vryffsteen~1~1~vrijfsteen~98765438~98765438~18190663819~2~8~1~1~1~0~0
vryffsteen~1~1~vryfsteen~1~1~28153056843~1~9~0~1~1~0~1
vryffsteen~1~1~wryffsteen~2~2~3378826023~1~9~0~0~1~0~1
vryffsteen~1~1~wryfsteen~98765432~98765432~24774230820~2~8~1~0~1~0~0
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'wereltskaert' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
wereltskaert~1~1~Werelt_kaert~98765433~98765456~4345431517~1~11~1~1~1~0~0
wereltskaert~1~1~wereldt_kaert~98765434~98765434~13277985315~2~11~1~1~1~0~1
wereltskaert~1~1~werelt_caert~98765439~98765439~1283622659~2~10~1~1~1~0~1
wereltskaert~1~1~werelt_kaert~98765455~98765456~4345431517~1~11~1~1~1~0~2
wereltskaert~1~1~wereltcaert~2~2~9757185373~2~10~0~1~1~0~0
wereltskaert~1~1~wereltkaart~1~1~12953462985~2~10~0~1~1~0~0
wereltskaert~1~1~werelts-kaerte~98765432~98765432~47315792392~2~12~1~1~0~0~0
wereltskaert~1~1~werelts_caert~98765434~98765434~16669862208~2~11~1~1~1~0~1
wereltskaert~1~1~werelts_kaart~98765433~98765433~13473584596~2~11~1~1~1~0~0
reynaert@maize:/reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT$ grep 'zeestucxkien' /reddata/GoldenAgents/Montias/Getty_Frick/TICCLTICCLAT/GettyFrick.Contents.Dutch.wordfreqlist.1to3ngrams.LDCALC.reversedraw.2.ldcalc
zeestucxkien~1~1~Zeestucxken~2~3~14693280768~1~11~0~1~1~0~0
zeestucxkien~1~1~seestuckien~1~1~48169707983~2~10~0~0~1~0~0
zeestucxkien~1~1~zee_stuckien~2~2~21997561375~2~10~0~1~1~0~0
zeestucxkien~1~1~zee_stucxken~3~3~3652472736~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestucken~6~6~47731650175~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestuckgen~2~2~29307298382~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestuckie~4~4~45204898431~2~10~0~1~0~0~0
zeestucxkien~1~1~zeestuckien~4~4~33038369407~1~11~0~1~1~0~2
zeestucxkien~1~1~zeestuckiens~2~2~17652129858~2~10~0~1~0~0~0
zeestucxkien~1~1~zeestuckies~4~4~29818658882~2~10~0~1~0~0~0
zeestucxkien~1~1~zeestuckijen~3~3~6011287775~2~10~0~1~1~0~0
zeestucxkien~1~1~zeestucxken~1~3~14693280768~1~11~0~1~1~0~2
zeestucxkien~1~1~zeestucxkens~2~2~692958781~2~10~0~1~0~0~0

Btw, these are all words from Dutch 'Golden Age' notarial descriptions of house inventories about paintings. A 'zeestucxkien' would have been a small painting depicting a sea scene.

Will now run the full thing ;0)

kosloot commented 4 years ago

Nice, this looks good. My doubts arose after running some tests on a small dataset, where I see reversals like:

verllooren~1~1~verftooren~1~1~7579697257~2~8~0~1~1~0~0

to

verftooren~1~1~verllooren~1~1~7579697257~2~8~0~1~1~0~0

and:

ANSCHE~1~1~ganfche~1~1~29957217696~2~5~0~0~1~0~0

to

ganfche~1~1~ANSCHE~1~1~29957217696~2~5~0~0~1~0~0

Which doesn't look like progress to me, and the complete removal of:

C^sars~1~1~Cssars~1~1~4183180267~1~5~0~1~1~0~0
ai.der~1~1~aonder~1~1~2953462985~2~4~0~1~1~0~0

This is because the left sides are 'out of the lexicon' and are deleted after reversal. Which may be a good thing after all.

On a side note: shouldn't the --alph option be made mandatory for LDcalc? It isn't at the moment, which allows LDcalc to create corrections to non-alphabet words.

kosloot commented 4 years ago

Reminder for @martinreynaert: on a side note, shouldn't the --alph option be made mandatory for LDcalc? It isn't at the moment, which allows LDcalc to create corrections to non-alphabet words.