LanguageMachines / ticcltools

Tools for TICCL
GNU General Public License v3.0

TICCL-chain does not complete the job #37

Closed martinreynaert closed 4 years ago

martinreynaert commented 5 years ago

For some reason, TICCL-chain fails to chain all the variants.

I have prepared a small test file, which I will attach here. Its name is: EDBO.parelvisserij.ranklist.ranked.txt

You will find the current TICCL-chain output in the file: EDBO.parelvisserij.ranklist.ranked.CHAIN.chained.txt

The following word forms in the output still appear both as variant and as correction candidate. That should not happen.

reynaert@violet:/opensonar/EDBO/TICCLAT$ cat Parelchained.L1.txt Parelchained.L2.txt |sort |uniq -c |grep ' 2 '
2 paarel-visfchers
2 paarlvisfcherij
2 Paerlvisfcherij

You can ignore the first of these: the data necessary to chain that one properly simply is not in the test file.

The two others have this result in 'chained'. (I grep for the last one first; that in fact gives the 'solution' for the second.)

reynaert@violet:/opensonar/EDBO/TICCLAT$ grep -i --color 'Paerlvisfcherij' EDBO.parelvisserij.ranklist.ranked.CHAIN.chained
Paerlvisfcherij#2#parelvisscherij#100000000#12766817294#3#C
paerlvisfcherij#2#parelvisscherij#100000000#12766817294#3#C
Paarlvisfcherij#1#Paerlvisfcherij#2#2432776564#1#C
Paarlvisfcherijen#1#Paerlvisfcherij#2#26192046331#3#C
Paarlvisfchrrij#1#Paerlvisfcherij#2#3602851446#2#C
paarlvisfcherij#2#Paerlvisfcherij#2#2432776564#1#C
paerl-visfcherij#2#Paerlvisfcherij#2#35723051649#1#C

reynaert@violet:/opensonar/EDBO/TICCLAT$ grep -i --color 'paarlvisfcherij' EDBO.parelvisserij.ranklist.ranked.CHAIN.chained
Paarlvisfcherij#1#Paerlvisfcherij#2#2432776564#1#C
Paarlvisfcherijen#1#Paerlvisfcherij#2#26192046331#3#C
paarlvisfcherij#2#Paerlvisfcherij#2#2432776564#1#C
paarlvisfeherij#1#paarlvisfcherij#2#13290459257#1#C
paarlvjsfcheri#1#paarlvisfcherij#2#14693280768#2#C

Please look into this.

Martin

EDBO.parelvisserij.ranklist.ranked.CHAIN.chained.txt EDBO.parelvisserij.ranklist.ranked.txt

kosloot commented 5 years ago

To reproduce this, I need the alphabet file too. Also: did you run chain with or without --caseless ?

martinreynaert commented 5 years ago

Here's the alphabet file:

nld.aspell.dict.clip20.lc.chars.txt

(GitHub would not accept the extension .chars, so I added .txt)

And yes, I did run with --caseless.

kosloot commented 5 years ago

Thanx. Yes I know about the filenames.

martinreynaert commented 5 years ago

For the record: for now it is the intention that all non-word forms get aligned to their most likely correct historical forms and not to the contemporary lemma, in this case: 'parelvisserij'. All the known diachronic forms are linked to that in the TICCLAT database.

martinreynaert commented 5 years ago

I seem to get the exact same result without --caseless:

reynaert@violet:/opensonar/EDBO/TICCLAT$ cat ParelL1caseless ParelL2caseless |sort |uniq -c |grep ' 2 '
2 paarel-visfchers
2 paarlvisfcherij
2 Paerlvisfcherij

kosloot commented 5 years ago

So you got me confused again.... The output of TICCL-chain looks totally different from what you call ParelL1caseless etc. So what trickery did you do there? And is it even important? Maybe it is better to explain what you expected to come out.

What should be linked to what? From these greps it is quite unclear.

BTW: caseless only affects the LD calculations used on output of the results

martinreynaert commented 5 years ago

OK, I have a clearer example.

The ranked file has just 5 pairs. They are in the same order as in the larger *ranked file I took them from (which is sorted in descending order of the frequency of the correction candidates).

mres-MacBook-Pro:EDBO mre$ cat VOC.nauwkeurig.ranked.txt
naankeurig#1#nauwkeurig#100000002#24127620869#2#1
nauwkourig#1#nauwkeurig#100000002#4512359257#1#0.911565
nauwheurig#1#nauwkeurige#100000001#7969308999#2#1
namwkeung#1#namwkeurig#1#15289567369#2#1
namwkeurig#1#nauwheurig#1#2702367963#2#1

The outcome is:

mres-MacBook-Pro:EDBO mre$ cat VOC.nauwkeurig.ranked.CHAIN.chained.txt
naankeurig#1#nauwkeurig#100000002#24127620869#2#C
nauwkourig#1#nauwkeurig#100000002#4512359257#1#C
namwkeurig#1#nauwkeurige#100000001#10671676962#2#C
nauwheurig#1#nauwkeurige#100000001#7969308999#2#C
namwkeung#1#namwkeurig#1#15289567369#2#C

The earlier grep looks at the overlap between columns 1 and 3. There should not be any overlap left in *chained.
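
For anyone who wants to repeat that check without intermediate files, here is a minimal sketch in Python. It is not part of ticcltools. It assumes each line is a '#'-separated record with the variant in field 1 and the correction candidate in field 3, and that word forms never contain '#' themselves. The function name `overlap` is just something I made up for illustration.

```python
def overlap(lines):
    """Return the word forms that occur both as variant (field 1)
    and as correction candidate (field 3) in '#'-separated records."""
    variants = set()
    candidates = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        fields = line.split("#")
        variants.add(fields[0])    # column 1: the variant
        candidates.add(fields[2])  # column 3: the correction candidate
    return sorted(variants & candidates)


# The five chained records from VOC.nauwkeurig.ranked.CHAIN.chained.txt
chained = [
    "naankeurig#1#nauwkeurig#100000002#24127620869#2#C",
    "nauwkourig#1#nauwkeurig#100000002#4512359257#1#C",
    "namwkeurig#1#nauwkeurige#100000001#10671676962#2#C",
    "nauwheurig#1#nauwkeurige#100000001#7969308999#2#C",
    "namwkeung#1#namwkeurig#1#15289567369#2#C",
]

print(overlap(chained))  # prints ['namwkeurig']
```

An empty list would mean that no word form is still both a variant and a correction candidate.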

There appears to be something wrong in the recursion that TICCL-chain needs. Since 'namwkeurig' has been linked to 'nauwkeurige', it follows that its variant 'namwkeung' (and any others that might be there) should also be linked to 'nauwkeurige' in the end.
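
To make the expected outcome concrete, here is a rough sketch in Python of the kind of chain-following I have in mind. This is an illustration only, not how TICCL-chain is actually implemented. It assumes that the first pair seen for a variant is the one to keep (the ranked file being sorted on candidate frequency) and it simply stops when it runs into a cycle. The name `resolve_chains` is hypothetical.

```python
def resolve_chains(pairs):
    """Map every variant to the end of its correction chain.

    pairs: iterable of (variant, candidate) tuples, in ranked order.
    Assumption: the first candidate seen for a variant wins.
    """
    first_choice = {}
    for variant, candidate in pairs:
        first_choice.setdefault(variant, candidate)

    resolved = {}
    for variant, target in first_choice.items():
        seen = {variant, target}
        # Keep following links while the current target is itself a variant.
        while target in first_choice:
            nxt = first_choice[target]
            if nxt in seen:
                break  # cycle guard: stay at the current target
            seen.add(nxt)
            target = nxt
        resolved[variant] = target
    return resolved


# The five pairs from VOC.nauwkeurig.ranked.txt (columns 1 and 3 only)
ranked = [
    ("naankeurig", "nauwkeurig"),
    ("nauwkourig", "nauwkeurig"),
    ("nauwheurig", "nauwkeurige"),
    ("namwkeung", "namwkeurig"),
    ("namwkeurig", "nauwheurig"),
]

for variant, target in resolve_chains(ranked).items():
    print(variant, "->", target)
```

With these five pairs, 'namwkeung', 'namwkeurig' and 'nauwheurig' all end up at 'nauwkeurige', so no word form remains in both columns.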

Hope this helps!

VOC.nauwkeurig.ranked.CHAIN.chained.txt VOC.nauwkeurig.ranked.txt

kosloot commented 4 years ago

I assume this is fixed now.