Closed: martinreynaert closed this 4 years ago
To reproduce this, I need the alphabet file too. Also: did you run chain with or without --caseless ?
Here's the alphabet file:
nld.aspell.dict.clip20.lc.chars.txt
(Github did not want to take the extension chars, I added .txt)
And yes, I did run with --caseless.
Thanx. Yes I know about the filenames.
For the record: for now it is the intention that all non-word forms get aligned to their most likely correct historical forms and not to the contemporary lemma, in this case: 'parelvisserij'. All the known diachronic forms are linked to that in the TICCLAT database.
I seem to get the exact same result without --caseless:
reynaert@violet:/opensonar/EDBO/TICCLAT$ cat ParelL1caseless ParelL2caseless |sort |uniq -c |grep ' 2 '
2 paarel-visfchers
2 paarlvisfcherij
2 Paerlvisfcherij
So you got me confused again... The output of TICCL-chain looks totally different from what you call ParelL1caseless etc. So which trickery did you do there? And is it even important? Maybe it is better to explain what you expected to come out.
What should be linked to what? From these greps it is quite unclear.
BTW: caseless only affects the LD calculations used on output of the results
OK, I have a clearer example.
The ranked file has just 5 pairs. They appear in the same order as in the larger *ranked file I took them from (which is sorted in descending order of the frequency of the correction candidates).
mres-MacBook-Pro:EDBO mre$ cat VOC.nauwkeurig.ranked.txt
naankeurig#1#nauwkeurig#100000002#24127620869#2#1
nauwkourig#1#nauwkeurig#100000002#4512359257#1#0.911565
nauwheurig#1#nauwkeurige#100000001#7969308999#2#1
namwkeung#1#namwkeurig#1#15289567369#2#1
namwkeurig#1#nauwheurig#1#2702367963#2#1
The outcome is:
mres-MacBook-Pro:EDBO mre$ cat VOC.nauwkeurig.ranked.CHAIN.chained.txt
naankeurig#1#nauwkeurig#100000002#24127620869#2#C
nauwkourig#1#nauwkeurig#100000002#4512359257#1#C
namwkeurig#1#nauwkeurige#100000001#10671676962#2#C
nauwheurig#1#nauwkeurige#100000001#7969308999#2#C
namwkeung#1#namwkeurig#1#15289567369#2#C
The grep in my earlier message looks at the overlap between columns 1 and 3. There should not be any overlap left in *chained.
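The overlap check between columns 1 and 3 can also be sketched in Python. This is only an illustration, assuming the `#`-separated layout visible in the files above (variant in field 1, correction candidate in field 3); the function name `overlap` is hypothetical and not part of TICCL.

```python
def overlap(lines):
    """Return word forms that occur both as a variant (column 1)
    and as a correction candidate (column 3)."""
    variants = set()
    candidates = set()
    for line in lines:
        fields = line.strip().split("#")
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        variants.add(fields[0])
        candidates.add(fields[2])
    return sorted(variants & candidates)


# The five lines of VOC.nauwkeurig.ranked.CHAIN.chained.txt quoted above.
chained = [
    "naankeurig#1#nauwkeurig#100000002#24127620869#2#C",
    "nauwkourig#1#nauwkeurig#100000002#4512359257#1#C",
    "namwkeurig#1#nauwkeurige#100000001#10671676962#2#C",
    "nauwheurig#1#nauwkeurige#100000001#7969308999#2#C",
    "namwkeung#1#namwkeurig#1#15289567369#2#C",
]

print(overlap(chained))  # ['namwkeurig']
```

On this output the check reports 'namwkeurig', which is exactly the form that still appears on both sides.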
There appears to be something wrong in the necessary recursiveness of TICCL-chain. Given that 'namwkeurig' has been linked to 'nauwkeurige', it follows that its variant 'namwkeung' (and all others that might be there) should also be linked to 'nauwkeurige' in the end.
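To make the expected result concrete, here is a minimal Python sketch of transitive chaining. It is not TICCL-chain's actual implementation; it assumes that each variant keeps only its first (highest-ranked) link and that links are followed until a form is reached that has no further correction candidate. The name `resolve_chains` is hypothetical.

```python
def resolve_chains(pairs):
    """Map every variant to its final correction candidate by
    following variant -> candidate links transitively."""
    link = {}
    for variant, candidate in pairs:
        # Assumption: the first link seen for a variant wins.
        link.setdefault(variant, candidate)

    def final_target(word):
        seen = {word}
        while word in link:
            nxt = link[word]
            if nxt in seen:  # guard against cycles in the input
                break
            seen.add(nxt)
            word = nxt
        return word

    return {variant: final_target(variant) for variant in link}


# The five pairs from VOC.nauwkeurig.ranked.txt (columns 1 and 3).
ranked = [
    ("naankeurig", "nauwkeurig"),
    ("nauwkourig", "nauwkeurig"),
    ("nauwheurig", "nauwkeurige"),
    ("namwkeung", "namwkeurig"),
    ("namwkeurig", "nauwheurig"),
]

for variant, target in resolve_chains(ranked).items():
    print(variant, "->", target)
```

Under these assumptions 'namwkeung' ends up at 'nauwkeurige' via 'namwkeurig' and 'nauwheurig', so no form remains both a variant and a correction candidate.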
Hope this helps!
VOC.nauwkeurig.ranked.CHAIN.chained.txt VOC.nauwkeurig.ranked.txt
assume this is fixed now
For some reason, TICCL-chain fails to chain all the variants.
I have prepared a small test file, which I will attach here. Its name is: EDBO.parelvisserij.ranklist.ranked.txt
You can find the current TICCL-chain output in file: EDBO.parelvisserij.ranklist.ranked.CHAIN.chained.txt
The following word forms in the output still appear both as variant and as correction candidate. That should not happen.
reynaert@violet:/opensonar/EDBO/TICCLAT$ cat Parelchained.L1.txt Parelchained.L2.txt |sort |uniq -c |grep ' 2 '
2 paarel-visfchers
2 paarlvisfcherij
2 Paerlvisfcherij
You can ignore the first of these: the data necessary to chain that one properly simply is not in the test file.
The two others have this result in 'chained'. I grep for the last one first; that in fact gives the 'solution' for the second:
reynaert@violet:/opensonar/EDBO/TICCLAT$ grep -i --color 'Paerlvisfcherij' EDBO.parelvisserij.ranklist.ranked.CHAIN.chained
Paerlvisfcherij#2#parelvisscherij#100000000#12766817294#3#C
paerlvisfcherij#2#parelvisscherij#100000000#12766817294#3#C
Paarlvisfcherij#1#Paerlvisfcherij#2#2432776564#1#C
Paarlvisfcherijen#1#Paerlvisfcherij#2#26192046331#3#C
Paarlvisfchrrij#1#Paerlvisfcherij#2#3602851446#2#C
paarlvisfcherij#2#Paerlvisfcherij#2#2432776564#1#C
paerl-visfcherij#2#Paerlvisfcherij#2#35723051649#1#C
reynaert@violet:/opensonar/EDBO/TICCLAT$ grep -i --color 'paarlvisfcherij' EDBO.parelvisserij.ranklist.ranked.CHAIN.chained
Paarlvisfcherij#1#Paerlvisfcherij#2#2432776564#1#C
Paarlvisfcherijen#1#Paerlvisfcherij#2#26192046331#3#C
paarlvisfcherij#2#Paerlvisfcherij#2#2432776564#1#C
paarlvisfeherij#1#paarlvisfcherij#2#13290459257#1#C
paarlvjsfcheri#1#paarlvisfcherij#2#14693280768#2#C
Please look into this.
Martin
EDBO.parelvisserij.ranklist.ranked.CHAIN.chained.txt EDBO.parelvisserij.ranklist.ranked.txt