LanguageMachines / ticcltools

Tools for TICCL
GNU General Public License v3.0

Request for new ranking feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies #33

Open martinreynaert opened 5 years ago

martinreynaert commented 5 years ago

Hi,

This concerns ranking features:

  (skip[8]?0:(*vit)->pairs1_rank) +

  (skip[10]?0:(*vit)->pairs_combined_rank) +

This is a request for a more informed ranking-feature. This may be a new one or may replace the existing pairs_combined one (preferred).

The pairs1 ranking feature currently counts, for each anagram (character confusion) value, the pairs transferred from LDcalc to rank. Within the set of Correction Candidates (CCs) for a particular variant, the confusion with the highest number of transferred pairs is ranked highest.

This does not always give the highest rank to the most likely CC. Quite spurious confusions, particularly over shorter words, may be ranked higher than confusions that ostensibly recur often in the particular corpus being corrected.

After some experimentation it seems that weighing the frequencies of the CCs proposed for a particular confusion might help. We have tried the mean of the frequencies, but this results in pretty much the same ranking as we currently get in pairs1.

The median of the CCs frequencies, however, appears more likely to deliver the better ranking.

This will probably have to be implemented at the end of rank.

So, given the overall set of pairs in rank that share a particular character confusion value, this new feature needs to calculate the median of the CCs frequencies (their own, not the summed frequency of their capitalised versions). Also, here, the highest median wins, i.e. is accorded rank 1.
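To make the request concrete, here is a minimal sketch of the calculation described above: group the CC frequencies by character confusion value and take the median of each group. The `Pair` struct and the function names are hypothetical and do not come from the TICCL-rank sources. For an even number of values the sketch takes the mean of the two middle values, which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record: one variant/CC pair as it reaches the ranker.
struct Pair {
  std::uint64_t confusion;  // character (anagram) confusion value
  std::uint64_t cc_freq;    // the CC's own frequency, not the summed one
};

// Median of a non-empty list. For an even count this returns the mean
// of the two middle values (an assumption, see the lead-in).
double median(std::vector<std::uint64_t> values) {
  assert(!values.empty());
  std::sort(values.begin(), values.end());
  const std::size_t n = values.size();
  if (n % 2 == 1) {
    return static_cast<double>(values[n / 2]);
  }
  return (static_cast<double>(values[n / 2 - 1]) +
          static_cast<double>(values[n / 2])) / 2.0;
}

// Collect the CC frequencies per confusion value over ALL pairs,
// then reduce every group to its median.
std::map<std::uint64_t, double> median_per_confusion(
    const std::vector<Pair>& pairs) {
  std::map<std::uint64_t, std::vector<std::uint64_t>> grouped;
  for (const Pair& p : pairs) {
    grouped[p.confusion].push_back(p.cc_freq);
  }
  std::map<std::uint64_t, double> result;
  for (const auto& entry : grouped) {
    result[entry.first] = median(entry.second);
  }
  return result;
}
```

The confusion with the highest median would then be accorded rank 1.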

I would very much like to be able to experiment with this soon.

Thanks!

M.

kosloot commented 5 years ago

I am a bit confused about your remark "This will probably have to be implemented at the end of rank." Does this mean that you suggest calculating the median of all frequencies belonging to a character confusion, for all variants it appears in?

My first impression was that it is a 'local' calculation, for one variant with its N CCs. For example, consider this variant:

-eveuzoo~1~1~Eveu_zoo~1~2~25723051649~2~6~0~0~1~0~0
-eveuzoo~1~1~Eveuzoo~28~95~35723051649~1~7~0~0~1~0~0
-eveuzoo~1~1~Kveuzoo~2~2~28061646568~2~6~0~0~1~0~0
-eveuzoo~1~1~evenzoo~100002930~100004079~44559939201~2~6~1~0~1~0~2
-eveuzoo~1~1~eveozoo~2~2~40621368225~2~6~0~0~1~0~0
-eveuzoo~1~1~eveu_zoo~1~2~25723051649~2~6~0~0~1~0~0
-eveuzoo~1~1~eveuzoo~67~95~35723051649~1~7~0~0~1~0~0
-eveuzoo~1~1~geve_zoo~100000003~100000003~28302116432~2~6~1~0~1~0~0

This has the frequencies: 1 1 2 2 28 67 100000003 100002930. The median would be 15, which seems quite useless. So this is apparently NOT what you want.
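For reference, the 'local' interpretation can be sketched as follows: take the CC frequency from each record of a single variant and compute the median over just those values. The assumption that the CC's own frequency is the fifth `~`-separated field is inferred from the frequencies listed above. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Split one record on '~'.
std::vector<std::string> split_tilde(const std::string& line) {
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, '~')) {
    fields.push_back(field);
  }
  return fields;
}

// Assumption: the CC's own frequency is the fifth field (index 4).
std::uint64_t cc_frequency(const std::string& line) {
  const std::vector<std::string> fields = split_tilde(line);
  assert(fields.size() > 4);
  return std::stoull(fields[4]);
}

// Median of the CC frequencies of ONE variant's records. For an even
// count this returns the mean of the two middle values.
double local_median(const std::vector<std::string>& records) {
  assert(!records.empty());
  std::vector<std::uint64_t> freqs;
  for (const std::string& r : records) {
    freqs.push_back(cc_frequency(r));
  }
  std::sort(freqs.begin(), freqs.end());
  const std::size_t n = freqs.size();
  if (n % 2 == 1) {
    return static_cast<double>(freqs[n / 2]);
  }
  return (static_cast<double>(freqs[n / 2 - 1]) +
          static_cast<double>(freqs[n / 2])) / 2.0;
}
```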

Could you clarify a bit?

martinreynaert commented 5 years ago

No, the local calculation is not what I want. I do suggest to calculate the median of all frequencies belonging to a character confusion, for all variants it appears in.

OK, for my tests I have used the following information:

reynaert@red:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ grep '#9496960451#' .RUNAMALGAM5.clean.ldcalc.debug.ranked | cut -d '#' -f 1,3,4,6,16 >bla3
reynaert@red:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ grep '#745481551#' .RUNAMALGAM5.clean.ldcalc.debug.ranked | cut -d '#' -f 1,3,4,6,16 >bla4

I have imported these output files in Excel and have calculated the average/mean and median over column 3 of this output, i.e. the base frequency for the CCs.

So I based this on the info in the debug file output by TICCL-rank.

If you do this on the output of LDcalc, you get larger subsets per confusion value. So the extra filtering in TICCL-rank seems to discard a number of pairs (I hope we do not actually lose any). It would probably be easier to calculate the mean over these from LDcalc. Who knows, the net result might be the same, but I do not know this. Let us say this is an option if it proves too hard to implement this on the subsets actually output to the debug file of rank.

Hope this sufficiently clarifies matters.

kosloot commented 5 years ago

OK, calculating the median per confusion value is a simple preprocessing step on the LDcalc data. Doing it only on the results stored in rank would require a post-processing step, which might be more expensive. I suggest we start with the LDcalc data and see what that brings us. We then use that global value in ranking the CCs per variant.
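A rough sketch of that second step, under stated assumptions: given a precomputed table of medians per confusion value, the CCs of one variant are ranked so that the highest median gets rank 1. The `Candidate` struct and `assign_median_ranks` are hypothetical names, not the actual TICCL-rank code. Giving tied medians the same rank and treating a confusion value missing from the table as median 0 are both assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical view of one correction candidate for a variant.
struct Candidate {
  std::string cc;           // the correction candidate
  std::uint64_t confusion;  // its character confusion value
  int median_rank;          // filled in by assign_median_ranks
};

// Rank the candidates of ONE variant by the GLOBAL median of their
// confusion value: highest median gets rank 1, equal medians share a rank.
void assign_median_ranks(std::vector<Candidate>& candidates,
                         const std::map<std::uint64_t, double>& medians) {
  // Assumption: a confusion value without a known median counts as 0.
  auto median_of = [&medians](const Candidate& c) {
    auto it = medians.find(c.confusion);
    return it == medians.end() ? 0.0 : it->second;
  };

  // Distinct medians present in this candidate set, highest first.
  std::vector<double> distinct;
  for (const Candidate& c : candidates) {
    distinct.push_back(median_of(c));
  }
  std::sort(distinct.begin(), distinct.end(), std::greater<double>());
  distinct.erase(std::unique(distinct.begin(), distinct.end()),
                 distinct.end());

  // The rank is the 1-based position of the candidate's median.
  for (Candidate& c : candidates) {
    auto pos = std::find(distinct.begin(), distinct.end(), median_of(c));
    c.median_rank = static_cast<int>(pos - distinct.begin()) + 1;
  }
}
```

Computing the table once over all LDcalc pairs keeps the per-variant work to a lookup and a small sort.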

martinreynaert commented 5 years ago

First test on server Black running with command-line:

reynaert@black:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ nohup /exp/sloot/usr/local/bin//TICCL-rank -t max --alph /reddata/PILOTS/MORSE/Aspell/eng.aspell.hyphen.dict.clip0.lc.chars --charconf /reddata/PILOTS/MORSE/Aspell/eng.aspell.hyphen.dict.clip0.ld2.charconfus -o /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.subtractartifrqfeature1.MEDIAN.ranked --debugfile /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/.RUNAMALGAM5.tsv.clean.ldcalc.subtractartifrqfeature1.MEDIAN.debug.ranked --subtractartifrqfeature1 1000000000 --clip 1 --skipcols=9,10,13 --charconfreq /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.subtractartifrqfeature1.ranked.chrconfreq /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.clean.ldcalc 2>/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.RANK.subtractartifrqfeature1.charconfreq.MEDIAN.20181204.stderr &

kosloot commented 5 years ago

@martinreynaert Small addition, as I reported: if you run TICCL-rank (on black) with the --ALTERNATIVE option, it calculates the median only over the frequencies of the CCs found per variant.

I see (minimal) differences.

Please let me know which approach we are going to choose.