emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License

Rounding of threshold can be a bit weird #39

Closed PonteIneptique closed 4 years ago

PonteIneptique commented 4 years ago

Hi there, I have just seen a weird situation, which probably comes down to rounding:

Epoch 20

pos

|                  | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all              | 0.9769   | 0.9287    | 0.9086 | 4147    |
| unknown-tokens   | 0.9198   | 0.8135    | 0.8348 | 187     |
| ambiguous-tokens | 0.9376   | 0.9008    | 0.8695 | 930     |
<TaskScheduler patience="5" factor="0.5" threshold="0" min_weight="0">
    <Task name="pos" steps="0" patience="6" threshold="0.001" target="True" 
            mode="max" weight="1.0" best="0.9769"/>
</TaskScheduler>
<LrScheduler lr="0.00056" lr_steps="0" lr_patience="2"/>

Epoch 22

|                  | accuracy | precision | recall | support |
|------------------|----------|-----------|--------|---------|
| all              | 0.9776   | 0.929     | 0.9231 | 4147    |
| unknown-tokens   | 0.9144   | 0.7081    | 0.7366 | 187     |
| ambiguous-tokens | 0.9409   | 0.9097    | 0.8829 | 930     |
<TaskScheduler patience="5" factor="0.5" threshold="0" min_weight="0">
    <Task name="pos" steps="2" patience="6" threshold="0.001" target="True" mode="max" 
            weight="1.0" best="0.9769"/>
</TaskScheduler>
<LrScheduler lr="0.00042" lr_steps="2" lr_patience="2"/>

Bug?

I'm posting it here for later review, but if this comes down to rounding, 0.9776 should still beat 0.9769 (0.978 vs. 0.977). But who knows :)

emanjavacas commented 4 years ago

This must be the effect of threshold. We check whether the new metric beats the previous best by at least the given threshold; if not, the new score is ignored (and thus doesn't register as an improvement). Perhaps a default of 0.001 is too strict.
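
For illustration, here is a minimal sketch of that kind of threshold comparison (the `is_improvement` helper and its signature are hypothetical, not pie's actual code), showing why 0.9776 does not register against a best of 0.9769 under a threshold of 0.001:

```python
def is_improvement(new, best, threshold=0.001, mode="max"):
    """Return True only if `new` beats `best` by at least `threshold`.

    Minimal sketch of a scheduler-style comparison, not pie's actual
    implementation. With mode="max", higher scores are better.
    """
    if mode == "max":
        return new - best >= threshold
    return best - new >= threshold

# The situation from the logs above: the gain is 0.0007 < 0.001,
# so the new score is ignored and `best` stays at 0.9769.
print(is_improvement(0.9776, 0.9769))  # False
print(is_improvement(0.9786, 0.9769))  # True (gain 0.0017 >= 0.001)
```

Under a check like this, the `best="0.9769"` still shown in the Epoch 22 scheduler state would be expected behaviour rather than a rounding bug.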

PonteIneptique commented 4 years ago

Thanks! I think you are right.