CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

WIP: Additional evaluators #184

Closed: Adamits closed this 1 month ago

Adamits commented 1 month ago

Addresses #183

I was working on evaluating with character error rate, and went down a small rabbit hole of how to support multiple eval metrics. I am very open to different proposals, especially given that I was not thinking about #175 when writing this (we may need to resolve the semantics of this, at the very least). Basically, the idea is that every eval metric should have some definition of a per-sample score (for accuracy this is binary), and we simply collect a list of these across samples, and again across batches, before reporting a micro-average.
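To make this concrete, here is a minimal sketch of the per-sample scheme described above; the class layout and method names are assumptions, not necessarily this PR's actual API:

```python
import numpy


class Evaluator:
    """Every metric defines a per-sample score; results are micro-averaged."""

    def score(self, prediction, target) -> float:
        raise NotImplementedError

    def evaluate(self, predictions, targets) -> float:
        # Collect per-sample scores across samples (and, in practice,
        # across batches), then report the micro-average.
        scores = [self.score(p, t) for p, t in zip(predictions, targets)]
        return float(numpy.mean(scores))


class AccuracyEvaluator(Evaluator):
    def score(self, prediction, target) -> float:
        # For accuracy, the per-sample score is binary.
        return float(prediction == target)
```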

There is some controller code that may be a bit clunky right now that entails reading the requested metrics from the command line, and then initializing those evaluators on the model. Then, during validation, we just loop over the requested evaluators and aggregate their metric.
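A rough sketch of that controller flow, building on the sketch above; the flag spelling and the name-to-class registry are assumptions:

```python
import argparse

# Hypothetical registry mapping metric names to evaluator classes.
EVALUATOR_REGISTRY = {"accuracy": AccuracyEvaluator}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--eval_metrics",
    nargs="+",
    choices=sorted(EVALUATOR_REGISTRY),
    default=["accuracy"],
)
args = parser.parse_args()

# Initialize the requested evaluators once, ...
evaluators = {name: EVALUATOR_REGISTRY[name]() for name in args.eval_metrics}


# ... then, during validation, loop over them and aggregate each metric.
def validation_metrics(predictions, targets) -> dict:
    return {
        name: evaluator.evaluate(predictions, targets)
        for name, evaluator in evaluators.items()
    }
```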

CER

In doing this, I also implemented a CEREvaluator. This is currently a bit rough: I just used torchmetrics for the actual metric, which I believe is implemented entirely in Python. Additionally, I did not bother decoding gold/predicted tensors back into strings and instead convert the list of integers into a string, where each integer is whitespace-separated from the next. I think this works fine conceptually, but of course we then cannot ensure that this is really character error rate; it is essentially whatever-you-define-one-symbol-to-be error rate, which in my case happens to be characters (or phonemes).
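A minimal sketch of that rough approach, using torchmetrics' CharErrorRate; as noted above, this measures an error rate over the string rendering of the symbol IDs (spaces and multi-digit IDs count as characters), not necessarily over true characters:

```python
import torch
from torchmetrics.text import CharErrorRate

metric = CharErrorRate()


def tensor_to_string(ids: torch.Tensor) -> str:
    # Render each integer symbol ID as a whitespace-separated token.
    return " ".join(str(i) for i in ids.tolist())


gold = torch.tensor([1, 2, 3])
pred = torch.tensor([1, 3])
print(metric([tensor_to_string(pred)], [tensor_to_string(gold)]))
```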

@kylebgorman I was hoping you might have strong opinions on how we should implement CER (we probably need to convert tensors to strings at each eval call, but if there are tricks such that we do not have to, that would be great).

Please also feel free to comment on the general framework here. Even just critique with no alternate proposal would be very helpful as I just basically followed how we have been doing eval, but I am happy to rethink the system at a higher level.

Adamits commented 1 month ago

Thanks for the comments.

I was writing you an email about related things, but might move that to here as it relates to some of your comments. Probably won't get back to addressing this until tomorrow though.

Adamits commented 1 month ago
  1. as many --eval_metrics as are available and you want (default: just accuracy; the other option is CER, and you can have both)
  2. one --checkpoint_metric (default: accuracy, but why not add CER there too, in addition to loss?)
  3. one --reduceonplateau_metric (default: loss; the other option is accuracy)
  4. one --patience_metric (default: loss; the other option is accuracy)

Any issues there? I put some notes below on the implementation.

You're right, it should work fine!

Adamits commented 1 month ago

I updated how we compute CER a bit and accidentally left it in an unintuitively named commit.

I just import the torchmetrics edit distance computation now (we could replace it with something faster later), and then define a _compute_cer that calls it once. This avoids the unnecessary overhead of converting to strings, and it expects a single comparison at a time.
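A sketch of what such a _compute_cer might look like; the torchmetrics import is confirmed by this thread, but the body and the normalization by gold length are assumptions (and a non-empty gold sequence is assumed):

```python
import torch
from torchmetrics.functional.text.helper import _edit_distance


def _compute_cer(preds: torch.Tensor, target: torch.Tensor) -> float:
    # _edit_distance operates on two lists of tokens; using the integer
    # symbol IDs directly avoids converting whole sequences to strings.
    preds_tokens = [str(i) for i in preds.tolist()]
    target_tokens = [str(i) for i in target.tolist()]
    distance = _edit_distance(preds_tokens, target_tokens)
    # Normalize by the length of the gold sequence.
    return distance / len(target_tokens)
```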

kylebgorman commented 1 month ago

I think numpy’s mean is fine. Anything to reduce complexity in EvalItem…

Adamits commented 1 month ago

Ok I think I have addressed the changes.

Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

kylebgorman commented 1 month ago

> Ok I think I have addressed the changes.
>
> Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

My suggestion there is to just "".join(symbols) since it feels like separators ought not to count for any sensible application of CER?
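For example, with hypothetical multi-character symbols:

```python
# Decoded symbols may each span more than one character.
symbols = ["sh", "i", "p"]
# Joining without separators means separators never count toward CER.
joined = "".join(symbols)  # -> "ship"
```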

Adamits commented 1 month ago

> > Ok I think I have addressed the changes. Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.
>
> My suggestion there is to just "".join(symbols) since it feels like separators ought not to count for any sensible application of CER?

Right, but we still need to get symbols somewhere (currently we just turn ints into chars, but if an int represents several chars we don't get CER), and I think it makes sense not to do this separately in each evaluator, but maybe it doesn't matter. Still, at the very least it requires giving evaluators access to the index.

kylebgorman commented 1 month ago

Yea, instead of = 1 it should be range(1, …). I tested and confirmed.


> In yoyodyne/evaluators.py (https://github.com/CUNY-CL/yoyodyne/pull/184#discussion_r1589461841):
>
>     import torch
>     from torch.nn import functional
>
>     +# from torchmetrics.text import CharErrorRate
>     +from torchmetrics.functional.text.helper import _edit_distance
>
> Ok, as a sanity check, if I do this instead:
>
>     table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
>     # table[1:, 0] = 1
>     # table[0, 1:] = 1
>     for i in range(idim):
>         table[i][0] = i
>     for j in range(jdim):
>         table[0][j] = j
>
> I reproduce the results. Off the top of my head I think this one is right, and your implementation might be for LCS? I will try to find references...
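For reference, a self-contained sketch of the edit distance with the initialization being sanity-checked in the quoted snippet; the recurrence is the textbook Levenshtein one, and the variable names follow the snippet:

```python
import numpy


def edit_distance(source, target) -> int:
    # Standard Levenshtein distance via dynamic programming.
    idim = len(source) + 1
    jdim = len(target) + 1
    table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
    # First column: deleting all i symbols of the source prefix.
    for i in range(idim):
        table[i][0] = i
    # First row: inserting all j symbols of the target prefix.
    for j in range(jdim):
        table[0][j] = j
    for i in range(1, idim):
        for j in range(1, jdim):
            if source[i - 1] == target[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                table[i][j] = 1 + min(
                    table[i - 1][j],      # deletion
                    table[i][j - 1],      # insertion
                    table[i - 1][j - 1],  # substitution
                )
    return int(table[idim - 1][jdim - 1])
```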


Adamits commented 1 month ago

> LGTM.
>
> I made one minor change: I made _edit_distance a static method. Could you do a very quick check to make sure that doesn't break anything before I merge?

Thanks! Ran my test and it seems to work fine.
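For illustration, the static-method change quoted above might look roughly like this; the class name and wrapper body are assumptions, and only the torchmetrics import is confirmed by the thread:

```python
from torchmetrics.functional.text.helper import (
    _edit_distance as _torchmetrics_edit_distance,
)


class CEREvaluator:
    @staticmethod
    def _edit_distance(preds: list, target: list) -> int:
        # A static method needs no instance state, so it can be called as
        # CEREvaluator._edit_distance(...) or on an instance.
        return _torchmetrics_edit_distance(preds, target)
```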