pietrolesci opened this issue 2 years ago
cc @stancld opinion on this?
I'm not so familiar with this kind of metrics.. How much do these metrics differ from standard classification ones? :] @pietrolesci
Hi @stancld,
I think it's not much different. The convenience of having sequence-level metrics already available is that:

- they can be fed sequences directly (without manual iteration);
- they can implement different evaluation "policies", e.g. "strict" vs "non-strict". For example,

  ```
  pred: [A, A, B]
  true: [A, B, B]
  ```

  can be considered either partially correct or entirely incorrect, which of course affects how results are aggregated. A practical example is in the README.md (see also the sketch after this list);
- they can make it easier to enforce particular encodings for the NER or POS tags (for example);
- last but not least, it would be nice to have these metrics in torchmetrics for consistency (i.e., no need to resort to other libraries/frameworks).
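For concreteness, here is a minimal sketch of the "strict" vs default policy using seqeval directly (the tag sequences and the IOB2 scheme below are illustrative, not taken from this thread):

```python
# Minimal sketch of "strict" vs default chunk evaluation with seqeval.
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "I-LOC"]]  # right entity type, wrong prefix

# Default (conlleval-style) policy: the dangling I-LOC still counts as a LOC chunk
print(f1_score(y_true, y_pred))                              # 1.0
# Strict policy with an explicit IOB2 scheme: the I-LOC chunk is rejected
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))  # ≈ 0.67
```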
Hi @pietrolesci, I get the motivation and think this might be a nice contribution to torchmetrics. 👍
As these metrics will very likely inherit from the classification ones, I'd just wait a bit with this addition until the ongoing classification refactor #1001 is finalized :]
Hi @pietrolesci -- I think I should be able to find some time in the near future to have a look at this class of metrics. However, I'm not fully familiar with the current state of tagging metrics. Do you think it would make more sense for our public API to accept something like Sequence[Sequence[str]], or is it better to use torch.Tensor here? (I think transformers models tend to output tensors, so that would make sense as well.) Also, we could support both options and make sure everything is converted to tensors internally (as long as this isn't too confusing at our public API). What do you think? :]
cc: @Borda @SkafteNicki
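Purely as a hypothetical sketch (not existing torchmetrics code), accepting both input types and converting everything to tensors internally could look roughly like this; the helper name and label mapping are made up for illustration:

```python
# Hypothetical sketch: accept either nested string tags or a tensor at the
# public boundary and convert to integer tensors internally.
from typing import Dict, Sequence, Union
import torch

def _tags_to_tensor(
    tags: Union[Sequence[Sequence[str]], torch.Tensor],
    label2id: Dict[str, int],
) -> torch.Tensor:
    """Map string tags to integer ids; pass tensors through unchanged."""
    if isinstance(tags, torch.Tensor):
        return tags
    return torch.tensor([[label2id[t] for t in seq] for seq in tags])

label2id = {"O": 0, "B-PER": 1, "I-PER": 2}
print(_tags_to_tensor([["B-PER", "I-PER", "O"]], label2id))  # tensor([[1, 2, 0]])
```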
I think it would be good to explore this direction; also, we could set up a quick call with @pietrolesci to get more context, and maybe he could give us some intro... :rabbit:
🚀 Feature
Support for sequence tagging evaluation metrics à la seqeval. That is, support the evaluation of the performance of chunking tasks such as named-entity recognition, part-of-speech tagging, semantic role labeling, and so on.
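To make the request concrete, this is the kind of chunk-level evaluation seqeval provides (the tag sequences are illustrative):

```python
# Chunk-level evaluation with seqeval; tag sequences are illustrative.
from seqeval.metrics import classification_report

y_true = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["O", "B-PER", "I-PER", "O", "O"]]

# Precision/recall/F1 are computed over entity chunks, not individual tokens
print(classification_report(y_true, y_pred))
```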