Closed justusschock closed 4 years ago
Hi! Thanks for your contribution! Great first issue!
I like the structure...
Dividing only into research areas would mean duplication of some metrics; for example, accuracy is used more or less within all fields. I think it would be better to mainly divide into a regression and a classification subpackage, depending on the targets being continuous or discrete. Specific metrics (like BLEU in NLP) could be in research-specific subpackages.
I agree, and these metrics (like accuracy) would not fall in any of these but remain in the base package.
I don't want to divide them into regression and classification and also have subpackages for all the research areas, as it may become non-trivial to find the desired metric.
Another thing we could think of is not having subpackages at all, but just one metrics package containing them all (just like torch.nn).
I would rather avoid deep metric structures; one level is enough... So we can have general-purpose metrics like accuracy and the domain-specific ones =) And then have them all imported from the root metrics init...
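The flat, one-level layout with root re-exports could look something like this (a hypothetical sketch; the module and metric names are illustrative, not the actual package):

```
metrics/
    __init__.py        # re-exports everything, e.g. from .nlp import BLEU
    classification.py  # general purpose: Accuracy, F1, ...
    regression.py      # MSE, MAE, ...
    nlp.py             # domain specific: BLEU, ROUGE, ...
    vision.py          # IoU, panoptic quality, ...
```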
CV: panoptic quality and IoU
Augmentation: affinity and diversity
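For reference, IoU on axis-aligned boxes reduces to a few lines (a minimal stdlib sketch; the function name and the (x1, y1, x2, y2) box convention are my own choices, not from this thread):

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle: overlap of the two coordinate ranges.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```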
Some metrics may even be dataset-specific, e.g., the F1 score for SQuAD (there is some preprocessing and there are special rules involved). For these kinds of less general metrics, I think there should be a base Metric class for people to inherit from and create their own.
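Such a base class could be a thin accumulate-then-compute interface (a plain-Python sketch under my own assumptions; in the actual package it would presumably subclass torch.nn.Module, and all names here are illustrative):

```python
from abc import ABC, abstractmethod

class Metric(ABC):
    """Hypothetical base class: accumulate over batches, then compute."""
    @abstractmethod
    def update(self, preds, targets):
        """Fold one batch of predictions/targets into internal state."""
    @abstractmethod
    def compute(self):
        """Return the metric value over everything seen so far."""
    def reset(self):
        """Clear internal state (optional for subclasses)."""

class Accuracy(Metric):
    """Example subclass: running correct/total counts."""
    def __init__(self):
        self.correct = 0
        self.total = 0
    def update(self, preds, targets):
        self.correct += sum(p == t for p, t in zip(preds, targets))
        self.total += len(targets)
    def compute(self):
        return self.correct / self.total
    def reset(self):
        self.correct = self.total = 0
```

A dataset-specific metric like SQuAD F1 would then live in user code as just another subclass with its own preprocessing inside `update`.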
For some reference, this is how I implement mine, and this is from PyTorch Ignite.
Also, should losses be considered as some type of metrics?
@haotongye
I would say that we shouldn't include dataset-specific metrics here.
But I agree, we should have a base Metric class (probably just a torch.nn.Module with some extras). This will, however, be hard for the functional interface.
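One way to reconcile the two is to keep the functional form primary and make the class a thin wrapper over it (a sketch; the names are illustrative and the eventual interface may well differ):

```python
def mean_squared_error(preds, targets):
    """Functional form: a pure computation on the full inputs."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

class MSE:
    """Class form (in the real package this would subclass the base
    Metric / torch.nn.Module); it simply forwards to the functional
    implementation, so both interfaces stay in sync."""
    def __call__(self, preds, targets):
        return mean_squared_error(preds, targets)
```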
For now, I wouldn't include losses, as this would really broaden the scope. Maybe we can do this afterwards in a separate effort.
@seandatasci Can you link a paper or reference implementation for the affinity and diversity part? AFAIK there are several ways to calculate these...
some requested:
metrics for continuous output:
would be nice to have:
However, as far as I know, the last three require access to the full list of targets and predictions at once, so they can only be used for smaller datasets.
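That distinction matters for the interface: running-aggregate metrics need only a few accumulators, while others must buffer every prediction before computing. A sketch of the two patterns (metric choices and names are my own, picked only to illustrate the memory difference):

```python
import statistics

class RunningMSE:
    """Constant memory: only running sums are kept across batches."""
    def __init__(self):
        self.sq_err = 0.0
        self.n = 0
    def update(self, preds, targets):
        self.sq_err += sum((p - t) ** 2 for p, t in zip(preds, targets))
        self.n += len(targets)
    def compute(self):
        return self.sq_err / self.n

class BufferedMedianAE:
    """Must store everything: a median can't be updated incrementally,
    so memory grows with the dataset size."""
    def __init__(self):
        self.errors = []
    def update(self, preds, targets):
        self.errors += [abs(p - t) for p, t in zip(preds, targets)]
    def compute(self):
        return statistics.median(self.errors)
```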
As I mentioned in the tweet, for NLG this repo could be integrated directly (?)
If planning to also include support for Vision & Language tasks such as VQA/VisDial etc., which are mostly posed as discriminative tasks, R@{1,5,10} / MRR / NDCG can also be used. One nice implementation by Pythia here.
Let me know if I can help! Thanks.
@shubhamagarwal92 We will probably have to adjust the metrics for NLG according to our upcoming metrics interface, but other than that it should be fine. If you want to, you can take this once we have our interface running (probably tomorrow).
@justusschock https://arxiv.org/abs/2002.08973
Let me know if I can help! Thanks.
Help is always welcome =)
As discussed in #973 , we will probably start by implementing metrics as standalones.
This issue aims to discuss which metrics we need and how we can implement them in a package structure.
Suggestions welcome.
My initial thought was to have a metrics package with subpackages for each research area like vision, text, audio etc.
CC @srush @Borda @williamFalcon @Darktex
As a start: For vision I'd like to have the following: