crim-ca / mlm-extension

Machine Learning Model STAC Extension
https://crim-ca.github.io/mlm-extension/
Apache License 2.0

Add metadata about model training #35

Open jonas-hurst opened 1 week ago

jonas-hurst commented 1 week ago

:rocket: Feature Request

Add metadata to the mlm-extension that describes training results

:sound: Motivation

As a researcher and model user, I want to know about the training results of the model, so that I can properly assess the model before using it in my application.

:satellite: Alternatives

Put this into writing in the model output description.

:paperclip: Additional context

Add loss and accuracy measures for both training and validation datasets, ideally per epoch. Minimal example of training for three epochs:

"mlm:training": {
        loss_function: "CrossEntropyLoss",
        loss_training: [0.4, 0.2, 0.175],
        loss_validation: [0.45, 0.22, 0.19],
        accuracy_training: [0.6, 0.8, 0.9],
        accuracy_validation: [0.6, 0.79, 0.85]
}

Please give some feedback about other training info or metrics that could be included here. I am happy to submit a PR if this is desirable.

rbavery commented 5 days ago

Hi Jonas, thanks for raising this issue. I agree it's important to have a standard for sharing model evaluation results and training loss curves or other epoch-wise metrics.

I think the scope of this could get pretty large, so maybe we should discuss it further with @fmigneault and gather more community feedback before opening a PR with the intent to merge.

There are a lot of different loss functions, sometimes multiple losses are used to train a single model (e.g., Mask R-CNN), and it's common to evaluate a model against multiple metrics. In the minimal example you provided as a starting point, it's not clear which evaluation metric is being used, so maybe there should be an additional field for the metric, and "mlm:training" could point to an array of training result objects rather than a single object.
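A rough sketch of what that restructuring might look like, reusing the values from the minimal example plus illustrative field names that are not (yet) part of the extension:

"mlm:training": [
    {
        "loss_function": "CrossEntropyLoss",
        "metric": "accuracy",
        "loss_training": [0.4, 0.2, 0.175],
        "loss_validation": [0.45, 0.22, 0.19],
        "metric_training": [0.6, 0.8, 0.9],
        "metric_validation": [0.6, 0.79, 0.85]
    },
    {
        "loss_function": "CrossEntropyLoss",
        "metric": "f1_score",
        "metric_validation": [0.58, 0.76, 0.83]
    }
]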

There are some other projects to draw inspiration from and integrate with for reporting training results. We could follow MLflow's or TensorBoard's example for reporting losses if they use any default keys. Or, if this is handled at the ML framework level, we could follow Keras' or PyTorch's example. These could provide a set of choices in the documentation, but we should probably leave the field open-ended in the schema to allow new metrics and loss functions.

https://www.tensorflow.org/tensorboard/scalars_and_keras
https://mlflow.org/docs/latest/tracking.html

To this minimal example I would add an object for summary metrics on one or more test sets and fields describing the test set. These metrics could be related to the name of the model referenced in a model asset or source code asset.
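As a purely illustrative sketch (neither the field name nor its structure is defined in the extension), summary metrics for a single test set could look something like:

"mlm:evaluation": [
    {
        "dataset": "holdout-test-2019",
        "description": "Held-out test scenes, not seen during training or validation",
        "metrics": {
            "accuracy": 0.84,
            "f1_score": 0.81
        }
    }
]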

fmigneault commented 5 days ago

I agree that this information is important. The employed loss function is just as relevant as the number of epochs, the batch size, or any other parameter that modifies the data, the model configuration, or the conditions/decisions/rules affecting the training pipeline. Given that, my first intuition would be to insert metadata about loss functions into mlm:hyperparameters. Doing a hyperparameter search by loss function would be just as valid as doing, for example, cross-validation with different random states. Therefore, it seems appropriate to regroup all of this metadata together.
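Since mlm:hyperparameters is already an open-ended object, recording the loss function there could be as simple as the following sketch (the key names are only illustrative, as the field is free-form):

"mlm:hyperparameters": {
    "loss_function": "CrossEntropyLoss",
    "epochs": 3,
    "batch_size": 32,
    "learning_rate": 0.001,
    "random_state": 42
}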

My main concern about creating a separate field like mlm:training, as pointed out by @rbavery, is that there are a lot of ways loss functions can be combined, and each ML framework has its own subtleties and specific configurations. Therefore, mlm:training would most probably end up being an "any" JSON object to accommodate all possible combinations, which is what mlm:hyperparameters already offers. Also, given that some hyperparameters are sometimes used by the loss functions themselves (e.g., an epsilon value), the information risks being placed inconsistently in one field or the other. Code/APIs trying to resolve where certain hyperparameters are defined would have to deal with a convoluted mixture of mlm:hyperparameters and mlm:training.
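To illustrate the ambiguity with a hypothetical mlm:training field, a loss-specific parameter such as an epsilon value could plausibly be recorded in either place:

"mlm:hyperparameters": {
    "learning_rate": 0.001,
    "epsilon": 1e-8
}

or, just as plausibly:

"mlm:hyperparameters": {
    "learning_rate": 0.001
},
"mlm:training": {
    "loss_function": "CrossEntropyLoss",
    "epsilon": 1e-8
}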

That being said, where a separate field does make sense is for reporting the metrics measured during training/validation/testing. Regarding that, I share the concerns mentioned by @rbavery: the definition should clearly indicate which metric is presented. However, even then, the name alone is insufficient. The definition should provide even more details, some of which (non-exhaustive) could be:

I've also had similar discussions about metrics reporting and model quality with the OGC TrainingDML-AI SWG as recently as today, so this is definitely something of interest in the near future, but the specific structure to represent these metrics needs more reflection. There is also the possibility that this becomes a separate extension of its own, which would allow annotating metrics/quality for derived collections that employed the MLM at inference time.