Open antoine-tran opened 1 week ago
I think we might need to reconsider the generic metrics API in fairseq2 before merging this PR. Right now it pretty much uses torcheval which apparently uses a different approach to syncing metrics. I have a fundamental question though. In your use case, how do you combine fairseq2 metrics and HG evaluate? Looks like HG has some similar APIs like combine
that acts like MetricBag
. Which approach does make more sense in your opinion (1) using the two APIs in parallel in your code or (2) abstracting away HG under fairseq2.metrics (i.e. this PR)? If the latter, we might consider defining a Metric
interface in fairseq2 (instead of using torcheval'a Metrics
interface) and have derived interfaces for torcheval and HG that fit better to the corresponding libraries.
I think we might need to reconsider the generic metrics API in fairseq2 before merging this PR. Right now it pretty much uses torcheval which apparently uses a different approach to syncing metrics.
I think the decision of binding fairseq2 to torcheval is a good one. It is a thin, low-level library to help developing further libraries on top. HG on the other hand targets ready-to-use use cases and it makes sense to hide the synchronization logic inside.
I have a fundamental question though. In your use case, how do you combine fairseq2 metrics and HG evaluate? Looks like HG has some similar APIs like
combine
that acts likeMetricBag
. Which approach does make more sense in your opinion (1) using the two APIs in parallel in your code or (2) abstracting away HG under fairseq2.metrics (i.e. this PR)? If the latter, we might consider defining aMetric
interface in fairseq2 (instead of using torcheval'aMetrics
interface) and have derived interfaces for torcheval and HG that fit better to the corresponding libraries.
Good questions. Until now I always followed (1) to have a separate metrics API in evaluation tasks. It served well my use cases (pure evaluation), and now I am thinking of using the evaluation within the training loop. Having the same MetricBag API helps me "register" the states and continue later.
@cbalioglu Maybe we can indeed wait a bit for the use case to be clearer and gather more requirements before merging this PR.
What does this PR do? Please describe:
This PR adds a wrapper to HuggingFace's
evaluate.Metric
to make it compatible to fairseq2.metrics APIs. This enables evaluating fairseq2 model on many downstream tasks available in HuggingFace, using standard evaluation metrics such ascomet
,bleu(t)
,bertscore
, etc.Small changes in fairseq2.metrics:
evaluate.Metric
has internal support for multi-node evaluation, where the incremental metric updating code run in separate nodes (and writing results to temporary PyArrow tables), after that the finalcompute()
is done on rank 0 only, pulling results from the tables. To adapt this tofairseq2.metrics.bag.sync_and_compute_metrics()
, we introduce theauto_sync
attribute in the MetricBag, that will turn on if it has HF metrics. In other words, we leave the synchronization to the underlying metrics and not in the bag itself.One caveat is that we can not mix the HF and non-HF metrics within one bag, but put them in separate bags.
Fixes #{issue number}
Does your PR introduce any breaking changes? If yes, please list them: List of all backwards-incompatible changes.
Check list: