IBM / unitxt

🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
https://www.unitxt.ai
Apache License 2.0

Exponential time complexity when confidence interval calculation is enabled #1008

Open jezekra1 opened 1 month ago

jezekra1 commented 1 month ago

The code linked below will call a full-dataset metric computation for each row, i.e. the work is duplicated num_rows times.

This is a big issue for HuggingFace metrics such as metrics.meteor, which will run for 3 hours instead of 10 seconds.

https://github.com/IBM/unitxt/blob/cd3e483e33d1bb36b602a9ebdcc9ccf3030395d8/src/unitxt/metrics.py#L420

You can replicate it by running metrics.meteor on a large dataset with confidence interval computation enabled.
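
(Editorial sketch, not unitxt code: a rough way to see the scale of the problem is to time a single full-dataset Meteor pass via the HuggingFace evaluate library and multiply it by the number of bootstrap resamples the CI computation will trigger; the dummy predictions and references below are made up.)

    # Rough lower-bound estimate: one full-dataset METEOR pass, times the number
    # of bootstrap resamples triggered by confidence interval computation.
    import time

    import evaluate  # HuggingFace evaluate library

    predictions = ["the cat sat on the mat"] * 5000          # dummy data
    references = [["a cat was sitting on the mat"]] * 5000   # dummy data

    meteor = evaluate.load("meteor")

    start = time.perf_counter()
    meteor.compute(predictions=predictions, references=references)
    single_pass = time.perf_counter() - start

    n_resamples = 100  # unitxt default for global metrics, see the discussion below
    print(f"one pass: {single_pass:.1f}s; "
          f"CI adds at least {single_pass * n_resamples / 60:.1f} min on top of it")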

elronbandel commented 1 month ago

@dafnapension can you have a look if you know how to fix?

dafnapension commented 1 month ago

Hi @elronbandel, and thanks for assigning this to me. An interesting issue. I have a few questions.

(0) As I understand the metric, it should be an Instance metric, not a Global one, shouldn't it? Why did the run reach compute_global_confidence_intervals at all? Because class HuggingfaceMetric(GlobalMetric). Why? At the very least, was GlobalMetric's process_single_instances = True turned to False?

(1) When we tried to be overly clever ("iber_khukhem") and compute global metrics by hand, aiming at one pass over the instances, including their (simulated) resamples, @yoavkatz claimed that we are talking about a few hundred instances, which is why we can easily accommodate them all, including their resamples, in main memory, so what's the big deal. That is, the number of instances for which the metric is computed is dramatically smaller than the number of instances that started their way down the unitxt pipeline. Is this assumption still valid? How many instances participate in meteor's metric computation? Can you tell?

(2) By definition, the confidence interval is computed by generating n_resamples resamples, each the length of the number of instances over which the metric is computed (and hence assumed to be a few hundred, per the above), as clearly explained in lines 404-405 (link below). So what is the surprise? Did he play with n_resamples to clarify the issue?
https://github.com/IBM/unitxt/blob/cd3e483e33d1bb36b602a9ebdcc9ccf3030395d8/src/unitxt/metrics.py#L404

By default, for global metrics, n_resamples =

https://github.com/IBM/unitxt/blob/cd3e483e33d1bb36b602a9ebdcc9ccf3030395d8/src/unitxt/metrics.py#L466

which is:
settings.num_resamples_for_global_metrics = (int, 100)

So the overall time should not be multiplied by more than 100. Did he change n_resamples?
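
(A hedged sketch for checking point (2): get_settings comes from unitxt's settings_utils and the attribute name from the snippet above, but the exact override mechanism may differ between versions.)

    # Inspect, and temporarily lower, the resample count for global metrics in
    # order to isolate its contribution to the runtime (API assumed, may vary).
    from unitxt.settings_utils import get_settings

    settings = get_settings()
    print(settings.num_resamples_for_global_metrics)   # 100 by default, per the snippet above

    settings.num_resamples_for_global_metrics = 10     # cheaper runs while debugging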

(3) I read somewhere that numpy.apply_along_axis is not fully optimized, for the sake of being general. I can try replacing it with a hand-made piece of code, but I do not put too much hope in this shrinking hours down to seconds.

(4) Do you think there might be a problem with scipy.bootstrap?
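
(Editorial note on point (3): numpy.apply_along_axis does loop in Python over the slices, but that overhead only matters when the per-slice function is cheap; here the per-slice function is the metric itself, so hand-rolling the loop is unlikely to help much. A small, self-contained illustration with arbitrary shapes:)

    # apply_along_axis calls the function once per row in a Python-level loop;
    # the vectorized form does a single C-level reduction. When the per-row
    # function is an expensive metric, the loop overhead is negligible anyway.
    import numpy as np

    scores = np.random.rand(100, 500)   # e.g. n_resamples x n_instances

    looped = np.apply_along_axis(np.mean, -1, scores)   # ~100 Python-level calls
    vectorized = scores.mean(axis=-1)                   # one vectorized call

    assert np.allclose(looped, vectorized)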

yoavkatz commented 1 month ago

I'm not sure there is any bug in the code. I believe the problem is that when a single metric calculation of a global metric takes a few seconds, it is multiplied by 100 when using confidence intervals.

I'm not sure why Meteor is slow. I think the reason Rouge is slow is that it does bootstrapping itself in its code:

https://huggingface.co/spaces/evaluate-metric/rouge/blob/main/rouge.py#L134

I think the solution is to avoid using aggregation, which is something I'm working on.

In any case, I recommend running with profiling 'python -m cProfile -s cumtime xx.py' and seeing the number of calls and what takes time.
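
(A minimal sketch of this profiling approach, done programmatically; run_eval is a hypothetical stand-in for whatever drives the unitxt evaluation, and the "metrics.py" filter just narrows the report to unitxt's metric code.)

    # Profile an evaluation run and report only the heaviest calls in metrics.py.
    import cProfile
    import pstats

    def run_eval():
        # Hypothetical stand-in: load the dataset, run inference, call evaluate().
        pass

    profiler = cProfile.Profile()
    profiler.enable()
    run_eval()
    profiler.disable()

    pstats.Stats(profiler).sort_stats("cumtime").print_stats("metrics.py", 20)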

dafnapension commented 1 month ago

Meteor, inside HuggingFace:

    def _compute(self, predictions, references, alpha=0.9, beta=3, gamma=0.5):
        multiple_refs = isinstance(references[0], list)
        if NLTK_VERSION >= version.Version("3.6.5"):
            # the version of METEOR in NLTK version 3.6.5 and earlier expect tokenized inputs
            if multiple_refs:
                scores = [
                    meteor_score.meteor_score(
                        [word_tokenize(ref) for ref in refs],
                        word_tokenize(pred),
                        alpha=alpha,
                        beta=beta,
                        gamma=gamma,
                    )
                    for refs, pred in zip(references, predictions)
                ]
            else:
                scores = [
                    meteor_score.single_meteor_score(
                        word_tokenize(ref), word_tokenize(pred), alpha=alpha, beta=beta, gamma=gamma
                    )
                    for ref, pred in zip(references, predictions)
                ]
        else:
            if multiple_refs:
                scores = [
                    meteor_score.meteor_score(
                        [[word_tokenize(ref) for ref in group] for group in references][0],
                        word_tokenize(pred),
                        alpha=alpha,
                        beta=beta,
                        gamma=gamma,
                    )
                    for ref, pred in zip(references, predictions)
                ]
            else:
                scores = [
                    meteor_score.single_meteor_score(ref, pred, alpha=alpha, beta=beta, gamma=gamma)
                    for ref, pred in zip(references, predictions)
                ]

        return {"meteor": np.mean(scores)}

yoavkatz commented 1 month ago

Looking at a run of 100 examples with Rouge, we see the metric being processed 101 times as expected (once for each example, and then once on the whole 100). This led to 1 call to compute_global_confidence_intervals, which runs on the full set of examples, and to 3 calls to apply_along_axis. These then added about 200 calls to compute.

We need to investigate why there are multiple calls to apply_along_axis when we are calculating the CI on a single score. However, the main problem remains that a single invocation takes 0.269 seconds, and doing it about 200 times means it takes about 54 seconds:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  101    0.001    0.000   55.089    0.545 metrics.py:473(process)
  302    0.004    0.000   55.021    0.182 metrics.py:540(_compute)
  302    0.003    0.000   55.016    0.182 metrics.py:1722(compute)
    1    0.001    0.001   54.017   54.017 metrics.py:397(compute_global_confidence_intervals)
    1    0.000    0.000   54.015   54.015 _resampling.py:251(bootstrap)
    3    0.000    0.000   54.009   18.003 metrics.py:403(statistic)
    3    0.002    0.001   54.009   18.003 shape_base.py:267(apply_along_axis)
  201    0.001    0.000   54.006    0.269 metrics.py:421(<lambda>)
  201    0.000    0.000   54.001    0.269 metrics.py:406(metric)
  302    0.006    0.000   50.633    0.168 metrics.py:1195(compute)

dafnapension commented 1 month ago

I think, therefore, that for such a metric, with unitxt's HuggingfaceMetric wrapper being itself a GlobalMetric, and with GlobalMetric's process_single_instances = True (the default), the end result is that each and every instance is computed twice, and this for each resample.

dafnapension commented 1 month ago

Rouge is another good example: why invoke Rouge for each and every instance, when HF, invoked once over the whole stream, already does the computation for each and every instance? It can even return these per-instance scores if, as you said, we always set use_aggregator=False.
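
(An illustrative call against the HuggingFace evaluate API, with made-up inputs: with use_aggregator=False, Rouge skips its internal bootstrap aggregation and returns one score per prediction/reference pair from a single whole-stream invocation.)

    # Per-instance Rouge scores from one call, no internal bootstrap aggregation.
    import evaluate

    rouge = evaluate.load("rouge")
    results = rouge.compute(
        predictions=["the cat sat on the mat", "hello world"],    # toy inputs
        references=["a cat sat on the mat", "hello there world"],
        use_aggregator=False,
    )
    print(results["rouge1"])   # a list with one rouge1 score per instance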

Meteor does not return the per-instance scores, but rather the global simple average thereof (as I pasted above). Here, I think, unitxt should turn Meteor into an InstanceMetric rather than the GlobalMetric it is today.

At any rate, I think turning such HF metrics into InstanceMetrics can save much time, computing the CI on already-given scores rather than employing the heavy language models on each instance again and again.
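
(A sketch of the instance-metric direction described above, with made-up scores: the expensive metric is computed once per instance, and the bootstrap only resamples the cheap aggregation over the stored scores, so the underlying models are never re-run.)

    # Confidence interval over precomputed per-instance scores: only the cheap
    # mean is resampled; no model or tokenizer runs inside the bootstrap loop.
    import numpy as np
    from scipy.stats import bootstrap

    instance_scores = np.random.rand(500)   # pretend: per-instance METEOR scores

    result = bootstrap(
        (instance_scores,),
        np.mean,                  # aggregation only
        n_resamples=1000,         # unitxt default for instance metrics, per the thread below
        confidence_level=0.95,
        method="BCa",
    )
    print(result.confidence_interval)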

arielge commented 1 month ago

@dafnapension @yoavkatz @elronbandel I have just encountered this myself and reviewed the code with @assaftibm. Note that a major part of the issue is with the confidence interval calculation method in scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html). The default currently used is a method called BCa. For global metrics, when this method is used the metric is run not just n_resamples times but an additional number of times that grows with the total number of samples (for BCa's jackknife correction). This means that for a large dataset the metric can be called a huge number of times. This can be resolved by switching to a more naive CI calculation method like percentile, which may be less accurate but reduces the runtime immensely.
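
(A small counting sketch, with a toy statistic and arbitrary sizes, to make this concrete: for a non-vectorized statistic, "percentile" costs roughly n_resamples calls, while "BCa" adds roughly one extra call per instance for its jackknife-based correction; with 100 instances and 100 resamples this matches the 201 calls seen in the profile earlier in the thread.)

    # Count how many times scipy's bootstrap invokes the statistic per CI method.
    import numpy as np
    from scipy.stats import bootstrap

    data = np.random.rand(100)    # 100 "instances"
    calls = {"n": 0}

    def statistic(sample):        # non-vectorized, like a global metric on one resample
        calls["n"] += 1
        return np.mean(sample)

    for method in ("percentile", "BCa"):
        calls["n"] = 0
        bootstrap((data,), statistic, n_resamples=100, vectorized=False, method=method)
        print(method, calls["n"])  # roughly 100 for percentile vs. roughly 200 for BCa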

yoavkatz commented 1 month ago

In #1011, we fix Rouge and Meteor by moving away from global metrics when they are not needed. For F1, for example, it may be important to use the global CI calculation.

yoavkatz commented 1 month ago

In any case, using a different CI method should be something we consider applying automatically (users cannot be involved in this decision).

dafnapension commented 1 month ago

Thank you, @arielge, @assaftibm, and @yoavkatz. As @yoavkatz mentioned, we tried to use score_based_confidence_interval rather than compute_global_confidence_intervals wherever possible, depending on the "globality" of the metric involved. For the Meteor metric, around which the current "crisis" broke out, the scoring of each instance is computationally intensive, and it turns out that nothing more than the instance scores is needed for the computation of the Meteor global score. Hence the move to score_based_confidence_interval, followed by "home made" averaging (rather than letting HuggingFace go the whole way), saves us from (apparently unnecessary) repeated computations of instance scores.

Only from you did I learn about these extra computations along the way. For a score_based_confidence_interval with an input of 15 instances, I noticed 1000 invocations of the statistic (a simple aggregation over instance scores, in this case) over 1000 resamples (the default for InstanceMetric), followed by 2 separate invocations over 15 instances each. I never knew. I should have suspected something based on the statistic input arg of bootstrap, but I never really did until now. Are these extra invocations always rather 'modest', compared to the big cost of the n_resamples?

arielge commented 1 month ago

@dafnapension the extra computations are a direct function of the number of instances, so if you run on many instances they are not modest at all. This is the behavior of scipy's bootstrap, so it should be the same for the global and score-based functions. I therefore think we must either always pass method="percentile" to the bootstrap() call, or pass it when the number of instances passes some threshold. The latter probably makes more sense, and not just in terms of runtime: when there are few instances it may be more important to use the more sophisticated "BCa" method of calculating the CIs.

dafnapension commented 1 month ago

Thank you, @arielge and @assaftibm, for this important lesson! I added the method change, for the case of more than 100 items to sample from, to this PR.
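
(A sketch of the kind of switch described here; the helper name is hypothetical and the 100-item threshold is taken from the comment above, so the actual unitxt implementation may differ.)

    # Choose the CI method by sample size: BCa for small inputs, where its extra
    # jackknife calls are cheap and its accuracy matters most; percentile otherwise.
    import numpy as np
    from scipy.stats import bootstrap

    def ci_method_for(n_instances, threshold=100):
        # Hypothetical helper; unitxt's actual logic may differ.
        return "BCa" if n_instances <= threshold else "percentile"

    scores = np.random.rand(5000)               # many instances
    method = ci_method_for(len(scores))         # -> "percentile"
    result = bootstrap((scores,), np.mean, n_resamples=1000, method=method)
    print(method, result.confidence_interval)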