deepcharles / ruptures

ruptures: change point detection in Python
BSD 2-Clause "Simplified" License

perf: Rand index computation #222

Closed Lucas-Prates closed 2 years ago

Lucas-Prates commented 2 years ago

Greetings.

I believe there is a minor mistake in the scaling done in the hamming method; it should be `n_samples * (n_samples - 1)`.

I also provided an additional implementation of the Rand index under the method `randindex_cpd`. This implementation is based on a specific expression for the metric in change-point problems. If we have N samples and change-point sets of sizes r and s, the algorithm runs in O(r+s) time and O(1) memory. The traditional implementation runs in O(rs+N) time and O(rs) memory, although your implementation might use less due to sparsity. In some profiling I did, the new implementation performs several orders of magnitude better, depending on N.
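For illustration, here is a minimal sketch of such an O(r+s) computation, assuming sorted breakpoint lists whose last entry is the number of samples; the function name `rand_index_cpd` and the portable `pairwise` stand-in are illustrative, and the actual PR code may differ:

```python
from itertools import islice, tee


def pairwise(iterable):
    # itertools.pairwise exists from Python 3.10; this is a portable stand-in
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def rand_index_cpd(bkps1, bkps2):
    """Rand index between two segmentations given as sorted breakpoint
    lists whose last entry is the number of samples."""
    n_samples = bkps1[-1]
    disagreement = 0
    beginj = 0  # first segment of bkps2 that can still overlap
    for start1, end1 in pairwise([0] + list(bkps1)):
        for start2, end2 in islice(pairwise([0] + list(bkps2)), beginj, None):
            # number of samples shared by the two segments
            nij = max(min(end1, end2) - max(start1, start2), 0)
            disagreement += nij * abs(end1 - end2)
            if end1 < end2:
                break  # later segments of bkps2 start after end1
            beginj += 1
    return 1.0 - disagreement / (n_samples * (n_samples - 1) / 2)
```

Identical segmentations give 1.0, e.g. `rand_index_cpd([2, 4], [2, 4])`, while two samples split in one segmentation and merged in the other give 0.0, e.g. `rand_index_cpd([1, 2], [2])`.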

I can provide the proof of correctness of the algorithm if requested.

Thanks.

codecov[bot] commented 2 years ago

Codecov Report

Merging #222 (d88ecec) into master (2f2a080) will increase coverage by 0.00%. The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #222   +/-   ##
=======================================
  Coverage   98.76%   98.77%           
=======================================
  Files          40       40           
  Lines         972      978    +6     
=======================================
+ Hits          960      966    +6     
  Misses         12       12           
Flag Coverage Δ
unittests 98.77% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/ruptures/metrics/hamming.py 100.00% <100.00%> (ø)
src/ruptures/metrics/randindex.py 100.00% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2f2a080...d88ecec.

deepcharles commented 2 years ago

Many thanks for this very useful PR.

The implementation of Rand index is indeed quite naive. I will take a look.

deepcharles commented 2 years ago

Am I correct in assuming that the code you provided implements this report? I am still looking at the pseudo-code.

Lucas-Prates commented 2 years ago

> Am I correct in assuming that the code you provided implements this report? I am still looking at the pseudo-code.

Yes, I am the author of the report, and the code is an implementation of that pseudo-code. Thank you for all your observations/corrections; I apologize for the mess. The implementation using `islice` is much cleaner; I had tried using `pairwise` but did not know how to skip iterations.
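For reference, a small sketch of how `islice` can skip the first iterations of the iterator returned by `pairwise` without slicing the underlying list (the breakpoint values are made up for illustration):

```python
from itertools import islice, tee


def pairwise(iterable):
    # itertools.pairwise exists from Python 3.10; portable stand-in
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


bkps2_with_0 = [0, 100, 250, 500]
beginj = 1
# Skip the first `beginj` segment pairs without copying the list.
remaining = list(islice(pairwise(bkps2_with_0), beginj, None))
# remaining == [(100, 250), (250, 500)]
```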

Lucas-Prates commented 2 years ago

For the sake of completeness, I will provide an observation on the run time of the algorithms.

I ran a simple simulation, and the figure below compares their performance as the sample size increases. `ri_clean` denotes the algorithm after the corrections, `ri_hotmess` the algorithm from the initial PR, and `ri_ham` the original implementation.

runtime_comp

For large sample sizes, e.g. N > 1e5, `ri_clean` takes much more time than `ri_hotmess`, and I do not understand why. I did some profiling, and the two algorithms are almost identical: they perform the exact same number of loop iterations, and their memory usage (as reported by the profiler) is the same. My guess is that the discrepancy comes from the evaluation of `[0] + bkps1` and `[0] + bkps2` in the argument of the `pairwise` function. Is Python copying the list? If so, other methods using this construction might suffer from the same problem. But of course, this last part is speculation.
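As a side note, `[0] + bkps1` does build a new list every time it is evaluated, copying all element references; a quick sketch to convince oneself (sizes here are arbitrary):

```python
bkps = list(range(10))
with_zero = [0] + bkps

# Concatenation creates a new object; the original list is untouched.
assert with_zero is not bkps
assert with_zero[0] == 0 and with_zero[1:] == bkps

# The copy is O(len(bkps)), but if it is evaluated only once per call
# it should not, by itself, change the asymptotic behaviour.
```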

deepcharles commented 2 years ago

Thanks for this detailed information. Indeed, it is surprising that `ri_clean` performs so much worse.

Can you tell us if using temporary variables (e.g. `bkps1_with_0 = [0] + bkps1`) solves the problem? Also, is `n_samples` the size of `bkps1`/`bkps2`?

Lucas-Prates commented 2 years ago

I did try to use temporary variables, but the effect seems to persist, as you can see below

runtime_comparison_w0

`ri_clean_w0` refers to the version using temporary variables. I also omitted the original implementation.

Regarding your question about `n_samples`: if you are referring to the variable in the code, then no; it is the number of samples in the data set, which should be the maximum (and last entry) of `bkps1` and `bkps2`.

However, I believe your question is about the sample size of the simulation. If the sample size was N (which I varied from 1e3 to 1e4 in steps of 1e3), the change-point sets had size N/2. I did this because the original implementation allocated an N*N matrix regardless of the change-point set, and I wanted to see the effect. In the figure in this comment, I indexed the results directly by the change-point set size.
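A sketch of how such a setting can be generated; the helper name `make_bkps` and the use of `random.sample` are my own, and the actual simulation scripts may differ:

```python
import random


def make_bkps(n_samples, n_bkps, seed=0):
    """Sorted breakpoint list with `n_bkps` entries, the last being n_samples."""
    rng = random.Random(seed)
    # Draw the interior breakpoints without replacement, then sort them.
    inner = sorted(rng.sample(range(1, n_samples), n_bkps - 1))
    return inner + [n_samples]


n_samples = 1_000
# Change-point sets of size N/2, as in the simulation described above.
bkps1 = make_bkps(n_samples, n_samples // 2, seed=1)
bkps2 = make_bkps(n_samples, n_samples // 2, seed=2)
```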

deepcharles commented 2 years ago

@Lucas-Prates I have made some changes. Can you check if the performance is still so-so?

Lucas-Prates commented 2 years ago

Yes, I will check and post it here when I get the chance.

Lucas-Prates commented 2 years ago

Sorry for the delay.

It seems to have improved with the last changes, but there is still a gap.

runtime_comparison

I ran the simulations on my machine. I can provide the scripts if you want to test it yourself.

oboulant commented 2 years ago

Thx @Lucas-Prates! Yes, can you please provide the script?

Lucas-Prates commented 2 years ago

Sure! Just make a venv, install the requirements, and run `run_all.sh` (or the files separately). It should produce a pickle and a csv with the simulation results, along with a .png figure for the runtime comparison. simulation_ruptures.zip

The "time profiling" here is very simplified, but I think that does not account for the discrepancy.

oboulant commented 2 years ago

I looked at this from several angles and, in the end, it seems to me that it really is the impact of using iterables whose values have to be unpacked. In other words, maybe there is not much that would be more efficient than accessing an item in a list by its index.

Adding the `pairwise` function adds an overhead (which is not the reason for the gap between hotmess and clean, but still). More importantly, iterating over the iterable returned by `pairwise` slows the process compared with iterating over `range`, because in the former case the system is busy doing things unrelated to the computation performed within the loop.

See the figures below :

  1. The hotmess implementation stays busy doing computations related to what is performed in the inner loop (min, max, abs)

    Screenshot 2022-01-05 at 23 13 11
  2. The clean implementation is busy doing things unrelated to what is performed in the inner loop

    Screenshot 2022-01-05 at 23 12 31

The time scales on the x-axis of both graphs are the same!

I tried to save some calls to `pairwise` by calling it once for `bkps1` and once for `bkps2`; same behaviour ⬇️

  1. hotmess

    Screenshot 2022-01-05 at 23 36 58
  2. clean

Screenshot 2022-01-05 at 23 36 14

I hope this is clear, I can reformulate if necessary.

My opinion, in light of this, and if I did not miss anything, would be to stick with the hotmess implementation, which scales. And TBH, IMO it is not a hot mess, it is clean :)
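The two loop styles can be compared in isolation with a minimal micro-benchmark; the loop bodies below only mimic the real inner-loop arithmetic, and timings will vary by machine and Python version:

```python
from itertools import tee
from timeit import timeit


def pairwise(iterable):
    # itertools.pairwise exists from Python 3.10; portable stand-in
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


bkps = list(range(200_000))


def loop_range():
    # Index-based access, as in the hotmess implementation.
    total = 0
    for i in range(len(bkps) - 1):
        total += abs(bkps[i + 1] - bkps[i])
    return total


def loop_pairwise():
    # Tuple unpacking from the pairwise iterator, as in the clean one.
    total = 0
    for start, end in pairwise(bkps):
        total += abs(end - start)
    return total


# Both compute the same quantity; only the iteration style differs.
assert loop_range() == loop_pairwise()
t_range = timeit(loop_range, number=10)
t_pairwise = timeit(loop_pairwise, number=10)
```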

deepcharles commented 2 years ago

Thanks @oboulant for the detailed comment. Since the call to `pairwise` in the second for loop is costly, the best approach is to create a list to iterate over, i.e. replace `pairwise(bkps2_with_0)` by `list_of_regimes2` as in https://github.com/deepcharles/ruptures/pull/222#discussion_r779195252, where `list_of_regimes2 = list(pairwise(bkps2_with_0))`.

This solves the complexity discrepancy between the two versions, and keeps the code clean.
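The suggested pattern, sketched on made-up breakpoints:

```python
from itertools import tee


def pairwise(iterable):
    # itertools.pairwise exists from Python 3.10; portable stand-in
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


bkps2_with_0 = [0, 100, 250, 500]
# Materialize the segment pairs once, before the outer loop, instead of
# calling pairwise() afresh at every outer iteration.
list_of_regimes2 = list(pairwise(bkps2_with_0))
# The inner loop can then iterate over list_of_regimes2[beginj:].
```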

oboulant commented 2 years ago

> Thanks @oboulant for the detailed comment. Since the call to pairwise in the second for loop is costly, the best approach is to create a list to iterate over, i.e. replace pairwise(bkps2_with_0) by list_of_regimes2 as in #222 (comment) where list_of_regimes2=list(pairwise(bkps2_with_0)).
>
> This solves the complexity discrepancy between the two versions, and keeps the code clean.

I agree that the proposed replacement is cleaner, because it saves a call to `pairwise` each time we restart the inner loop. Nevertheless, it is basically what I did in the second set of screenshots I shared: it does not solve the performance issue.

In the end, it seems that iterating over `range` is far more efficient than iterating over what we have in `list_of_regimes2`. That is why I asked: why not go with the `range` implementation?

Lucas-Prates commented 2 years ago

That is a very interesting/precise analysis @oboulant, and it seems to explain the discrepancy we have been discussing. Just to check my understanding: are the blank gaps in the figures associated with clean due to computations performed outside the inner loop?

oboulant commented 2 years ago

Blank gaps correspond to time spent directly in `randindex()` and not in a child call.

Since the running code is the following:

    from itertools import pairwise  # Python >= 3.10

    disagreement = 0
    beginj = 0  # first segment of bkps2 that can still overlap the current one
    for (start1, end1) in pairwise(bkps1_with_0):
        for (start2, end2) in pairwise(bkps2_with_0[beginj:]):
            # number of samples shared by the two segments
            nij = max(min(end1, end2) - max(start1, start2), 0)
            disagreement += nij * abs(end1 - end2)

            # we can skip the rest of the iteration, nij will be 0
            if end1 < end2:
                break
            else:
                beginj += 1

Blank gaps mean that time is being spent on tasks other than the computations performed inside the inner loop (`min`, `max`, `abs`). So basically, that leaves the blank spaces as computations related to unpacking the iterables returned by `pairwise`. Since the `+=`, `*` and `<` operations are also performed in the hotmess version, the iterables are what remains. But maybe I am missing something!

Lucas-Prates commented 2 years ago

Thank you for the detailed explanation. I do not know what could be missing.

deepcharles commented 2 years ago

Now, there should not be any difference in performance between the first version and the current one.

I also cleaned up the associated documentation and added a reference to your work @Lucas-Prates.

We should be good to go now.

oboulant commented 2 years ago

runtime_comparison

Here you are, with the discussed modifications.