Closed — @Lucas-Prates closed this 2 years ago
Merging #222 (d88ecec) into master (2f2a080) will increase coverage by 0.00%. The diff coverage is 100.00%.
```
@@            Coverage Diff            @@
##           master     #222    +/-   ##
========================================
  Coverage   98.76%   98.77%
========================================
  Files          40       40
  Lines         972      978      +6
========================================
+ Hits          960      966      +6
  Misses         12       12
```
Flag | Coverage Δ |
---|---|
unittests | 98.77% <100.00%> (+<0.01%) ⬆️ |
Impacted Files | Coverage Δ |
---|---|
src/ruptures/metrics/hamming.py | 100.00% <100.00%> (ø) |
src/ruptures/metrics/randindex.py | 100.00% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Many thanks for this very useful PR.
The implementation of Rand index is indeed quite naive. I will take a look.
Am I correct in assuming that the code you provided implements this report? I am still looking at the pseudo-code.
Yes, I am the author of the report, and the code is an implementation of that pseudo-code. Thank you for all your observations/corrections; I apologize for the mess. The implementation using `islice` is much cleaner. I tried using `pairwise` but did not know how to skip iterations.
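For reference, the classic itertools recipe for `pairwise` (available as `itertools.pairwise` since Python 3.10) can be written with `tee`:

```python
from itertools import tee


def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one element
    return zip(a, b)


list(pairwise([0, 10, 20, 30]))  # -> [(0, 10), (10, 20), (20, 30)]
```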
For the sake of completeness, I will provide an observation on the run time of the algorithms.
I did a simple simulation, and the figure below shows a comparison of their performance as the sample size increases. `ri_clean` denotes the algorithm after the corrections, `ri_hotmess` the algorithm from the initial PR, and `ri_ham` the original implementation.
For large sample sizes, e.g. N > 1e5, `ri_clean` takes much more time than `ri_hotmess`, which I do not understand. I did some profiling, and both algorithms are almost identical: they perform exactly the same number of loop iterations, and their memory usage (as shown by the profiler) is the same. My guess is that the discrepancy comes from the evaluation of `[0] + bkps1` and `[0] + bkps2` in the argument of the `pairwise` function. Is Python copying the list? If so, other methods using this construction might suffer from the same problem. But of course, this last part is speculation.
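On the copying question: `[0] + bkps1` does allocate a fresh list on every evaluation, which can be checked directly (a minimal illustration of my own, not the PR code):

```python
bkps1 = [5, 10, 15]
with_zero = [0] + bkps1

# Concatenation builds a brand-new list, leaving the original untouched.
assert with_zero is not bkps1
assert with_zero == [0, 5, 10, 15]

# The copy is independent: mutating bkps1 does not affect with_zero.
bkps1.append(20)
assert with_zero == [0, 5, 10, 15]
```

Copying a k-element list is O(k), so it only matters when the expression is re-evaluated inside a loop rather than hoisted out once.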
Thanks for this detailed information. Indeed, it is surprising to see that `ri_clean` performs so much worse.
Can you tell us if using temporary variables (e.g. `bkps1_with_0 = [0] + bkps1`) solves the problem? Also, is `n_samples` the size of `bkps1`/`bkps2`?
I did try to use temporary variables, but the effect seems to persist, as you can see below.
`ri_clean_w0` refers to the version using temporary variables. I also omitted the original implementation.
With respect to your question about `n_samples`: if you are referring to the variable in the code, no, it is the number of samples in the data set, which should be the maximum (and last entry) of `bkps1` and `bkps2`.
However, I believe your doubt concerns the sample size of the simulation. If the sample size was N (which I varied from 1e3 to 1e4 in steps of 1e3), the change-point sets had size N/2. I did this because the original implementation allocated an N*N matrix regardless of the change-point set, so I wanted to see the effect. In the figure in this comment, I displayed the information directly in terms of the change-point set size.
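A minimal sketch of how change-point sets of size N/2 could be drawn for such a simulation (my own illustration; `make_bkps` is a hypothetical helper, not the actual script from the attached archive):

```python
import random


def make_bkps(n_samples):
    # Draw a sorted change-point set of size n_samples // 2.
    # The last entry is n_samples itself, matching the ruptures convention.
    inner = sorted(random.sample(range(1, n_samples), n_samples // 2 - 1))
    return inner + [n_samples]


bkps = make_bkps(1000)
assert len(bkps) == 500
assert bkps[-1] == 1000
```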
@Lucas-Prates I have made some changes. Can you check if the performance is still so-so?
Yes, I will check and post it here when I get the chance.
Sorry for the delay.
It seems to have improved with the last changes, but there is still a gap.
I ran the simulations on my machine. I can provide the scripts if you want to test it yourself.
Thx @Lucas-Prates! Yes, can you please provide the script?
Sure! Just make a venv, install the requirements, and run "run_all.sh" (or the files separately). It should produce a pickle and a csv with the simulation results, along with a .png figure for the runtime comparison. simulation_ruptures.zip
The "time profiling" here is very simplified, but I think that does not account for the discrepancy.
I looked at it from several angles and, in the end, it seems to me that it is really the impact of using iterables whose values have to be unpacked. In other words, there may not be much that is more efficient than accessing an item in a list by its index.
Adding the `pairwise` function adds an overhead (which is not the reason for the spread between `hotmess` and `clean`, but still). More importantly, it seems that, as a consequence, iterating over the iterable returned by `pairwise` slows the process compared to iterating over `range`, because in the former case the system is busy doing things unrelated to the computation performed within the loop.
See the figures below:

- The `hotmess` implementation stays busy doing computations related to what is performed in the inner loop (`min`, `max`, `abs`).
- The `clean` implementation is busy doing things unrelated to what is performed in the inner loop.

The time scales on the x-axis of both graphs are the same!
I tried to save some calls to `pairwise` by calling it once for `bkps1` and once for `bkps2`; same behaviour ⬇️

`hotmess`

`clean`
I hope this is clear, I can reformulate if necessary.
My opinion, in light of this, and if I did not miss anything, would be to stick to the `hotmess` implementation that scales. And TBH, IMO it is not *not* clean, it is clean :)
Thanks @oboulant for the detailed comment.
Since the call to `pairwise` in the second for loop is costly, the best approach is to create a list to iterate over, i.e. replace `pairwise(bkps2_with_0)` by `list_of_regimes2` as in https://github.com/deepcharles/ruptures/pull/222#discussion_r779195252, where `list_of_regimes2 = list(pairwise(bkps2_with_0))`.
This solves the complexity discrepancy between the two versions, and keeps the code clean.
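A minimal sketch of the proposed change with toy breakpoints (my own illustration; `pairwise` is spelled out via `itertools.tee` for portability, and the early-exit/`beginj` logic of the PR is omitted for brevity):

```python
from itertools import tee


def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


bkps1_with_0 = [0, 4, 10]
bkps2_with_0 = [0, 5, 10]

# Materialise the inner iterable once, instead of rebuilding it with
# pairwise() on every pass of the outer loop.
list_of_regimes2 = list(pairwise(bkps2_with_0))

disagreement = 0
for start1, end1 in pairwise(bkps1_with_0):
    for start2, end2 in list_of_regimes2:
        # Length of the intersection of the two segments (0 if disjoint).
        nij = max(min(end1, end2) - max(start1, start2), 0)
        disagreement += nij * abs(end1 - end2)
```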
> Thanks @oboulant for the detailed comment. Since the call to `pairwise` in the second for loop is costly, the best approach is to create a list to iterate over, i.e. replace `pairwise(bkps2_with_0)` by `list_of_regimes2` as in #222 (comment), where `list_of_regimes2 = list(pairwise(bkps2_with_0))`. This solves the complexity discrepancy between the two versions, and keeps the code clean.
I agree with the proposed replacement in the sense that it is cleaner, because it saves a call to `pairwise` each time we start over with the inner loop.
Nevertheless, it is basically what I did in the second set of screenshots that I shared: it does not solve the performance issue.
In the end, it seems that iterating over `range` is way more efficient than iterating over what we have for `list_of_regimes2`. That is why I asked why not go with the implementation with `range`?
That is a very interesting/precise analysis @oboulant, and it seems to explain the discrepancy we have been discussing. Just to check my understanding: the blank gaps in the figures associated with `clean` are due to computations performed outside the inner loop?
Blank gaps correspond to time spent directly in `randindex()` and not in a child call.
Since the code running is the following:

```python
for (start1, end1) in pairwise(bkps1_with_0):
    for (start2, end2) in pairwise(bkps2_with_0[beginj:]):
        nij = max(min(end1, end2) - max(start1, start2), 0)
        disagreement += nij * abs(end1 - end2)
        # we can skip the rest of the iteration, nij will be 0
        if end1 < end2:
            break
        else:
            beginj += 1
```
Blank gaps mean we are performing tasks that are not:

- `max`
- `min`
- `abs`

So basically, the blank spaces are computations related to:

- `+=` operations
- `*` operations
- `<` operations

Since the `+=`, `*`, `<` operations are also performed in `hotmess`, that leaves the iterables as the remaining explanation. But maybe I am missing something!
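The kind of gap being described can be measured in isolation with a toy benchmark (my own sketch, not the ruptures code; `via_pairwise` and `via_range` are illustrative names, and which variant wins may depend on the interpreter version):

```python
import timeit
from itertools import tee


def pairwise(iterable):
    # itertools recipe (itertools.pairwise in Python >= 3.10)
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


bkps = list(range(0, 100_001, 10))


def via_pairwise():
    # Iterate by unpacking consecutive pairs from an iterator.
    total = 0
    for start, end in pairwise(bkps):
        total += end - start
    return total


def via_range():
    # Iterate by indexing into the list directly.
    total = 0
    for i in range(1, len(bkps)):
        total += bkps[i] - bkps[i - 1]
    return total


assert via_pairwise() == via_range() == 100_000
t_pair = timeit.timeit(via_pairwise, number=50)
t_range = timeit.timeit(via_range, number=50)
print(f"pairwise: {t_pair:.3f}s  range/index: {t_range:.3f}s")
```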
Thank you for the detailed explanation. I do not know what could be missing.
Now, there should not be any difference in performance between the first version and the current one.
I also cleaned the associated documentation and added a reference to your work @Lucas-Prates .
We should be good to go now.
Here you are, with the discussed modifications
Greetings.
I believe there is a minor mistake in the scaling done in the hamming method; it should be `n_samples*(n_samples-1)`.
I also provided an additional implementation of the Rand index under the method `randindex_cpd`. This implementation is based on a specific expression for the metric in change-point problems. If we have N samples and change-point sets of sizes r and s, this algorithm runs in O(r+s) time and O(1) memory. The traditional implementation runs in O(rs+N) time and O(rs) memory, although your implementation might use less due to sparsity. This new implementation seems to perform some orders of magnitude better, depending on N, in a few profiling runs that I did.
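One way to validate any faster variant is to compare it against a brute-force O(N²) Rand index computed directly from segment labels. Below is a reference sketch of my own (`rand_index_naive` and `labels_from_bkps` are hypothetical names, not the PR's API):

```python
from itertools import combinations


def labels_from_bkps(bkps):
    # bkps: sorted breakpoints; the last entry equals n_samples.
    labels, start = [], 0
    for seg, end in enumerate(bkps):
        labels.extend([seg] * (end - start))
        start = end
    return labels


def rand_index_naive(bkps1, bkps2):
    # O(N^2) reference: fraction of sample pairs on which the two
    # segmentations agree (both "together" or both "apart").
    l1, l2 = labels_from_bkps(bkps1), labels_from_bkps(bkps2)
    n = len(l1)
    agree = sum(
        (l1[i] == l1[j]) == (l2[i] == l2[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)


rand_index_naive([5, 10], [5, 10])  # identical segmentations -> 1.0
```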
I can provide the proof of correctness of the algorithm if requested.
Thanks.