GUDHI / gudhi-devel

The GUDHI library is a generic open source C++ library, with a Python interface, for Topological Data Analysis (TDA) and Higher Dimensional Geometry Understanding.
https://gudhi.inria.fr/
MIT License
259 stars 66 forks source link

`CubicalComplex` computed persistence differs according to external deps versions #790

Closed astamm closed 1 year ago

astamm commented 1 year ago

In preparing the release of rgudhi for enabling the use of GUDHI in R, my CI which runs the install on several platform spotted an issue in the unit tests for CubicalComplex and its periodic version: https://github.com/LMJL-Alea/rgudhi/actions/runs/3886901289.

Specifically, I request install of scikit-learn==1.2.0 and gudhi==3.7.1. So this explains why CI fails with Python 3.7 as latest versions of scikit-learn are not compatible with Python 3.7. But I cannot figure out the rest of the issues. Specifically, I generate fake data which is a list of two identical lists containing 10 numbers from 0 to 1 equally spread. Then I instantiate the CubicalComplex class or the periodic version and run the method .persistence() and this gives two very different results according to the external deps version.

Specifically, it always fails to match expected value on ubuntu 20.04 (while it finds the expected value on macOS and windows) with:

But still with ubuntu 20.04, it works well with:

astamm commented 1 year ago

Is there some randomness in the way the cubical complex computes persistence which could explain such differences?

mglisse commented 1 year ago

Hi Aymeric, is this an issue that only happens with R, or is it possible to reproduce it directly with Python? I don't think there is supposed to be any randomness in computing the persistence diagram of a periodic cubical complex, but it is hard to say without a full example, and I don't know what to look at...

astamm commented 1 year ago

Hey Marc. Well I only see through the R interface but I don't see any obvious reason why that would be R-specific as I am only calling Python classes and methods. Specifically the unit test setup in R is:

n <- 10
X <- cbind(seq(0, 1, len = n), seq(0, 1, len = n))
cc <- CubicalComplex$new(top_dimensional_cells = X)
cc$compute_persistence()$persistence()

the Python equivalent of which would be probably something like:

import numpy as np
import gudhi as gd
n = 10
X = np.linspace(0, 1, num = n)
X = [X, X]
cc = gd.CubicalComplex(top_dimensional_cells = X)
cc.compute_persistence()
cc.persistence()

Then we would need to test this Python code on the various platforms I mentioned.

mglisse commented 1 year ago

As expected, this prints

[(0, (0.0, inf))]

on linux, whatever the version of python used.

mglisse commented 1 year ago

cc.compute_persistence() cc.persistence()

By the way, this is redundant, persistence already calls compute_persistence, so this computes it twice.

astamm commented 1 year ago

As expected, this prints

[(0, (0.0, inf))]

on linux, whatever the version of python used.

Which Python versions did you include in your tests ? 3.8 to 3.10 ? which Linux distribution ? The problem seems to be specific to Ubuntu 20.04.

astamm commented 1 year ago

Ok there is actually no issues with the cubical complex, only with the periodic cc. When I do:

cc = gd.PeriodicCubicalComplex(top_dimensional_cells = X, periodic_dimensions = True)
cc.persistence()

I get on most systems a diagram with points for dimension 0 and 1 but on Ubuntu 20.04 for some Python and R versions, I get something in dimension 2 as well so that .persistence(), .betti_numbers(), .num_simplices(),.persistent_betti_numbers()and.cofaces_of_persistence_pairs()` does not return the expected value.

mglisse commented 1 year ago

cc = gd.PeriodicCubicalComplex(top_dimensional_cells = X, periodic_dimensions = True)

periodic_dimensions is supposed to be a vector<bool>. If you pass True, I don't know, maybe it converts True to 1, constructs a vector of size 1, and accessing the second element is then undefined behavior...

It does seem easy to misuse periodic_dimensions, we should either diagnose the error, or interpret a bool the same as an array with the right size and constant value.

mglisse commented 1 year ago

Actually, when I try passing periodic_dimensions = True, I get an error

TypeError: 'bool' object is not iterable

I guess it could depend on the version of [cp]ython used, but are you sure that's really the code you tried?

astamm commented 1 year ago

Yes, what I run in R is with

periodic_dimensions = TRUE

which is converted by reticulate::r_to_py() into True. So indeed since in this example it actually would need a length-2 boolean vector, it then has to infer the second element. In R, this is silently achieved by recycling, i.e., in this case it would automatically extend to c(TRUE , TRUE) but it might not understand always that it has to recycle because it needs to understand that from the Python side in this case. Plus, in any event, recycling silently is a bad thing. So I guess I'll start by passing a boolean vector and see if that improves things.

astamm commented 1 year ago

Ok that seems to be the reason.

With the second element to TRUE:

> pcc <- PeriodicCubicalComplex$new(
+     top_dimensional_cells = X,
+     periodic_dimensions = c(TRUE, TRUE)
+   )
> pcc$persistence()
# A tibble: 4 × 3
  dimension birth death
      <int> <dbl> <dbl>
1         2     1   Inf
2         1     0   Inf
3         1     1   Inf
4         0     0   Inf

which is what I get on ubuntu 20.04 when my test fails. And with the second element to FALSE:

> pcc <- PeriodicCubicalComplex$new(
+     top_dimensional_cells = X,
+     periodic_dimensions = c(TRUE, FALSE)
+   )
> pcc$persistence()
# A tibble: 2 × 3
  dimension birth death
      <int> <dbl> <dbl>
1         1     1   Inf
2         0     0   Inf

which is what I get on the rest of platforms.

astamm commented 1 year ago

cc.compute_persistence() cc.persistence()

By the way, this is redundant, persistence already calls compute_persistence, so this computes it twice.

I added in all R classes a private attribute that tracks whether the persistence has already been computed. That way, whenever the user calls .compute_persistence(), it actually only runs if that attribute is FALSE. Maybe that could be interesting to do so also in the Python classes or even at the C++ level?

mglisse commented 1 year ago

I added in all R classes a private attribute that tracks whether the persistence has already been computed. That way, whenever the user calls .compute_persistence(), it actually only runs if that attribute is FALSE.

Then you have to take care to set it back to false whenever someone for instance changes a filtration value or inserts a simplex in the SimplexTree. Also, you need to remember what options were passed to compute_persistence, in case the second call has different options.

astamm commented 1 year ago

Yes you are right. It needs some care in its implementation to get things right.

mglisse commented 1 year ago

I think this one was only on the R side, so we can close it.