GUDHI / gudhi-devel

The GUDHI library is a generic open source C++ library, with a Python interface, for Topological Data Analysis (TDA) and Higher Dimensional Geometry Understanding.
https://gudhi.inria.fr/
MIT License
245 stars 65 forks source link

Adding the function to compute the topological disorder of a persiste… #1053

Open TristanMG opened 1 month ago

TristanMG commented 1 month ago

…nce diagram. The function is located in "vector_methods.py"

mglisse commented 1 month ago

Hello, thank you for your contribution. I am trying to think what the best way would be to integrate topological disorder in gudhi, the following is just a brain dump. The gudhi.representations module does not look like a great fit: it contains sklearn-compatible classes, works on lists of diagrams, and expects a diagram to be a list of (birth,death) (usually numpy.array) without dimension. We are trying to move away from the format (dim,(birth,death)). See https://gudhi.inria.fr/python/latest/cubical_complex_sklearn_itf_ref.html or https://github.com/GUDHI/gudhi-devel/pull/925 for the new format. If we assume a format like [dgm_dim0, dgm_dim1, ...], then using sklearn.compose.ColumnTransformer with the current Entropy class almost computes the right thing: all that is missing is the normalization (1-entropy/log(n)), and the sum. We could add an option to Entropy to do this specific normalization, and leave it to the user to compute the sum if they want to. I think that's my current preference. We could consider having sklearn-like transformers that work on lists of "complete diagrams" (when we have the diagrams in several dimensions as in the new format), not just lists of single diagrams as currently, for cases where ColumnTransformer is not appropriate. We could also consider a function that takes one full diagram (old or new format?) and computes the TD directly, as in the current code in this PR, but then it would probably need to go to a separate submodule, not representations, since it would not fit (pun!) with the sklearn interface. Once the API is decided, a simple test exercising the function will be needed before the PR can be merged (an example in the doc may work if it is written in a way that it is actually run and checked, or something in src/python/test). We tend to put references in biblio/bibliography.bib and reference them with :cite:. That's not as nice for people who read the doc through help(function_name), but it yields a nicer web doc.

TristanMG commented 2 weeks ago

Dear Marc,

Thank you for your email. We have discussed your proposed solutions and we have a few questions.

We first thought of putting the function computing topological disorder in the same file as the Entropy class, for their similarities. We are fine adding an option to the Entropy instead of a stand-alone function. With this approach, would you prefer us to modify the function Entropy.transform, adding the option for the user to either compute the persistent entropy or the topological disorder, or to create a new function Entropy.topological_disorder? I assume that if we choose this approach, it will make the transition to the new format of persistence diagrams easier by just modifying the Entropy class? We will add a test file in src/python/test/.

Best wishes, Tristan

Le lun. 3 juin 2024 à 23:20, Marc Glisse @.***> a écrit :

Hello, thank you for your contribution. I am trying to think what the best way would be to integrate topological disorder in gudhi, the following is just a brain dump. The gudhi.representations module does not look like a great fit: it contains sklearn-compatible classes, works on lists of diagrams, and expects a diagram to be a list of (birth,death) (usually numpy.array) without dimension. We are trying to move away from the format (dim,(birth,death)). See https://gudhi.inria.fr/python/latest/cubical_complex_sklearn_itf_ref.html or #925 https://github.com/GUDHI/gudhi-devel/pull/925 for the new format. If we assume a format like [dgm_dim0, dgm_dim1, ...], then using sklearn.compose.ColumnTransformer with the current Entropy class almost computes the right thing: all that is missing is the normalization (1-entropy/log(n)), and the sum. We could add an option to Entropy to do this specific normalization, and leave it to the user to compute the sum if they want to. I think that's my current preference. We could consider having sklearn-like transformers that work on lists of "complete diagrams" (when we have the diagrams in several dimensions as in the new format), not just lists of single diagrams as currently, for cases where ColumnTransformer is not appropriate. We could also consider a function that takes one full diagram (old or new format?) and computes the TD directly, as in the current code in this PR, but then it would probably need to go to a separate submodule, not representations, since it would not fit (pun!) with the sklearn interface. Once the API is decided, a simple test exercising the function will be needed before the PR can be merged (an example in the doc may work if it is written in a way that it is actually run and checked, or something in src/python/test). We tend to put references in biblio/bibliography.bib and reference them with :cite:. That's not as nice for people who read the doc through help(function_name), but it yields a nicer web doc.

— Reply to this email directly, view it on GitHub https://github.com/GUDHI/gudhi-devel/pull/1053#issuecomment-2146223986, or unsubscribe https://github.com/notifications/unsubscribe-auth/BCNLCHG2QVJRY7BOBNS4R3LZFTT2RAVCNFSM6AAAAABIDRYZM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBWGIZDGOJYGY . You are receiving this because you authored the thread.Message ID: @.***>

mglisse commented 1 week ago

would you prefer us to modify the function Entropy.transform, adding the option for the user to either compute the persistent entropy or the topological disorder, or to create a new function Entropy.topological_disorder?

What I had in mind, in order to stick to the scikit-learn API, was to pass an option to __init__ saying that we want the disorder instead of the entropy (maybe allowing mode="disorder" would be better than a separate Boolean argument, but I didn't think about it very long). transform, after it has computed the scalar entropy, would check if this option was set, and in that case replace the entropy ent with 1-ent/log n before storing it in the output list. This would require a few tweaks (mode=="scalar"mode!="vector", various things in the doc), but nothing big.

The nx2 arrays that Entropy takes are already the new format — for a single dimension. With the new format for multiple dimensions, ColumnTransformer+Entropy should give the "split" version from your code, and it would still be up to the user to sum that to a single scalar. If you think it is useful, we could consider a convenience function that calls ColumnTransformer+Entropy+sum (or something roughly equivalent), but for that one I'd rather wait until the new format becomes the main one in Gudhi.

Does that sound ok? Don't hesitate to speak if you have a different opinion about what would be useful.