GUDHI / gudhi-devel

The GUDHI library is a generic open source C++ library, with a Python interface, for Topological Data Analysis (TDA) and Higher Dimensional Geometry Understanding.
https://gudhi.inria.fr/
MIT License

Randomness in Cover Complex #795

Open astamm opened 1 year ago

astamm commented 1 year ago

I understand that behind the scenes, all computations for the GIC in the cover complex are performed in C++. However, there is no way from the Python interface to set the seed on the C++ side for reproducibility, unless I missed something. Could that be added?

mglisse commented 1 year ago

That's a general issue in Gudhi. Even in the C++ interface we seldom let users set the seed. Adding this possibility without breaking the API or making it too ugly may not always be trivial (or maybe it is...). (search for std::random_device to find a few relevant places)

It isn't high on our list of priorities for now, but if someone wants to tackle it, I'll make sure we review the PR.

astamm commented 1 year ago

I see, thanks. From a data science perspective, it is very important to be able to set the seed (and the generator algorithm) to ensure reproducibility of the results. But maybe that only means adding an extra seed argument to any function or method that uses random numbers? Then we also need to make sure it is reproducible across platforms. Is randomness always used via std::random? Or do you also use the Boost module for randomness? Do you know whether they use the same underlying algorithm for generating random numbers?

mglisse commented 1 year ago

Hoping for too much reproducibility is illusory. For instance: https://pytorch.org/docs/stable/notes/randomness.html

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

(actually, computing the "same" distance matrix between 2 point clouds repeatedly 20 times using pykeops+pytorch, I get 20 slightly different matrices)

In addition to C++, we also use random numbers from CGAL, from Boost (through CGAL), and from numpy (which can be seeded globally). There is also some randomness that comes from TBB, if you activate it.

It can be more convenient to set a seed globally instead of passing one to each function, although that may not work as well with parallelism.
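As a minimal sketch of what such a global seed could look like, assuming a single process-wide value that each thread-local engine reads at first use (all names here are illustrative, not part of the GUDHI API):

```cpp
#include <random>

namespace gudhi_sketch {

// Hypothetical process-wide seed; defaults to a nondeterministic value.
inline unsigned long long& global_seed() {
  static unsigned long long seed = std::random_device{}();
  return seed;
}

inline void set_global_seed(unsigned long long s) { global_seed() = s; }

// Each thread lazily seeds its own engine from the global seed. With
// parallelism, results can still depend on how work is scheduled.
inline std::mt19937_64& engine() {
  thread_local std::mt19937_64 re(global_seed());
  return re;
}

}  // namespace gudhi_sketch
```

The caveat in the comment is real: even with identical per-thread seeds, the assignment of work items to threads can change between runs, so a global seed alone does not make parallel code reproducible.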

While C++ defines exactly the output of engines like mt19937_64, the way those bits are used by uniform_int_distribution is specific to each implementation.
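That asymmetry can be seen directly: the standard even pins down the 10000th output of a default-constructed std::mt19937_64, while a distribution drawing from the very same engine carries no such guarantee. A small illustration:

```cpp
#include <random>

// Portable: raw engine output is fully specified by the C++ standard
// ([rand.predef] mandates the 10000th value of a default-seeded engine).
inline unsigned long long mt19937_64_10000th() {
  std::mt19937_64 gen;  // default seed 5489
  gen.discard(9999);
  return gen();
}

// Not portable: how the engine bits are mapped to an int in [0, 99] is
// implementation-defined, so this value may differ between libstdc++,
// libc++, and MSVC even though the engine bits feeding it are identical.
inline int implementation_defined_draw() {
  std::mt19937_64 gen;
  std::uniform_int_distribution<int> dist(0, 99);
  return dist(gen);
}
```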

Some algorithms may use the order of a set of pointers, because they need an order for efficient processing, but that order can change from one run to the next.

Etc. It's not trivial. Although again, if someone wants to tackle the lowest-hanging fruit...

astamm commented 1 year ago

I understand your arguments. However, when I run a simulation study for a paper, say, I need it to be reproducible, meaning that if reviewers or readers want to redo the experiment, they end up with the same results, which makes total sense. I understand the comment from PyTorch, but in my view this is a flaw. Reproducibility is of paramount importance in data science. It may not be easy to implement, but nobody will use a method that relies on random numbers in a statistical analysis if it is not guaranteed to be reproducible. In R, for instance, you wrap your code in withr::with_seed(seed, { # your code }) and you're done. I understand that for software like Gudhi, which is multi-lingual with external dependencies that handle randomness in different ways, it is harder, but I really believe it should be high on the priority list. I am not yet familiar enough with the C++ API to elaborate a solution at the moment, though. What I do know is that I will probably remove the cover complex from the R interface for now, because its results are not reproducible (this is the only place where I noticed non-reproducibility, by the way).

astamm commented 1 year ago

For the cover complex, it seems that adding a generator such as std::mt19937 as an argument to this function https://github.com/GUDHI/gudhi-devel/blob/83d1fc15e402551d2d4a7be8d4ecc124e9f886d5/src/Nerve_GIC/include/gudhi/GIC.h#L148 could do the trick. Then call GetUniform(rng) with a pre-instantiated generator whose seed is either the current time or a user-specified one.
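A sketch of that suggestion, assuming an overload of GetUniform next to the existing zero-argument version (this is illustrative, not the actual GIC.h code):

```cpp
#include <random>

// Hypothetical overload: take the caller's engine by reference, so the
// caller controls seeding; the existing GetUniform() stays untouched.
inline double GetUniform(std::mt19937_64& rng) {
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  return dist(rng);  // the engine is portable; the distribution is not
}

// Caller side: seed once (current time or user-supplied), reuse everywhere.
inline double first_uniform(unsigned long long seed) {
  std::mt19937_64 rng(seed);
  return GetUniform(rng);
}
```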

mglisse commented 1 year ago

However, when I run a simulation study for a paper say, I need it to be reproducible, meaning that if reviewers or readers want to redo the experiment, they will end up with the same results, which makes total sense.

If rerunning the experiments a few times with different seeds gives wildly different results, IMO it means that the result is not reliable, even if there is a 1-in-a-billion seed that does give a good result.

Reproducibility is of paramount importance in data science.

It is important, I won't deny it, but IMO the importance of bit-for-bit exact reproducibility is sometimes overestimated.

It may not be easy to implement but nobody will use in a statistical analysis a method that relies on random numbers if it is not guaranteed to be reproducible.

Depends who you ask...

In R for instance, you wrap your code in withr::with_seed(seed, { # your code }) and you're done.

Yes, a global parameter may be more convenient than passing a seed to every function.

mglisse commented 1 year ago

For the cover complex, it seems that adding a generator such as std::mt19937 as argument to this function

https://github.com/GUDHI/gudhi-devel/blob/83d1fc15e402551d2d4a7be8d4ecc124e9f886d5/src/Nerve_GIC/include/gudhi/GIC.h#L148

could do the trick.

You missed the next line. As I was saying, C++ does not specify std::uniform_real_distribution in a way that guarantees reproducibility: different standard libraries or platforms may derive different doubles from the same random bits provided by the random engine.
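For illustration, a common workaround (not GUDHI code) is to bypass std::uniform_real_distribution entirely and map the engine's bits to a double by hand; for a 64-bit engine, keeping the top 53 bits yields a uniform value in [0, 1) identically on every conforming platform:

```cpp
#include <random>

// Portable alternative to std::uniform_real_distribution<double>(0, 1):
// drop 11 of the 64 engine bits, keep 53 (a double's significand width),
// and scale by 2^-53. Same bits in, same double out, on every platform.
inline double portable_uniform01(std::mt19937_64& rng) {
  return (rng() >> 11) * (1.0 / 9007199254740992.0);  // 9007199254740992 = 2^53
}
```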

astamm commented 1 year ago

For the cover complex, it seems that adding a generator such as std::mt19937 as argument to this function https://github.com/GUDHI/gudhi-devel/blob/83d1fc15e402551d2d4a7be8d4ecc124e9f886d5/src/Nerve_GIC/include/gudhi/GIC.h#L148

could do the trick.

You missed the next line. As I was saying, C++ does not specify std::uniform_real_distribution in a way that guarantees reproducibility: different standard libraries or platforms may derive different doubles from the same random bits provided by the random engine.

This is where I think the R ecosystem is better organised than the Python one in terms of package repositories. Upon acceptance, packages are built on all platforms with specific toolchains, which makes reproducibility guaranteed simply by setting the seed the way I showed earlier. When people later install the package, a precompiled binary is downloaded. Hence an extra seed argument to methods using pseudo-random number generation would do the trick for rgudhi. Maybe not ideal, though. I'll skip the cover complex for now. In any event, @MathieuCarriere told me there is an ongoing PR to make it sklearn-friendly, so I'll port it with the next release if we manage to find a way to reproducibility.

astamm commented 1 year ago

If we take another step back, there is no harm in adding a seed or random_number_generator argument to GetUniform(). We can make it behave by default as it does now, i.e. giving random results, but with the extra option of making things reproducible in R.

astamm commented 1 year ago

Non-reproducibility, as I said, is a big issue for e.g. hypothesis testing, where by definition you make a decision about whether or not to reject a hypothesis, and you could be wrong. If, solely because of the seed, you reject on some platforms or in some runs but not in others, it just feels weird.

mglisse commented 1 year ago

I just had a look, GetUniform uses

    thread_local std::default_random_engine re;

and this default construction actually uses a hardcoded seed, always the same (even the same in different threads, which sounds like a bad idea to me...), so passing a seed to this function will not help make it more reproducible. Ah, no, you probably want to be able to reproduce it without restarting R :cry: so it would at least require something to reinitialize the random engine. The parallel case is probably hopeless.

One thing we could easily do is replace default_random_engine with mt19937_64, for instance, so we are sure all platforms use the same random engine, although that won't help with the fact that uniform_real_distribution depends on the platform.
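As a sketch of that change, with hypothetical names (not the actual GUDHI code): a fixed engine type in place of the implementation-defined default_random_engine, plus a reseed hook so the engine can be reinitialized without restarting the host process (R or Python):

```cpp
#include <random>

// Same engine type on every platform, instead of the
// implementation-defined std::default_random_engine.
inline std::mt19937_64& gic_engine() {
  thread_local std::mt19937_64 re;
  return re;
}

// Reinitialize the current thread's engine, so a run can be reproduced
// without restarting R/Python. (Illustrative name, not a GUDHI API.)
inline void gic_reseed(unsigned long long seed) {
  gic_engine().seed(seed);
}
```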

astamm commented 1 year ago

Thanks @mglisse! What about relying on the Boost math module instead of std::random? Could that help ensure cross-platform compatibility?