GUDHI / gudhi-devel

The GUDHI library is a generic open source C++ library, with a Python interface, for Topological Data Analysis (TDA) and Higher Dimensional Geometry Understanding.
https://gudhi.inria.fr/
MIT License
258 stars 66 forks source link

Associated matching to Wasserstein distance #203

Closed MathieuCarriere closed 4 years ago

MathieuCarriere commented 4 years ago

@tlacombe

I think it would be great if gudhi.wasserstein.wasserstein_distance() could optionally return the optimal partial matching associated to the distance.

There are probably a couple of ways to implement this, the most naive one being to take, for each row i of the cost matrix Gamma, the maximum of Gamma[i,:] (although this could potentially return something that is not a matching if we ask for it before convergence).

tlacombe commented 4 years ago

Hi @MathieuCarriere , I actually have already "implemented" that (i.e. imported POT) for barycenters (see #167 ), although it is not fully robust yet. Basically, POT returns a transport plan, that is a matrix of size (n x n), which is theoretically a permutation matrix (i.e. exactly one 1 in each row/column, 0 elsewhere) when comparing two diagrams, but in practice numerical imprecisions can lead to some 0.99999 and 1E-12 entries. Thus, I applied some heuristic threshold, and never had an error yet. I think it is not super satisfactory on very large diagrams (more numerical errors / non-convergence could occur), but anyway, computing an exact optimal matching between super large point clouds is known to be intractable. Also, it is a bit disappointing to store a (n x n) matrix (in a non-sparse way) while we only want to return something of size (n)...

I also tried other solvers, like linear_sum_assignment by scipy, some homemade implementation of the Hungarian algorithm, etc. But in practice, POT was the fastest way to go by a significant margin, at least in my trials. However, I do not exclude that a better solver exists, maybe not implemented in Python.

I will propose an implementation soon, in wasserstein_distance.py I guess.

mglisse commented 4 years ago

in practice numerical imprecisions can lead to some 0.99999 and 1E-12 entries

If I remember correctly, you are using weights like 1/(m+n) so they sum to 1. Is that a requirement of POT, or could you avoid renormalizing the weights and pass integer weights like 1? This might eliminate this kind of issue. Of course you would need to renormalize the distance before returning it.

If necessary, we could extract the C++ code for the Hungarian algorithm from POT and adapt it to output a sparse matrix or something, although I'd rather avoid that.

tlacombe commented 4 years ago

If I remember correctly, you are using weights like 1/(m+n) so they sum to 1. Is that a requirement of POT, or could you avoid renormalizing the weights and pass integer weights like 1? This might eliminate this kind of issue. Of course you would need to renormalize the distance before returning it.

Yes I used these renormalization. But as far as I remember, it was required by POT (which is weird because on quick tests right now it seemed to work with unnormalized version). I will investigate that.

mglisse commented 4 years ago

as far as I remember, it was required by POT

Maybe it is required for Sinkhorn (which you were planning to use) but not for the Hungarian algorithm?

tlacombe commented 4 years ago

Maybe it is required for Sinkhorn (which you were planning to use) but not for the Hungarian algorithm?

Well it seems to work with Sinkhorn also... I will do more investigations soon. I am pretty sure I solved a bug when normalizing my weight vectors, but I can't remember which one.

tlacombe commented 4 years ago

Quick update: @mglisse told me that a sparse matrix implementation will be available in a future release of POT ; so I suggest to wait for this to implement your suggestion.

However, if you want a non-sparse one, I can go for this quickly.

tlacombe commented 4 years ago

Handled by PR #236 , I think this issue can be closed.

mglisse commented 4 years ago

Closing will happen automatically if the first message in your pull request (you can edit it), or one of the commit messages, contains "Fix #203".

tlacombe commented 4 years ago

Ah, nice to know !

mglisse commented 4 years ago

Seems that you put it in the one place where it doesn't work (the title of the PR), bad luck. I added it to the first message for you.