Closed MathieuCarriere closed 4 years ago
Hi @MathieuCarriere ,
I actually have already "implemented" that (i.e. imported POT) for barycenters (see #167 ), although it is not fully robust yet. Basically, POT returns a transport plan, that is a matrix of size (n x n), which is theoretically a permutation matrix (i.e. exactly one 1
in each row/column, 0
elsewhere) when comparing two diagrams, but in practice numerical imprecisions can lead to some 0.99999
and 1E-12
entries. Thus, I applied some heuristic threshold, and never had an error yet. I think it is not super satisfactory on very large diagrams (more numerical errors / non-convergence could occur), but anyway, computing an exact optimal matching between super large point clouds is known to be intractable. Also, it is a bit disappointing to store a (n x n) matrix (in a non-sparse way) while we only want to return something of size (n)...
I also tried other solvers, like linear_sum_assignment
by scipy
, some homemade implementation of the Hungarian algorithm, etc. But in practice, POT was the fastest way to go by a significant margin, at least in my trials. However, I do not exclude that a better solver exists, maybe not implemented in Python.
I will propose an implementation soon, in wasserstein_distance.py
I guess.
in practice numerical imprecisions can lead to some
0.99999
and1E-12
entries
If I remember correctly, you are using weights like 1/(m+n) so they sum to 1. Is that a requirement of POT, or could you avoid renormalizing the weights and pass integer weights like 1? This might eliminate this kind of issue. Of course you would need to renormalize the distance before returning it.
If necessary, we could extract the C++ code for the Hungarian algorithm from POT and adapt it to output a sparse matrix or something, although I'd rather avoid that.
If I remember correctly, you are using weights like 1/(m+n) so they sum to 1. Is that a requirement of POT, or could you avoid renormalizing the weights and pass integer weights like 1? This might eliminate this kind of issue. Of course you would need to renormalize the distance before returning it.
Yes I used these renormalization. But as far as I remember, it was required by POT (which is weird because on quick tests right now it seemed to work with unnormalized version). I will investigate that.
as far as I remember, it was required by POT
Maybe it is required for Sinkhorn (which you were planning to use) but not for the Hungarian algorithm?
Maybe it is required for Sinkhorn (which you were planning to use) but not for the Hungarian algorithm?
Well it seems to work with Sinkhorn also... I will do more investigations soon. I am pretty sure I solved a bug when normalizing my weight vectors, but I can't remember which one.
Quick update: @mglisse told me that a sparse matrix implementation will be available in a future release of POT ; so I suggest to wait for this to implement your suggestion.
However, if you want a non-sparse one, I can go for this quickly.
Handled by PR #236 , I think this issue can be closed.
Closing will happen automatically if the first message in your pull request (you can edit it), or one of the commit messages, contains "Fix #203".
Ah, nice to know !
Seems that you put it in the one place where it doesn't work (the title of the PR), bad luck. I added it to the first message for you.
@tlacombe
I think it would be great if
gudhi.wasserstein.wasserstein_distance()
could optionally return the optimal partial matching associated to the distance.There are probably a couple of ways to implement this, the most naive one being to take, for each row
i
of the cost matrixGamma
, the maximum ofGamma[i,:]
(although this could potentially return something that is not a matching if we ask for it before convergence).