ot.emd with large datasets

johnalison commented 2 years ago

🚀 Feature

I was wondering if it would be possible to do the OT matrix computation (ot.emd) using sparse matrices, (eg: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_array.html#scipy.sparse.coo_array) where only the distances between pairs of entries below some cutoff would be given, and only the coupling for pairs entries above some cutoff would be returned.

Motivation

I'd like to compute the OT matrix with between large (O(500k)) datasets. Currently, the biggest limitation to dataset size is the memory requirement for storing the len(a) x len(b) matrices. In practice, nearly all of the input distances are large, and nearly all of the corresponding couplings are small. So I thought that the sparse matrices could be a good solution.

Alternatives

We are open to any suggestions for alternatives.

We currently approximate the full OT matrix by computing many OT matrices using O(10k) subsets of the source and target datasets. The effect of the full OT between the datasets is then approximated by "stitching together" the smaller coupling matrices. This reduces the effective statistical power of our data, which we think we can avoid with the sparse matrix implementation.

Additional context

The large datasets come from LHC data: https://arxiv.org/abs/2208.02807 Thank you for your OT implementation ! We have found it very useful in our work.

EduardoGoulart1 commented 9 months ago

I think it shouldn't be too difficult. One would have to adapt the EMD_wrapper to accept sparse arrays. I could give it a try

rflamary commented 9 months ago

Sorry for te late answer. This would be very interesting but I don't think teh current C++ implementation will allow for this. In order to solve very large problems we will need to compute teh cost matrix in a lazy way (recompute the cost only when needed in the solver) which is a classical strategy that saves memory but will also mean slower solvers. If we do that we should be able to work with and return sparse matrices. @EduardoGoulart1 we are obviously very interested by such a conribution.

EduardoGoulart1 commented 9 months ago

I would go with a more simple solution first. We just need to adapt the code to handle sparse cost/flow matrices. The network_simplex_simple.h would require nearly no changes. The main problem seems to be the EMD_wrapper.cpp where a full bipartite graph is explicitly created

rflamary commented 9 months ago

For a sparse cost, you suppose that the cost is infinite if not provided? It was never clear to me what it meant for Lemon solver not to set the values...

EduardoGoulart1 commented 9 months ago

@rflamary I guess he wants the library to accept graphs that are not full-bipartite, i.e. not all (src, tgt) pairs are available. The network simplex code already accepts that. The only modification would be the wrapper. Unfortunately, I'm out for the next 10 days, so I can't prototype that (I would also only do it after doing some cleanups in the C++ code)

Edit: Essentially, that means that sparse costs are infinite. The only complication would if the input data is such that no feasible solution can be generated (but then I would argue that this is the user's problem)

rflamary commented 9 months ago

@EduardoGoulart1 that is very clear and definitely something of interest to us. Take your time and feel free to connect to the slack for direct discussions when your are back.

PythonOT / POT