Eetion / Eirene.jl

Julia library for homological persistence

Wasserstein on Ubuntu 18 within Docker - Linux on Power (RHEL 7) #29

IBMRK opened this issue 4 years ago

IBMRK commented 4 years ago

I'm running Eirene on Linux on Power (PPC), based on Julia 1.4.1, which I compiled from source (what a mission!).

This is my high-level configuration (happy to share more detail as you may need):

Hardware: Power 9 CPU x 20, 150 GB RAM
OS: RHEL OpenShift OS (Linux kernel 3.10)
Software: Docker 18.03
Docker container: Ubuntu 18.04 (64-bit), Julia 1.4.1 (compiled from source), Eirene

Beyond the above, I should share a caveat up front, because I'm not sure whether my issue is related to it:

When I compiled Julia v1.4.1, the precompiled HDF5 Julia library (v0.13.x) was broken, so I had to downgrade it (based on an obscure note recommending that v0.11.x works). Not sure if this is relevant to my issue ... I'm actually planning a separate test (yet to do) of my workflow (below) on Intel, with the standard HDF5 and a smaller data set, because I have much less RAM available on my laptop.

Now to the issue.

I've written a rudimentary Eirene program (just for testing) in the following way:

1. Initialize an array (100x9) representing the CMB data: `Sky_COM = [....]`
2. Initialize an array (100x9) representing the Gaussian SIM data: `NL_50G = [....]`
3. Calculate the CMB homology (maxdim 4, subject to available RAM): `PH_Sky_COM = Eirene.eirene(Sky_COM, model="pc", maxdim=4)`
4. Calculate the Gaussian SIM homology (maxdim 4, subject to available RAM): `PH_NL_50G = Eirene.eirene(NL_50G, model="pc", maxdim=4)`
5. Calculate the Betti curve for the CMB homology: `BC_Sky_COM = Eirene.betticurve(PH_Sky_COM, dim=0)`
6. Calculate the Betti curve for the Gaussian SIM homology: `BC_NL_50G = Eirene.betticurve(PH_NL_50G, dim=0)`
7. Calculate the Wasserstein distance between the Betti curves of the homologies: `Eirene.wasserstein_distance(BC_Sky_COM, BC_NL_50G, q=2, p=2)`
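
Put together, the test program looks roughly like this (rand arrays stand in for the real CMB and simulation data, which I haven't pasted here):

```julia
# Consolidated version of steps 1-7; rand() arrays are stand-ins for the real data.
using Eirene

Sky_COM = rand(100, 9)    # stand-in for the CMB data (step 1)
NL_50G  = rand(100, 9)    # stand-in for the Gaussian SIM data (step 2)

PH_Sky_COM = Eirene.eirene(Sky_COM, model="pc", maxdim=4)    # step 3
PH_NL_50G  = Eirene.eirene(NL_50G,  model="pc", maxdim=4)    # step 4

BC_Sky_COM = Eirene.betticurve(PH_Sky_COM, dim=0)            # step 5
BC_NL_50G  = Eirene.betticurve(PH_NL_50G,  dim=0)            # step 6

Eirene.wasserstein_distance(BC_Sky_COM, BC_NL_50G, q=2, p=2) # step 7
```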

BTW, I have tested this workflow with smaller data sets (25x9 and 50x9 instead of the 100x9 above); those runs finished quickly (5-12 minutes) and terminated without errors, producing the Wasserstein distance, which is the desired output.

The problem is that with the 100x9 data set above, step 7 had been running for over 2 days when I eventually killed it (suspecting a problem). What I did notice was that only 1 of our CPUs was running "julia", at 100% (overall the machine was 97% idle).

I hope I've at least managed to share something high level about the problem.

I'd appreciate any suggestions you might have; happy to provide additional info as you think best.

Eetion commented 4 years ago

Thanks for the report! @yossibokor may have some insight.

yossibokorbleile commented 4 years ago

Yes, thanks for the report!

@IBMRK, computing the Wasserstein distance between two persistence diagrams is computationally expensive, particularly as the number of points increases. Looking at the size of your data sets, I am not surprised by the increased run time. To check whether there are any other issues, could you see how long it takes to calculate the (2,2)-Wasserstein distance between either BC_Sky_COM or BC_NL_50G and the persistence diagram containing a single point, (1,1)?
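
In code, the check I have in mind is roughly this (a sketch; it reuses BC_Sky_COM from step 5 of your workflow, and represents the one-point diagram as a 1x2 birth/death array):

```julia
# Sanity check: time the (2,2)-Wasserstein distance from one of your
# computed objects to the diagram containing only the point (1,1).
# Assumes BC_Sky_COM from the workflow above is already in scope.
using Eirene

one_point = [1.0 1.0]   # 1x2 array holding the single (birth, death) pair (1,1)
@time Eirene.wasserstein_distance(BC_Sky_COM, one_point, q=2, p=2)
```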

IBMRK commented 4 years ago

Thanks, Yossi. I've started a run as you suggested.

Will keep you posted on the results.

A few questions please (out of interest):

1.) Is there any way to peek into the running process to see what it's doing? I would think that Linux "strace" could be used on the running "julia" process, but then it would be helpful to know something of the program flow ...

2.) Is there any way to estimate (very back-of-envelope) how much RAM each of the Eirene functions (homology / Betti curve / Wasserstein) will need based on the input (N x M) size? This could help a lot in my upcoming experiments ... naturally, where possible, I'd like to push the barrier to as large a data set as possible, (hopefully) without having to wait a few hours/days only to find out that it crashed Julia.

3.) Is there any way to parallelize the Wasserstein function (split at the beginning, process in parallel, then join the partial results at the end)? I'd be keen to hear your thoughts on this in particular ... I'm sure this is a bit of a Holy Grail and you probably get asked it a lot by us newbies.

yossibokorbleile commented 4 years ago

Great! @Eetion might have more to say in response to your questions, but here is at least the start of some answers. @Chr1sWilliams might have some things to add as well.

1.) I don't know how you would do this, but I can give an overview of what the Wasserstein distance calculation does (there is a rough sketch after these answers). Given persistence diagrams D_1 and D_2, the Wasserstein distance can be thought of as the cost of an optimal matching between the diagrams. This means we need to create new diagrams D'_1 and D'_2 which have the same cardinality, by adding points on the diagonal; we end up with size(D'_1) = size(D'_2) = size(D_1) + size(D_2). After this, the Hungarian algorithm is used to obtain the optimal matching. Finally, we use this matching to calculate the appropriate cost. Hope this makes things as clear as mud.

2.) I can't speak for anything other than wasserstein_distance, but to calculate the Wasserstein distance we create a (size(D_1)+size(D_2)) x (size(D_1)+size(D_2)) matrix, so depending on your point of view, a lot?

3.) For the (inf, q)-Wasserstein distance, @Chr1sWilliams and I are aware of a method for optimising the computation, but we have implemented the more general Wasserstein distances. As yet, there is no way to parallelise the calculation of the Wasserstein distance as implemented here.
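
To make 1.) and 2.) concrete, here is a rough standalone sketch of the matching computation. This is not Eirene's internal code: it uses the Hungarian.jl package for the matching and the Euclidean ground metric (conventions vary). It also makes the memory point in 2.) tangible: two diagrams of 10,000 points each give a 20,000 x 20,000 Float64 cost matrix, roughly 3.2 GB, before the matching even starts.

```julia
# Rough sketch of the matching-based Wasserstein distance described in 1.).
# Not Eirene's internal code; Hungarian.jl supplies the optimal matching.
using Hungarian, LinearAlgebra

# D1, D2 are n x 2 matrices of (birth, death) pairs.
function wasserstein_sketch(D1, D2; q=2)
    n1, n2 = size(D1, 1), size(D2, 1)
    n = n1 + n2                       # cardinality of each padded diagram
    C = zeros(n, n)                   # the (n1+n2) x (n1+n2) cost matrix from 2.)
    for i in 1:n, j in 1:n
        if i <= n1 && j <= n2         # real point matched to real point
            C[i, j] = norm(D1[i, :] .- D2[j, :])^q
        elseif i <= n1                # real point of D1 matched to the diagonal
            C[i, j] = (abs(D1[i, 2] - D1[i, 1]) / sqrt(2))^q
        elseif j <= n2                # real point of D2 matched to the diagonal
            C[i, j] = (abs(D2[j, 2] - D2[j, 1]) / sqrt(2))^q
        end                           # diagonal-to-diagonal pairs cost zero
    end
    assignment, cost = hungarian(C)   # optimal matching; O(n^3) in time
    return cost^(1 / q)
end

# Tiny example: a 2-point diagram against a 1-point diagram.
D1 = [0.0 1.0; 0.5 2.0]
D2 = [0.1 1.1]
wasserstein_sketch(D1, D2, q=2)
```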

If anything else comes up, please let us know.

IBMRK commented 4 years ago

Hi

Some feedback - the program is still running, and progressing fine, just slowly.

I expect (from the current runtime) that it will need a few more days.

Will revert with results as soon as they are available, thanks.

Eetion commented 4 years ago

Hello all, and apologies for the long delay.

As regards your list, IBMRK,

1) This will complement Yossi's answer, as it will address only the persistence computation, not the Wasserstein metric. My best suggestion would be the Julia profiler (https://github.com/timholy/ProfileView.jl), which has come a long way from where it started several years ago. I have used this and it can be quite practical in certain circumstances. A more principled walk through the code is certainly possible, eg with a debugger, but the internal workings are convoluted. Creating a simpler implementation is on the to-do list for future development!
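
A minimal way to get started, using only the standard-library Profile module (ProfileView.jl, linked above, renders the same data as a flame graph):

```julia
# Sample an Eirene run with the standard-library profiler.
using Eirene, Profile

pc = rand(9, 50)                               # small stand-in point cloud
eirene(pc, model="pc", maxdim=1)               # warm-up run, so compilation isn't profiled
@profile eirene(pc, model="pc", maxdim=1)      # profile the second run
Profile.print(format=:flat, sortedby=:count)   # flat listing of where the samples landed
```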

2) This question is a good one, and essentially intractable. The difference in memory cost and run time can be highly pronounced even for two data sets that seem quite similar (for example, if you compute the PH of a point cloud sampled uniformly from the unit cube in R^n, you'll get very different performance than if you first multiply the first coordinate of each point by 10, essentially stretching the cloud along the x axis). The process I always use is to start with small examples, progress to bigger ones, and attempt a rough interpolation. Two general rules of thumb:

(a) if you don't need to let your distance parameter sweep all the way out to infinity, but can set an upper bound strictly smaller than the diameter (= greatest distance between two points) of the point cloud, this can reduce computation costs dramatically;

(b) if you do need to let your distance parameter sweep out to infinity, then the computation will probably start to slow down considerably once you have > 1,000 points.

The best thing, though, is always to try some examples, since everything changes depending on the particular data set.
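
Rule (a) in code: if I remember the keyword correctly it is maxrad, but please check ?eirene on your installed version.

```julia
# Capping the filtration parameter (rule (a) above). The `maxrad` keyword is
# my recollection of Eirene's name for this cap; verify with `?eirene`.
using Eirene

pc = rand(9, 100)                                        # stand-in point cloud
C_full   = eirene(pc, model="pc", maxdim=2)              # sweeps out to the diameter
C_capped = eirene(pc, model="pc", maxdim=2, maxrad=0.5)  # far fewer simplices to build
```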

3) Seems we'll all have to wait for the next big thing from Yossi and @Chr1sWilliams.

Thanks!