aywagner / DIPOLE

DIstributed Persistence-Optimized Local Embeddings

Parallelize PH computation using joblib #4

Closed ulupo closed 2 years ago

ulupo commented 3 years ago

This is a "naive" (in the sense of most obvious) first go at parallelizing the PH computation across all available CPU cores. It addresses #1. Submitting as a draft because, while it is not throwing errors, I have not tested it for correctness at all. In my resource monitor, I do witness full CPU usage (all cores are being used), but independent checks and runtime benchmarks should be made.

(Note: the joblib library is already required by scikit-learn, so I am not adding it explicitly in the list of dependencies in environment.yml.)
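For context, a minimal sketch of the general pattern (not the actual DIPOLE code: `compute_persistence`, `subset_dms`, and the use of `ripser` as the PH backend are placeholders for illustration):

```python
import numpy as np
from joblib import Parallel, delayed
from ripser import ripser  # stand-in PH backend, not necessarily what DIPOLE uses


def compute_persistence(dm, maxdim=1):
    """Persistence diagrams of one subsampled distance matrix."""
    return ripser(dm, maxdim=maxdim, distance_matrix=True)["dgms"]


if __name__ == "__main__":
    # Toy data: a few random subsampled distance matrices.
    rng = np.random.default_rng(0)
    points = rng.random((200, 3))
    full_dm = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    subset_idx = [rng.choice(200, size=50, replace=False) for _ in range(4)]
    subset_dms = [full_dm[np.ix_(idx, idx)] for idx in subset_idx]

    # Serial: [compute_persistence(dm) for dm in subset_dms]
    # Parallel across all available cores:
    diagrams = Parallel(n_jobs=-1)(
        delayed(compute_persistence)(dm) for dm in subset_dms
    )
```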

ulupo commented 3 years ago

It took a while, but attached are my results from running RUNME.py to completion on my laptop. Some of the embeddings (e.g. the mammoth) look a bit different, but still plausible, I'd think. results_parallelized.zip

ulupo commented 3 years ago

Thanks for the review @aywagner! Embarrassingly, I must admit I have not yet run the serial version of the code. Perhaps the issue is with memory use? Did you notice similar slowdowns in the other datasets?

I will look into the mammoth case, but RUNME.py took a very long time to run, so it would be good to extract a relevant subset for testing this particular dataset. Could you maybe help me with this?

Do you have any comments on the final embeddings and scores as collected in my zip file? Do they look reasonable to you?

ulupo commented 3 years ago

@aywagner I have now tested against the serial version and do observe a slowdown, by an even higher factor than the one you report (about 3x slower). Memory does not seem to be the issue at my end.

This is obviously very curious. Something is causing large overheads, but it is not clear what that is. I'll investigate.

ulupo commented 3 years ago

Oh! I now notice two things:

So I think all we are observing is the overhead of allocating several processes that would be "ready to go", but only in fact ever running one.

ulupo commented 3 years ago

OK, I get it now. The overheads are caused by joblib making one copy of the full distance matrix (which is 2242 x 2242) per parallel process (even when only one process is used). This happens because we pass self.top_loss to the workers, and a full copy of self.top_loss, with all of its attributes, includes a copy of the full distance matrix.

A simple, if inelegant and perhaps unnecessary, solution is to forgo DistTopLoss and instead implement the computation as a function that takes only the subsampled distance matrices as arguments. You can see such an approach in my latest commit. I am not saying it is the best solution, but it completely resolves the observed slowdowns on my machine.
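As a rough sketch of the idea (the function and names below are hypothetical stand-ins, not the code in the commit):

```python
import numpy as np
from joblib import Parallel, delayed


def topological_loss_on_subset(sub_dm):
    # Hypothetical standalone loss: only the small subsampled distance
    # matrix is pickled and shipped to each worker, instead of a whole
    # object carrying the full 2242 x 2242 matrix as an attribute.
    return float(sub_dm.sum())  # placeholder computation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full_dm = rng.random((2242, 2242))
    subset_indices = [rng.choice(2242, size=100, replace=False) for _ in range(4)]

    # Build the small submatrices up front, then parallelize over them.
    sub_dms = [full_dm[np.ix_(idx, idx)] for idx in subset_indices]
    losses = Parallel(n_jobs=2)(
        delayed(topological_loss_on_subset)(dm) for dm in sub_dms
    )
```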

ulupo commented 3 years ago

Here is a simple benchmark on my laptop (2 physical cores) after setting num_subsets to 2 in the mammoth dataset. The 2500 iterations took 4m14s without parallelism (n_jobs=None) and 3m12s with parallelism (n_jobs=2). Given that other computations occur in each iteration (the metric loss, the backward pass, etc.), we can't expect perfect scaling. Still, I think these benchmarks demonstrate that the solution works. Furthermore, I think (to be tested) we can expect much better scaling when we increase the maximum homology dimension and/or subset sizes.
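For anyone wanting to reproduce the shape of this comparison without running RUNME.py, here is a generic timing pattern with a synthetic CPU-bound stand-in for the per-subset work (not the actual DIPOLE entry point):

```python
import time

import numpy as np
from joblib import Parallel, delayed


def subset_work(dm):
    # Arbitrary CPU-bound stand-in for the per-subset computation.
    return np.linalg.eigvalsh(dm @ dm.T).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dms = [rng.random((400, 400)) for _ in range(2)]  # num_subsets = 2

    for n_jobs in (None, 2):
        start = time.perf_counter()
        Parallel(n_jobs=n_jobs)(delayed(subset_work)(dm) for dm in dms)
        print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s")
```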

aywagner commented 3 years ago

@ulupo Nice catches! I think the new DimensionReductionNetMask with _create_submatrices is a reasonable way to avoid passing the entire distance matrix. You mentioned earlier that things are better when the threading backend in joblib is selected -- presumably prefer="threads" -- but I notice that the code currently has prefer=None?

ulupo commented 3 years ago

Thanks @aywagner. With the new solution, prefer="threads" no longer has an advantage over prefer=None -- in fact, it seems to be very slightly slower. Also, with the new solution I no longer observe slowdowns when we try to parallelize but num_subsets=1 -- presumably because the amount of data that has to be copied in each separate parallel process is now trivial.
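For reference, the switch being discussed is joblib's `prefer` argument; a generic illustration (not DIPOLE-specific code):

```python
from joblib import Parallel, delayed


def work(x):
    return x * x


if __name__ == "__main__":
    # prefer=None (the default): joblib typically uses the loky process
    # backend, so arguments are pickled and copied into each worker process.
    Parallel(n_jobs=2, prefer=None)(delayed(work)(i) for i in range(8))

    # prefer="threads": workers share memory, so nothing is copied, but
    # pure-Python work is limited by the GIL.
    Parallel(n_jobs=2, prefer="threads")(delayed(work)(i) for i in range(8))
```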

ulupo commented 3 years ago

Btw: should we consider applying the same solution to DimensionReductionNet as well, and removing the DistTopLoss class, which would then no longer be used?