It took a while, but attached are my results after running `RUNME.py` to completion on my laptop. Some of the embeddings (e.g. the mammoth) look a bit different, but I'd say they are still plausible.

results_parallelized.zip
Thanks for the review @aywagner! Embarrassingly, I must admit I have not yet run the serial version of the code. Perhaps the issue is with memory use? Did you notice similar slowdowns in the other datasets?
I will look into the mammoth case, but `RUNME.py` took a very long time to run, so it would be good to extract a relevant subset for testing this particular dataset. Could you maybe help me with this?
Do you have any comments on the final embeddings and scores as collected in my zip file? Do they look reasonable to you?
@aywagner I have now tested against the serial version and do observe a slowdown, by an even larger factor than the one you report (about 3x slower). Memory does not seem to be the issue at my end.
This is obviously very curious. Something is causing large overheads, but it is not clear what that is. I'll investigate.
Oh! I now notice two things:

- `num_subsets` is set to 1 in https://github.com/aywagner/DIPOLE/blob/2c6c8caad913e193803dd7579df6695b17f513b6/drmethods.py#L58-L83. Is this what we want?
- Things are better when the threading backend in `joblib` is selected.

So I think all we are observing is the overhead of allocating several processes that would be "ready to go", but only ever actually running one.
OK, I get it now. The overheads are caused by `joblib` making one copy of the full distance matrix (which is 2242 x 2242) per parallel process (even when only one process is used). This is because we use `self.top_loss`, and a full copy of `self.top_loss`, with all of its attributes, includes a copy of the full distance matrix.
A simple, if inelegant and maybe unnecessary, solution is to forgo `DistTopLoss` and implement it instead as a function which takes only the subsampled distance matrices as arguments. You can see such an approach in my latest commit. I am not saying it is the best solution, but it completely resolves the observed slowdowns on my machine.
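For concreteness, here is a minimal sketch of the idea (the function names and sizes are illustrative placeholders, not the actual DIPOLE code): instead of dispatching a bound method of an object that stores the full distance matrix (which `joblib` would pickle and copy into every worker), we pass a plain function that receives only the pre-sliced submatrices.

```python
import numpy as np
from joblib import Parallel, delayed


def subset_topological_loss(sub_dist):
    # Placeholder for the real per-subset persistence computation;
    # it just returns a cheap scalar so the example runs.
    return float(sub_dist.sum())


def total_topological_loss(full_dist, subset_indices, n_jobs=2):
    # Slice the small submatrices *before* dispatching to joblib, so each
    # worker receives only a (subset_size x subset_size) array rather than
    # a pickled object holding the full distance matrix.
    submatrices = [full_dist[np.ix_(idx, idx)] for idx in subset_indices]
    losses = Parallel(n_jobs=n_jobs)(
        delayed(subset_topological_loss)(sub) for sub in submatrices
    )
    return sum(losses)


rng = np.random.default_rng(0)
full_dist = rng.random((2242, 2242))
subset_indices = [rng.choice(2242, size=100, replace=False) for _ in range(2)]
print(total_topological_loss(full_dist, subset_indices))
```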
Here is a simple benchmark on my laptop (2 physical cores) after setting `num_subsets` to 2 in the mammoth dataset. The 2500 iterations took 4m14s without parallelism (`n_jobs=None`) and 3m12s with parallelism (`n_jobs=2`). Given that other computations occur in each iteration (the metric loss, the backward pass, etc.), we can't expect perfect scaling. Still, I think these benchmarks demonstrate that the solution works. Furthermore, I think (to be tested) we can expect much better scaling when we increase the maximum homology dimension and/or the subset sizes.
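In case it helps reproduce the comparison, here is a toy timing along the same lines (the per-subset workload is a stand-in, not the actual DIPOLE loss; sizes and iteration counts are made up):

```python
import time

import numpy as np
from joblib import Parallel, delayed


def fake_subset_work(sub_dist):
    # Stand-in for the per-subset persistence computation done each iteration.
    return np.linalg.eigvalsh(sub_dist @ sub_dist.T).sum()


rng = np.random.default_rng(0)
submatrices = [rng.random((300, 300)) for _ in range(2)]

# joblib treats n_jobs=None as a single worker (serial) unless a backend
# context overrides it, so this compares "no parallelism" vs 2 workers.
for n_jobs in (None, 2):
    start = time.perf_counter()
    for _ in range(50):  # mimic repeated iterations of the optimization loop
        Parallel(n_jobs=n_jobs)(
            delayed(fake_subset_work)(s) for s in submatrices
        )
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.1f}s")
```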
@ulupo Nice catches! I think the new `DimensionReductionNetMask` with `_create_submatrices` is a reasonable way to avoid passing the entire distance matrix. You mentioned earlier that things are better when the threading backend in `joblib` is selected -- presumably `prefer="threads"` -- but I noticed that the code currently has `prefer=None`?
Thanks @aywagner. With the new solution, `prefer="threads"` no longer has an advantage over `prefer=None` -- in fact, it seems to be very slightly slower. Also, with the new solution I no longer observe slowdowns when we try to parallelize but `num_subsets=1` -- presumably because the amount of data that has to be copied into each separate parallel process is now trivial.
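(For anyone following along, `prefer` is just the backend hint given to `joblib.Parallel`; a minimal illustration of the two settings being discussed:)

```python
from joblib import Parallel, delayed

# prefer=None lets joblib pick its default process-based backend ("loky"),
# which pickles the arguments and copies them into each worker process.
# prefer="threads" hints at the threading backend, which shares memory and
# avoids those copies, at the cost of running under the GIL.
for prefer in (None, "threads"):
    out = Parallel(n_jobs=2, prefer=prefer)(
        delayed(pow)(i, 2) for i in range(4)
    )
    print(prefer, out)
```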
Btw: should we think of applying the same solution also to `DimensionReductionNet`, and of removing the `DistTopLoss` class, which would then no longer be used?
This is a "naive" (in the sense of most obvious) first go at parallelizing the PH computation across all available CPU cores. It addresses #1. Submitting as a draft because, while it is not throwing errors, I have not tested it for correctness at all. In my resource monitor, I do witness full CPU usage (all cores are being used), but independent checks and runtime benchmarks should be made.
(Note: the `joblib` library is already required by `scikit-learn`, so I am not adding it explicitly to the list of dependencies in `environment.yml`.)