Closed · lorenzozanisi closed this 1 year ago
Hi @lorenzozanisi, did you set the `n_jobs` parameter to an integer larger than 1?
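For reference, it is normally passed when constructing the ensemble; a minimal sketch (where `MyCNN` stands in for your base network class):

```python
from torchensemble import VotingClassifier

# n_jobs > 1 lets joblib train the base estimators concurrently;
# n_jobs=None (or 1) trains them one after the other.
ensemble = VotingClassifier(
    estimator=MyCNN,   # placeholder for your nn.Module class
    n_estimators=5,
    cuda=True,
    n_jobs=5,
)
```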
Thanks for the fast reply @xuyxu. Yes, as you can see I call `with Parallel(n_jobs=self.num_models) as parallel`, where `self.num_models=5`.
Kind of strange, since the parallelism feature has been well tested. Could you provide your package versions of `joblib` and `torch`? I will take a closer look.
joblib = 1.1.1, torch = 1.8.1+cu111, python = 3.7.5
Note that, given how my environment is set up, I need that version of Python. With it, `joblib` throws an error because it calls `pickle` internally with `protocol=5`, which is supported only in Python >= 3.8. I substituted `import pickle` with `import pickle5 as pickle` in all the relevant places in `joblib`, and it now runs without errors. I don't think this is enough to break the parallelisation, though.
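Concretely, the substitution boils down to a small shim like this in each affected `joblib` module:

```python
import sys

# Fall back to the pickle5 backport on Python < 3.8, where the stdlib
# pickle module does not support protocol 5.
if sys.version_info >= (3, 8):
    import pickle
else:
    import pickle5 as pickle
```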
Here is my result when training the `VotingClassifier` in `examples/classification_cifar10_cnn` with joblib = 1.1.0, torch = 1.13.0, and python = 3.9:

| Setting | Training time | Evaluating time |
| --- | --- | --- |
| `n_jobs=None` (i.e., no parallelization) | 163.34 s | 3.21 s |
| `n_jobs=5` | 314.68 s | 1.02 s |

The speedup should be acceptable considering the large cost of pickling the models and copying the data.
Could you further provide the following information: the output of `nvidia-smi` when training a single model? If the GPU utilization is already high in that case, no speedup is actually expected, since the bottleneck lies in computation; joblib cannot bring any benefit unless a more powerful GPU is used.

Hi @xuyxu, I rebuilt my environment for Python 3.9 and now it works.
Hi
I want to train an ensemble of NNs on a single GPU in parallel. At the moment I am doing simply:
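Roughly, a plain sequential loop (the names here are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative sequential baseline: the ensemble members are trained
# one after another on the same GPU, with no overlap between trainings.
models = [MyNet().to("cuda") for _ in range(num_models)]

for model in models:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(num_epochs):
        for x, y in train_loader:
            x, y = x.to("cuda"), y.to("cuda")
            optimizer.zero_grad()
            loss = F.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
```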
However, this does not run in parallel, as there are CPU overheads that prevent multiple kernels from being launched. TorchEnsemble should deal with these overheads by using `joblib`'s `Parallel` and `delayed`; that is, I should be able to start one kernel for each NN training and thus parallelise the ensemble on the same GPU. However, this is not what I'm seeing. The code below is a slight re-implementation of your `BaggingRegressor`, and I am seeing the same training times as with my naive implementation above.
The dataset is just a standard PyTorch `Dataset` object. Each model is quite small, and the same goes for each batch of data, so I can definitely fit the whole ensemble on the GPU. Do you have any insight as to why I cannot parallelise my ensemble efficiently with the code below? Many thanks!
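For context, the core of my re-implementation follows the usual `Parallel`/`delayed` pattern, simplified here with illustrative names:

```python
from joblib import Parallel, delayed
import torch
import torch.nn.functional as F

def _train_single(model, train_loader, num_epochs, device="cuda"):
    # Train one ensemble member end-to-end and return it.
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(num_epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = F.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
    return model

# One joblib worker per ensemble member. Note that with joblib's default
# process-based backend, each model and the data it needs are pickled and
# shipped to a worker before training starts, which adds overhead.
with Parallel(n_jobs=num_models) as parallel:  # num_models = 5 in my case
    trained_models = parallel(
        delayed(_train_single)(model, train_loader, num_epochs)
        for model in models
    )
```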