LLNL / lbann

Livermore Big Artificial Neural Network Toolkit
http://software.llnl.gov/lbann/

Error facing while trying to do model parallelism #866

Open Raviteja1996 opened 5 years ago

Raviteja1996 commented 5 years ago

Hello, I am trying to run the model using model parallelism. I have made some changes in the prototext file and am attaching the changed file with this post. I am not getting the error I faced before, which I posted in #860, but a different one. There is no error when running with a single GPU, but I get the error when running with multiple GPUs. There is also no problem when running with data parallelism (multiple GPUs); the error only appears when trying model parallelism.

prototext file: https://drive.google.com/open?id=1wjG1wI-st5cACOueA4oPCiaPZ_hLU91W

command: I am in the model_zoo directory and I am using the command below

mpirun -np 4 lbann --model=models/resnet50/model_resnet50.prototext.orig --optimizer=optimizers/opt_adam.prototext --reader=data_readers/data_reader_imagenet.prototext --num_epochs=5

Error:

[smpowerai:30076] An error occurred in MPI_Ialltoall
[smpowerai:30076] reported by process [3765370881,3]
[smpowerai:30076] on communicator MPI_COMM_WORLD
[smpowerai:30076] MPI_ERR_INTERN: internal error
[smpowerai:30076] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[smpowerai:30076] and potentially your MPI job)
[smpowerai:30071] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[smpowerai:30071] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[smpowerai:30071] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

bvanessen commented 5 years ago

I will talk with the team tomorrow to see if they have any ideas. How did you compile LBANN, and in particular how did you compile Aluminum? Do you have the output from the run?


Raviteja1996 commented 5 years ago

output_log.txt

This is the output log I am getting after running the command.

Raviteja1996 commented 5 years ago

Hi, any update yet?

ndryden commented 5 years ago

We have two ideas:

Can you try building with CUB and see if it runs? If the bug is with SMPI, it will be harder to work around.
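
For reference, a rough sketch of what rebuilding with CUB enabled might look like. The option names below (Hydrogen_ENABLE_CUB, Hydrogen_ENABLE_CUDA) are assumptions based on Hydrogen's CMake options, not confirmed for this exact checkout, so verify them against the build documentation before using.

# Rebuild Hydrogen with CUB enabled, then rebuild LBANN against it.
# Option names are assumptions; check `cmake -LH` on your checkout.
cd /path/to/hydrogen/build
cmake .. \
  -DHydrogen_ENABLE_CUDA=ON \
  -DHydrogen_ENABLE_CUB=ON \
  -DCMAKE_INSTALL_PREFIX=/path/to/install
make -j install
# Then re-run the same mpirun command from model_zoo to see if the
# MPI_Ialltoall error persists.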

Raviteja1996 commented 5 years ago

Hello, I tried enabling CUB support but the error persists. I am attaching my log: log.txt

Raviteja1996 commented 5 years ago

Hi, is there any update or suggested approach for using multiple GPUs with model parallelism to overcome this problem?

ndryden commented 5 years ago

I have a better idea of what may be causing this. It will take a bit of time to get a full solution.

Raviteja1996 commented 5 years ago

Hi, I cloned the updated version of LBANN and tried to run the models. As before, I am able to run data parallelism with multiple GPUs. Now I am also able to run model parallelism with multiple GPUs, but it takes more than 10 times as long as data parallelism. I don't think that is the correct result. So has the problem with multi-GPU model parallelism actually been solved in the update, or is it only running because of some code change while producing strange results?

timmoon10 commented 5 years ago

It looks like your prototext file converts between data-parallel and model-parallel layouts many times. Each redistribution is fairly expensive (requires an all-to-all), so they should be done sparingly. For example, the AlexNet model in the model zoo performs all of the convolutions in a data-parallel layout, converts to model-parallel for the fully-connected layers, and converts back to data-parallel for the loss function.
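
To make that concrete, here is a minimal sketch of the layout pattern described above, loosely following the model zoo prototexts of that era; the layer names and field values are illustrative (not taken from the attached file), and exact fields may differ between LBANN versions. The convolutional part stays data-parallel, the layout switches to model-parallel once for the large fully-connected layers, and switches back to data-parallel for the loss.

# Convolutional layers stay data-parallel (illustrative values).
layer {
  name: "conv1"
  data_layout: "data_parallel"
  convolution {
    num_dims: 2
    num_output_channels: 64
    conv_dims_i: 7
    conv_pads_i: 3
    conv_strides_i: 2
    has_bias: true
  }
}
# Switch to model-parallel only for the large fully-connected layers,
# so there is a single data-parallel -> model-parallel redistribution.
layer {
  name: "fc"
  data_layout: "model_parallel"
  fully_connected {
    num_neurons: 4096
    has_bias: true
  }
}
# Switch back to data-parallel for the final classifier/loss,
# giving one model-parallel -> data-parallel redistribution at the end.
layer {
  name: "prob"
  data_layout: "data_parallel"
  softmax {}
}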

One more caveat: the fully-connected layer in ResNet-50 might be too small to get good results with a model-parallel layout. Matrix multiplication is so fast on GPUs that it is hard to get good parallel scaling, since communication times quickly come to dominate. We do see some benefit from a model-parallel layout in large fully-connected layers, since it uses less memory and requires less communication for gradient accumulation. We also observe that the scaling properties on CPU systems are less finicky.