Raviteja1996 opened this issue 5 years ago
I will talk with the team tomorrow to see if they have any ideas. How did you compile, especially how did you compile Aluminum? Do you have the output from the run?
On Feb 8, 2019, at 12:35 AM, Raviteja1996 notifications@github.com wrote:
Hello, I am trying to run the model using model parallelism. I have made some changes in the prototext file and am attaching the changed file with this post. I am not getting the error I faced before (which I posted in #860), but a different one. There is no error when running on a single GPU, but I get this error when running on multiple GPUs. There is also no problem when running with data parallelism on multiple GPUs; the error only appears with model parallelism.
Prototext file: https://drive.google.com/open?id=1wjG1wI-st5cACOueA4oPCiaPZ_hLU91W
Command (run from the model_zoo directory):
mpirun -np 4 lbann --model=models/resnet50/model_resnet50.prototext.orig --optimizer=optimizers/opt_adam.prototext --reader=data_readers/data_reader_imagenet.prototext --num_epochs=5
Error:
[smpowerai:30076] An error occurred in MPI_Ialltoall
[smpowerai:30076] reported by process [3765370881,3]
[smpowerai:30076] on communicator MPI_COMM_WORLD
[smpowerai:30076] MPI_ERR_INTERN: internal error
[smpowerai:30076] MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[smpowerai:30076] and potentially your MPI job)
[smpowerai:30071] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[smpowerai:30071] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[smpowerai:30071] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
This is the output log I am getting after running.
Hi, any update yet?
We have two ideas. One is an issue around MPI_Iallreduce, which you may be triggering here. The other: can you try building with CUB and seeing if it runs? If the bug is with SMPI, this will be harder to work around.
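In case it helps, a rough sketch of what building with CUB support can look like, assuming CUB is toggled through Hydrogen's CMake options when Hydrogen is configured and LBANN is then rebuilt against it; the option names below are assumptions and may not match the Hydrogen/LBANN version in use:

# Assumed option names; check the CMakeLists.txt of your Hydrogen version.
# CUB gives Hydrogen a pooled GPU memory allocator instead of raw cudaMalloc/cudaFree calls.
cmake -DHydrogen_ENABLE_CUDA=ON -DHydrogen_ENABLE_CUB=ON /path/to/hydrogen/source
make -j install
# Then reconfigure and rebuild LBANN against this Hydrogen install.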
Hello, I tried to enable CUB support but the error continues. I am attaching my log: log.txt
Hi, is there any update, or an approach for using multiple GPUs with model parallelism that gets around this problem?
I have a better idea of what may be causing this. It will take a bit of time to get a full solution.
Hi, I cloned the updated version of LBANN and tried running the models. As before, I am able to run data parallelism with multiple GPUs, and now I am also able to run model parallelism with multiple GPUs, but it takes more than 10 times as long as data parallelism. That does not look like a correct result to me. So has the problem with model parallelism on multiple GPUs actually been fixed, or is it only running because of some code change while producing strange results?
It looks like your prototext file converts between data-parallel and model-parallel layouts many times. Each redistribution is fairly expensive (requires an all-to-all), so they should be done sparingly. For example, the AlexNet model in the model zoo performs all of the convolutions in a data-parallel layout, converts to model-parallel for the fully-connected layers, and converts back to data-parallel for the loss function.
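For concreteness, here is a minimal prototext sketch of that pattern (the layer names, parents, and sizes are made up for illustration, and the exact field names may vary between LBANN versions; the relevant part is the per-layer data_layout field):

layer {
  name: "fc1000"
  parents: "avgpool"
  # switch to model-parallel only for the large fully-connected layer
  data_layout: "model_parallel"
  fully_connected {
    num_neurons: 1000
    has_bias: false
  }
}
layer {
  name: "prob"
  parents: "fc1000"
  # convert back to data-parallel for the loss
  data_layout: "data_parallel"
  softmax {}
}

Each place where data_layout changes between a layer and its parent triggers one of the redistributions mentioned above, so switching once on the way into the fully-connected block and once on the way out keeps the cost to a couple of all-to-alls per pass.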
One more caveat: the fully-connected layer in ResNet-50 might be too small to get good results with a model-parallel layout. Matrix multiplication is so fast on GPUs that it's hard to get good parallel scaling since communication times quickly dominate. We see some benefit for a model-parallel layout in large fully-connected layers since it uses less memory and has less communication for the gradient accumulation. We also observe that the scaling properties on CPU systems are less finicky.