Closed Nqabz closed 7 years ago
Hi, thanks for your question! I will be able to address this question in a couple of days, and will do so at my first opportunity. Sorry for the delay!
Adam thanks for note!
Here is an update on my question: I installed NCCL 1.3.4 and the code seems to compile, but now I am getting the following error from synkhronos (still running the lasagne_mnist.py example):
Starting training...
Traceback (most recent call last):
File "/home/model_params/synkhronos/demos/lasagne_mnist.py", line 396, in <module>
main(**kwargs)
File "/home/model_params/synkhronos/demos/lasagne_mnist.py", line 326, in main
train_err += train_fn(X_train_synk, y_train_synk, batch=batch)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/function_module.py", line 487, in __call__
self._share_input_data(ordered_inputs, batch, batch_s)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/function_module.py", line 367, in _share_input_data
scatterer.assign_inputs(synk_inputs, batch, self._n_scat)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/scatterer.py", line 90, in assign_inputs
batch = check_batch_types(batch)
File "/opt/conda/lib/python3.5/site-packages/synkhronos/scatterer.py", line 145, in check_batch_types
if not np.issubdtype(batch, int):
File "/opt/conda/lib/python3.5/site-packages/numpy/core/numerictypes.py", line 761, in issubdtype
return issubclass(dtype(arg1).type, val)
TypeError: data type not understood
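For context, the TypeError comes from np.issubdtype receiving a first argument that numpy cannot interpret as a dtype. A minimal numpy-only reproduction (independent of synkhronos; the slice here is just a stand-in for whatever batch object was passed):

```python
import numpy as np

# np.issubdtype expects a dtype-like first argument (a scalar type,
# a dtype object, or a dtype string) -- not an arbitrary Python object.
assert np.issubdtype(np.int64, np.integer)
assert np.issubdtype(np.dtype('int32'), np.integer)

# Passing something numpy cannot interpret as a dtype -- e.g. a slice,
# as a `batch` argument might be -- raises TypeError (reported as
# "data type not understood" in the traceback above).
try:
    np.issubdtype(slice(0, 5), np.integer)
    raised = False
except TypeError:
    raised = True
assert raised
```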
Thanks,
Nqabz
This error should be fixed with the latest push.
I will review the lasagne-mnist demo and might update it soon. :)
Great. It will be appreciated if you can update the lasagne-mnist demo too. I will check out the latest push for the other demos and let you know.
OK I just updated the lasagne_mnist demos and they should all be working. While in your synkhronos directory, use "demos/run_lasagne_demos.py" to get a quick view of the speedup. You might need to use a large batch size for good speedup.
Then go into the demo files and see how to set it up. :)
Adam, do you mind if I ask how you built your Python 3 and Theano to work with the latest NCCL toolkit? Are you using NCCL v2?
Are you able to share your DGX container's dockerfile, minus any proprietary packages that you may be using?
Thanks so much for updating lasagne-mnist; I will test it tomorrow.
The exact Theano version I'm using is: '0.9.0.dev-e79c4e4c83c5a4907ea7fddf073fd2d659df7486' I have maybe one or two things I've tweaked for convenience but nothing for functionality.
I'm using NCCL 1.3.2 on a 2-GPU workstation. I also have v1 on a DGX-1, no v2 yet.
I haven't pulled Theano in a long time, but I'll do that soon, as I think some fixes in 0.10 will help with bugs found through here. :)
Also, I haven't set up a DGX docker file; I've just been running directly on the machine.
OK, it now makes sense when you mention that you are using NCCL 1.3.2. I went for v2, and the API has changed quite a lot. With the API changes in v2, the pygpu package is acting out :(.
I will write back to this thread once I have some answers. I have reached out to NVIDIA Enterprise support for this.
Thanks so much for the great work!
I might look at contributing the docker-file if it can be of use to someone.
I did a quick test of the new code (pushed on August 23) using a docker container.
Building and compiling is quick. Distribution of functions completes in ~43 s for the cnn model. Thereafter the training does not start; it seems to wait indefinitely and the code hangs. Is it supposed to wait for more than 10 minutes?
Comparing with the old code on the very same 'cnn' model: distribution of functions takes 33 s, and thereafter (in less than a second) training begins on all 8 GPUs.
I am not sure what to make of these discrepancies.
That sounds like a problem! I've been testing on my 2-GPU machine; I should be able to get back on the DGX in a day or so, and will look at this right away. You could try running with synk.fork(n_gpus=2) and see what happens.
In the meantime, a quick hack to speed up the distribution is to change Theano/theano/gpuarray/dnn.py, maybe somewhere around line 275, where it says version.v = None, and change it to say version.v = 6020 or whatever version of cudnn you have. This is how I run. The long synk.distribute() time for 8 GPUs comes from them all fighting for the compile lock while figuring out the cudnn version.
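A sketch of the edit just described (shown as a non-runnable fragment; the line number and surrounding code come from the comment above and are not verified against any particular Theano commit):

```python
# Theano/theano/gpuarray/dnn.py, near line 275 (per the comment above).

# Before -- Theano probes the installed cuDNN at compile time, and
# 8 workers doing this simultaneously contend on the compile lock:
#     version.v = None

# After -- hard-code the cuDNN version actually installed,
# e.g. 6020 for cuDNN 6.0.20 (substitute your own version number):
#     version.v = 6020
```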
Adam thanks for the tips.
I can confirm that the "lasagne_mnist_gpu_data.py" "cnn" model gives a clean run with synk.fork(2), synk.fork(3), ..., synk.fork(7). With synk.fork(8) it hangs after:
:Mapped name None to device cuda7: Tesla P100-SXM2-16GB (0000:8A:00.0)
Synkhronos: 8 GPUs initialized, master rank: 0
Building model and compiling functions...
Synkhronos distributing functions...
...distribution complete (42 s).
Could this be because the code needs to reserve one GPU to fork the other subprocesses on the remaining GPUs? I thought forking was happening from the CPU. Very strangely, with synk.fork(8) the code does train to completion for the "mlp" model.
I am able to recreate this problem, and I've pinned it to a faulty socket connection inside the cpu-comm unit (which uses ZeroMQ), used in scatter(). Mysterious behavior, as it doesn't happen every time, but it is always the last GPU.
An equally valid alternative that avoids the ZeroMQ-based cpu-comm unit is to first build a synk data object and pass that to the scatter command. For example, replace synk.scatter(x_var, some_numpy_array) with data = synk.data(some_numpy_array); synk.scatter(x_var, data).
I'll keep you posted as I figure out what's going on.
OK found it. There was a tiny typo breaking the ZeroMQ socket connection of the last worker. Pushed the fix already, should be working now.
Thanks for finding that!
Adam, thanks for getting back to this. I did a quick test of the new push. My run still gets stuck in ZeroMQ for the 'cnn' model. Did you check with the 'cnn' model as default? I see your push still has the 'mlp' as default.
Using cuDNN version 6021 on context None
Mapped name None to device cuda5: Tesla P100-SXM2-16GB (0000:86:00.0)
Synkhronos: 8 GPUs initialized, master rank: 0
Building model and compiling functions...
Synkhronos distributing functions...
...distribution complete (41 s).
Scattering data to GPUs.
Strangely, the script 'lasagne_mnist_cpu_data.py' runs on all 8 GPUs while 'lasagne_mnist_gpu_data.py' runs on 7 GPUs. Internally the two scripts look identical.
Earlier you suggested the test "demos/run_lasagne_demos.py". How do you differentiate between the two scripts, 'lasagne_mnist_cpu_data.py' and 'lasagne_mnist_gpu_data.py', given that they both fork and bind to GPUs?
Yes, mine runs with the cnn model, all 8 GPUs. Can you check that in your synkhronos/comm.py, line 64, it now says:
pub_port = pub_socket.bind_to_random_port(
Previously it was:
pub_port = socket.bind_to_random_port(
This should be all that is needed to fix. If not, please let me know and we'll reopen!
The lasagne_mnist_cpu_data.py script does not use the ZeroMQ-based communication.
The difference between the scripts is that in the cpu_data one, all the data is held on the CPU and sent to the GPU at each function call. The gpu_data script puts all the data on the GPUs ahead of time and simply sends the GPUs a set of random indexes at each function call. For a large enough batch size, you should see some speedup in gpu_data.
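The two feeding styles can be caricatured with plain numpy (illustration only; the real scripts use synkhronos functions with Theano shared variables, and the arrays here merely stand in for GPU memory):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 4).astype('float32')
mean_fn = lambda batch: batch.mean()   # stand-in for a compiled function

# cpu_data pattern: the minibatch itself is shipped at every call.
def call_cpu_style(fn, X, idxs):
    return fn(X[idxs])            # data crosses the CPU->GPU boundary here

# gpu_data pattern: data is staged once; each call ships only indexes.
staged_X = X.copy()               # stand-in for data already on the GPUs
def call_gpu_style(fn, idxs):
    return fn(staged_X[idxs])     # only idxs crosses the boundary

idxs = np.arange(10)
assert np.isclose(call_cpu_style(mean_fn, X, idxs),
                  call_gpu_style(mean_fn, idxs))
```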
Indeed it is set to pub_port = pub_socket.bind_to_random_port(
What version of ZeroMQ are you using? Perhaps I can check to match. Thanks for clarifying the difference between ..._cpu_data.py and ..._gpu_data.py.
You're welcome!
Hmm, this is interesting then... pyzmq 16.0.2 (the latest version in conda for Python 3.5). Let me switch over to another computer and I'll push a test I was using...
OK, there is now a test at tests/zeromq_test.py, which does the same thing as starting up ZeroMQ in synkhronos (but the test does not call anything in synkhronos). Give that a try and let me know the result...it should run through 7 workers, have them all receive a test string, and then exit.
Edit: you can also try tests/cpu_comm_test.py, which does use synkhronos but is a much simpler test than the lasagne demo.
I just ran tests/cpu_comm_test.py this morning. It appears to have completed successfully. Is the following in line with what you expected?
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
master had pair port 1 : 49836
master had pair port 2 : 50986
master had pair port 3 : 28296
master had pair port 4 : 63531
master had pair port 5 : 20008
master had pair port 6 : 57752
master had pair port 7 : 18118
skipping port idx 0
sending test string in loop, 1
1 connecting to port 49836
1 polling for test string
sending test string in loop, 2
1 result of poll: 1
1 attempting to receive string
1 passed recv test
2 connecting to port 50986
2 polling for test string
sending test string in loop, 3
2 result of poll: 1
2 attempting to receive string
2 passed recv test
3 connecting to port 28296
3 polling for test string
sending test string in loop, 4
3 result of poll: 1
3 attempting to receive string
3 passed recv test
4 connecting to port 63531
4 polling for test string
sending test string in loop, 5
4 result of poll: 1
4 attempting to receive string
4 passed recv test
5 connecting to port 20008
5 polling for test string
sending test string in loop, 6
5 result of poll: 1
5 attempting to receive string
5 passed recv test
6 connecting to port 57752
6 polling for test string
sending test string in loop, 7
6 result of poll: 1
6 attempting to receive string
6 passed recv test
7 connecting to port 18118
7 polling for test string
done with test string loop
7 result of poll: 1
7 attempting to receive string
7 passed recv test
Yes, that looks correct from zeromq_test.py. And cpu_comm_test.py should run and close without any output, but not hang.
If both of these work for you but lasagne_mnist_gpu_data.py still hangs when scattering data...then this is mysterious. Do you have your synkhronos pip installed as editable?
I just checked: cpu_comm_test.py hangs as well??
I do have synkhronos installed in a docker container; I'm not sure yet if it's editable. I will check. Is there something you would suggest that I change?
OK, I'm not sure how it works with docker, but my guess is that the container has not incorporated the updates to synkhronos?
When I run in the native OS (including in a conda env), you can do pip install -e . (with the period) from your local folder of the git repo. Then when you git pull, all the changes are applied without having to reinstall with pip.
Ummm... that might be the problem. In a day or two I will rebuild the container and retest.
ok sounds good, let me know the result.
also beware I just renamed the repository from "synkhronos" to "Synkhronos", in keeping with Theano, Lasagne, etc.
@astooke Both tests work after rebuilding the container. However, I have a follow-up. Not sure if I should open another issue.
It seems the script stalls at "Building model and compiling functions..." after a few seconds when running the current cnn model with my data:
Using cuDNN version 6021 on context None
Mapped name None to device cuda5: Tesla P100-SXM2-16GB (0000:86:00.0)
Synkhronos: 6 GPUs initialized, master rank: 0
Building model and compiling functions...
stops after compiling. My training data is of this size:
X_train is : (16000, 1, 144, 144)
y_train is : (16000,)
Seems the code does not get past the writing of data into shared memory:
# Write data into input shared memory
X_train_synk, y_train_synk = train_fn.build_inputs(X_train, y_train)
Never mind...found the culprit. I had to increase --shm-size to --shm-size="2048m".
I am aiming to test my large models on a cluster with more than 2000 nodes (each node has 4 P100 GPU cards). What are your experiences with scaling Synkhronos beyond one DGX-1 box?
There is a bug in your latest push for train_mnist_gpu_data.py; I think it's due to the change of folder structure and some typos.
grad_updates, param_updates, grad_shared = updates.nesterov_momentum(
loss, params, learning_rate=0.01, momentum=0.9)
learning_rate is not a keyword argument anymore in the imported instance of updates. The latest pushed code works based on a positional argument for the learning rate.
I have resorted to fixing my directory structure based on your previous push, and it's working fine.
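The breakage pattern here is the usual one for a keyword rename; a toy illustration (a stand-in function, not Lasagne's actual updates module):

```python
# Toy stand-in for the renamed signature: when a keyword changes (e.g.
# from lr to learning_rate), callers using the old name get a TypeError,
# while positional callers keep working across the rename.
def nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.9):
    return learning_rate, momentum    # stand-in body: echo the hyperparams

assert nesterov_momentum('loss', 'params', 0.05) == (0.05, 0.9)       # positional: OK
assert nesterov_momentum('loss', 'params', learning_rate=0.05)[0] == 0.05

try:
    nesterov_momentum('loss', 'params', lr=0.05)    # old keyword name
    failed = False
except TypeError:
    failed = True
assert failed
```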
Hmm, it is running for me, so maybe that was a temporary problem in the middle of a bunch of updates?
I had previously programmed the kwarg lr but changed it to learning_rate in keeping with Lasagne.
Thanks for being patient as things are changing rapidly...I hope this has settled out and I'm only doing documentation and typo fixes now.
Thanks for your tips and the great package.
What are your experiences with scaling Synkhronos beyond one DGX-1 box? I am looking at 2000 nodes (each with 4 P100 GPU cards) and training with more than 40 GB of image data.
Wow that is a lot of distributed compute! Starting a new issue (#12) for that discussion.
Is the issue with the demo fixed so we can close this?
Correct - let's close this issue. The demo works and has helped shape my current models. Thanks for opening the issue to discuss multinode support.
Looks like something is going wrong. The code terminates after printing "...distribution complete". I am running this on a DGX-1 container. Is this something expected when training within a DGX-1 container?