astooke / Synkhronos

Extension to Theano for multi-GPU data parallelism
MIT License

Error on fork() #2

Closed mharradon closed 7 years ago

mharradon commented 7 years ago
Synkhronos: 16 GPUs initialized, master rank: 0
Using cuDNN version 5110 on context None
Mapped name None to device cuda9: Tesla K80 (0000:00:18.0)
Process Process-11:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/worker.py", line 56, in worker_main
    connect_as_worker(n_parallel, rank, master_rank, use_gpu)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/comm.py", line 35, in connect_as_worker
    gpu = GpuCommWorker(n_parallel, rank, master_rank)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/comm.py", line 262, in __init__
    clique_id = gpu_coll.GpuCommCliqueId(gpu_ctx)
  File "pygpu/collectives.pyx", line 33, in pygpu.collectives.GpuCommCliqueId.__cinit__ (pygpu/collectives.c:2619)
  File "pygpu/collectives.pyx", line 316, in pygpu.collectives.comm_generate_id (pygpu/collectives.c:5472)
pygpu.gpuarray.GpuArrayException: b'Error loading library'
Using cuDNN version 5110 on context None
Mapped name None to device cuda12: Tesla K80 (0000:00:1B.0)
Process Process-14:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/worker.py", line 56, in worker_main
    connect_as_worker(n_parallel, rank, master_rank, use_gpu)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/comm.py", line 35, in connect_as_worker
    gpu = GpuCommWorker(n_parallel, rank, master_rank)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/comm.py", line 262, in __init__
    clique_id = gpu_coll.GpuCommCliqueId(gpu_ctx)
  File "pygpu/collectives.pyx", line 33, in pygpu.collectives.GpuCommCliqueId.__cinit__ (pygpu/collectives.c:2619)
  File "pygpu/collectives.pyx", line 316, in pygpu.collectives.comm_generate_id (pygpu/collectives.c:5472)
pygpu.gpuarray.GpuArrayException: b'Error loading library'
Using cuDNN version 5110 on context None
Mapped name None to device cuda2: Tesla K80 (0000:00:11.0)

...

Traceback (most recent call last):
  File "./runMyCode.py", line 91, in <module>
    run(BGA_params,train_params,opt_dict)
  File "/home/ubuntu/MyCode.py", line 803, in run
    BGA = AAR(**BGA_params)
  File "/home/ubuntu/MyCode.py", line 214, in __init__
    self.build()
  File "/home/ubuntu/MyCode.py", line 268, in build
    synk.fork()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/forking.py", line 35, in fork
    connect_as_master(n_parallel, master_rank, master_rank, use_gpu)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/comm.py", line 28, in connect_as_master
    gpu = GpuCommMaster(n_parallel, rank, master_rank)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/comm.py", line 262, in __init__
    clique_id = gpu_coll.GpuCommCliqueId(gpu_ctx)
  File "pygpu/collectives.pyx", line 33, in pygpu.collectives.GpuCommCliqueId.__cinit__ (pygpu/collectives.c:2619)
  File "pygpu/collectives.pyx", line 316, in pygpu.collectives.comm_generate_id (pygpu/collectives.c:5472)
pygpu.gpuarray.GpuArrayException: b'Error loading library'

pygpu.version.fullversion '0.6.4'

pygpu.test() passes, and the code runs fine on cuda0 without Synkhronos.

Using 'device=cpu,force_device=True' in THEANO_FLAGS.
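For reference, a minimal sketch of the failing call path (the flags and the synk.fork() call are the same as in my code; everything else is stripped out):

import os
# Master process stays on the CPU; each worker maps its own cuda<N> device after forking.
os.environ["THEANO_FLAGS"] = "device=cpu,force_device=True"
import synkhronos as synk
synk.fork()  # raises pygpu.gpuarray.GpuArrayException: b'Error loading library'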

Thanks for any tips!

astooke commented 7 years ago

Have you installed NCCL? https://github.com/NVIDIA/nccl

I thought I would catch this by failing to import pygpu collectives, but apparently that import succeeds even without NCCL. Thanks!
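In case it helps, a quick standalone check for whether NCCL can actually be loaded is to repeat the failing calls outside Synkhronos (pygpu.init and GpuCommCliqueId are the same calls from the traceback above; "cuda0" is just an example device):

import pygpu
from pygpu import collectives, gpuarray

ctx = pygpu.init("cuda0")  # open a context on one GPU
try:
    collectives.GpuCommCliqueId(ctx)  # the call that needs libnccl
    print("NCCL loaded OK")
except gpuarray.GpuArrayException as err:
    print("NCCL could not be loaded:", err)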

mharradon commented 7 years ago

I installed both packages from here: https://github.com/NVIDIA/nccl/releases

Appears to be compiling now :D

I'm getting this error now on distribute():

Mapped name None to device cuda8: Tesla K80 (0000:00:17.0)
Mapped name None to device cuda15: Tesla K80 (0000:00:1E.0)
Mapped name None to device cuda7: Tesla K80 (0000:00:16.0)
Mapped name None to device cuda10: Tesla K80 (0000:00:19.0)
Mapped name None to device cuda14: Tesla K80 (0000:00:1D.0)
Mapped name None to device cuda13: Tesla K80 (0000:00:1C.0)
Mapped name None to device cuda2: Tesla K80 (0000:00:11.0)
Mapped name None to device cuda11: Tesla K80 (0000:00:1A.0)
Mapped name None to device cuda9: Tesla K80 (0000:00:18.0)
Mapped name None to device cuda5: Tesla K80 (0000:00:14.0)
Mapped name None to device cuda4: Tesla K80 (0000:00:13.0)
Mapped name None to device cuda3: Tesla K80 (0000:00:12.0)
Mapped name None to device cuda1: Tesla K80 (0000:00:10.0)
Mapped name None to device cuda0: Tesla K80 (0000:00:0F.0)
Mapped name None to device cuda6: Tesla K80 (0000:00:15.0)
Mapped name None to device cuda12: Tesla K80 (0000:00:1B.0)
Synkhronos: 16 GPUs initialized, master rank: 0
Dumped network architecture to network_desc.txt
Setting output nodes
Building function...
Synkhronos distributing functions...
Traceback (most recent call last):
  File "./runMyCode.py", line 91, in <module>
    run(BGA_params,train_params,opt_dict)
  File "/home/ubuntu/MyCode.py", line 801, in run
    BGA = AAR(**BGA_params)
  File "/home/ubuntu/MyCode.py", line 224, in __init__
    self.train_fn = self.build_train_fn(loss,losses)
  File "/home/ubuntu/MyCode.py", line 599, in build_train_fn
    synk.distribute()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/function_builder.py", line 134, in distribute
    with open(PKL_FILE, "wb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/pkl/synk_f_dump_76722.pkl'

The pkl directory did not exist, so I created it and am running again now.
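For reference, this is the one-liner I used to create it (the path is the one from my traceback; adjust it for your install):

import os
# Directory Synkhronos tries to dump the pickled functions into.
pkl_dir = "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/pkl"
os.makedirs(pkl_dir, exist_ok=True)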

astooke commented 7 years ago

Right! I've just created this directory with a dummy file in it. Thanks!

mharradon commented 7 years ago

I changed PKL_PATH to a directory in /dev/shm to get a small performance boost; my box was also running out of primary disk šŸ™„. Since you've already got Unix-y dependencies, maybe that would be a better default? The user could override it with an environment variable or a config file.
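Something like this is roughly what I had in mind (SYNK_PKL_DIR is just a made-up name for the override, not an existing option):

import os
import tempfile

# Prefer a tmpfs mount like /dev/shm when it exists, otherwise fall back to the
# system temp dir, and let the user override the location via an environment variable.
_default_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
PKL_PATH = os.environ.get("SYNK_PKL_DIR", _default_dir)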

My functions take a long time to build, so the debug cycle is slow, but almost there I think!

nouiz commented 7 years ago

To speed up Theano compilation while developing, you can use this flag:

optimizer=fast_compile

It won't apply stability optimizations; to keep those, use optimizer=stabilize.

The execution speed will be slower, but compilation will be faster.
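For example, you can add it to the flags you already use, as long as it is set before theano is imported:

import os
# Append optimizer=fast_compile to the existing flags; THEANO_FLAGS is read at import time.
os.environ["THEANO_FLAGS"] = "device=cpu,force_device=True,optimizer=fast_compile"
import theano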


mharradon commented 7 years ago

Good point nouiz, I always forget to do that when debugging!

I think I've solved the original issues here - I'll open up another issue for anything else for posterity. I'll see how far I can get!

astooke commented 7 years ago

Good idea to allow a smarter pickle path... it really can be anywhere with read/write permissions.

Would you mind saying how long it takes to distribute the functions vs compiling them in the first place? Pickling happens very fast, but it takes a while to unpickle because all the workers are fighting for the compile lock. Ideally we could have the workers do less work on the function when unpickling...something I'll bring up with @nouiz :)

Are your functions carrying large amounts of data in shared variables? I've also thought about using the pickling mode which does not store any function data, and then just using the broadcast functionality already here to get all the workers to initialize with the same shared variable values. So far my functions have only carried a few MB of shared data, but if it's more like GB maybe this is worth doing.

mharradon commented 7 years ago

I think right now I'm at something like a 20-minute compile (with fast_run) and a 10-minute distribute, with the pickle in /dev/shm. My model is about 1 GB, but it's not a huge bottleneck for me right now.