braceal / molecules

Machine learning for molecular dynamics.
MIT License

"CUDA error: invalid device ordinal" when non 0 gpus are specified. #26

Closed braceal closed 4 years ago

braceal commented 4 years ago
python examples/pytorch/example_vae.py -i ../data/contact_maps.h5 -o ../output/ -m 3 -t symmetric -e 2 -b 128 -E 1 -D 2
CUDA devices:  1,2
Traceback (most recent call last):
  File "examples/pytorch/example_vae.py", line 141, in <module>
    main()
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "examples/pytorch/example_vae.py", line 81, in main
    gpu=(encoder_gpu, decoder_gpu))
  File "/lambda_stor/homes/abrace/molecules/molecules/ml/unsupervised/vae/vae.py", line 167, in __init__
    self.model = VAEModel(input_shape, hparams, self.device)
  File "/lambda_stor/homes/abrace/molecules/molecules/ml/unsupervised/vae/vae.py", line 32, in __init__
    self.decoder.to(device.decoder)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 443, in to
    return self._apply(convert)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/lambda_stor/homes/abrace/molecules/conda-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 441, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

Branch: feature/multi-gpu-vae

This happens when we set the encoder GPU to 1 and the decoder GPU to 2 (-E 1 -D 2). Any idea what is going on here? @atrifan2

Settings that work: -E 0 -D 1, -E 1 -D 0
Settings that don't: -E 0 -D 2, -E 1 -D 1

Perhaps this could help: https://github.com/allenai/allennlp/issues/1090

atrifan2 commented 4 years ago

This is interesting; I've never seen this before. I'll try to reproduce it today.

Anda


From: Alex Brace notifications@github.com
Sent: Tuesday, July 21, 2020 11:00:03 PM
To: braceal/molecules molecules@noreply.github.com
Cc: Trifan, Anda atrifan2@illinois.edu; Mention mention@noreply.github.com
Subject: [braceal/molecules] "CUDA error: invalid device ordinal" when non 0 gpus are specified. (#26)


hengma1001 commented 4 years ago

I am not sure how PyTorch handles GPUs, but try running export CUDA_VISIBLE_DEVICES=1,2 in the shell and then pass -E 0 -D 1.
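To illustrate the suggestion above, here is a minimal sketch in plain Python (no CUDA calls) of how a global GPU id maps to the ordinal a process would use once CUDA_VISIBLE_DEVICES is set. The helper name `local_ordinal` is hypothetical and not part of this repo or of PyTorch, and it assumes the variable holds a plain comma-separated list of integer ids.

```python
def local_ordinal(visible_devices: str, global_id: int) -> int:
    """Return the ordinal a process would use for `global_id`.

    `visible_devices` is the value of CUDA_VISIBLE_DEVICES, e.g. "1,2".
    Hypothetical helper for illustration; handles plain integer ids only.
    """
    visible = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    if global_id not in visible:
        raise ValueError(f"GPU {global_id} is not visible (visible: {visible})")
    # Visible devices are renumbered from 0 in the order they are listed.
    return visible.index(global_id)


# With CUDA_VISIBLE_DEVICES=1,2 the process sees two devices, numbered 0 and 1.
encoder = local_ordinal("1,2", 1)
decoder = local_ordinal("1,2", 2)
print(encoder, decoder)  # prints "0 1", i.e. run with -E 0 -D 1
```

Under that assumption, physical GPUs 1 and 2 are addressed as 0 and 1 inside the process, which is why -E 0 -D 1 is the suggested setting.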

braceal commented 4 years ago

Don't set the GPU environment variable inside the script; set it outside, in the shell. When CUDA_VISIBLE_DEVICES=2,3 is set, the global device IDs 2 and 3 become 0 and 1 within that process.
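One way to surface this earlier than the `.to(device)` call is to validate the requested ordinals against the number of devices the process can see. The sketch below is only an illustration: `check_ordinals` is a hypothetical name, and in a real script the count would presumably come from `torch.cuda.device_count()` rather than being passed in by hand.

```python
def check_ordinals(requested, device_count):
    """Raise early if any requested ordinal is outside 0..device_count-1.

    Hypothetical guard for illustration; `device_count` stands in for the
    number of devices visible to this process.
    """
    bad = [i for i in requested if not 0 <= i < device_count]
    if bad:
        raise ValueError(
            f"invalid device ordinal(s) {bad}: this process sees "
            f"{device_count} device(s), valid ordinals start at 0"
        )


# Two visible devices (e.g. CUDA_VISIBLE_DEVICES=2,3): ordinals 0 and 1 are valid.
check_ordinals((0, 1), device_count=2)  # passes silently

try:
    # -E 1 -D 2 with only two visible devices: ordinal 2 does not exist.
    check_ordinals((1, 2), device_count=2)
except ValueError as err:
    print(err)
```

This assumes the failure comes from an ordinal that exceeds the visible device count; it does not try to explain every failing combination reported above.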