NVlabs / nvdiffrast

Nvdiffrast - Modular Primitives for High-Performance Differentiable Rendering

Got cudaErrorInvalidDevice error when not using gpu=0 #12

Closed yataoz closed 3 years ago

yataoz commented 3 years ago

Hi,

I was able to run the cube.py example on GPU 0. But when I switched to another GPU by setting CUDA_VISIBLE_DEVICES in the Docker container, I got the error below. I'm pretty sure all GPUs are exposed to the container because (1) running nvidia-smi in the container reports all GPUs correctly and (2) a simple TensorFlow test also worked with your Docker image. The error seems to happen only in the rasterizer op. Could the rasterizer op have a bug that restricts it to GPU 0 on a machine?

Mesh has 12 triangles and 8 vertices.
Setting up TensorFlow plugin "tf_all.cu": Preprocessing... Compiling... Loading... Done.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Cuda error: cudaErrorInvalidDevice[cudaGraphicsGLRegisterBuffer(&s.cudaPosBuffer, s.glPosBuffer, cudaGraphicsRegisterFlagsWriteDiscard);]
     [[{{node RasterizeFwd_1}}]]
     [[Mean_1/_37]]
  (1) Internal: Cuda error: cudaErrorInvalidDevice[cudaGraphicsGLRegisterBuffer(&s.cudaPosBuffer, s.glPosBuffer, cudaGraphicsRegisterFlagsWriteDiscard);]
     [[{{node RasterizeFwd_1}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "samples/tensorflow/cube.py", line 201, in <module>
    main()
  File "samples/tensorflow/cube.py", line 192, in main
    fit_cube(max_iter=5000, resolution=resolution, discontinuous=discontinuous, log_interval=10, display_interval=display_interval, out_dir=out_dir, log_fn='log.txt', imgsave_interval=1000, imgsavefn='img%06d.png')
  File "samples/tensorflow/cube.py", line 124, in fit_cube
    glval, = util.run([geom_loss, train_op], {mtx_in: r_mvp, lr_in: lr})
  File "/app/samples/tensorflow/util.py", line 257, in run
    return tf.get_default_session().run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Cuda error: cudaErrorInvalidDevice[cudaGraphicsGLRegisterBuffer(&s.cudaPosBuffer, s.glPosBuffer, cudaGraphicsRegisterFlagsWriteDiscard);]
     [[node RasterizeFwd_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[Mean_1/_37]]
  (1) Internal: Cuda error: cudaErrorInvalidDevice[cudaGraphicsGLRegisterBuffer(&s.cudaPosBuffer, s.glPosBuffer, cudaGraphicsRegisterFlagsWriteDiscard);]
     [[node RasterizeFwd_1 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'RasterizeFwd_1':
  File "samples/tensorflow/cube.py", line 201, in <module>
    main()
  File "samples/tensorflow/cube.py", line 192, in main
    fit_cube(max_iter=5000, resolution=resolution, discontinuous=discontinuous, log_interval=10, display_interval=display_interval, out_dir=out_dir, log_fn='log.txt', imgsave_interval=1000, imgsavefn='img%06d.png')
  File "samples/tensorflow/cube.py", line 69, in fit_cube
    rast_outopt, = dr.rasterize(pos_clip_opt, pos_idx, resolution=[resolution, resolution], output_db=False)
  File "/app/samples/tensorflow/../../nvdiffrast/tensorflow/ops.py", line 108, in rasterize
    return func(pos)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/custom_gradient.py", line 168, in decorated
    return _graph_mode_decorator(f, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/custom_gradient.py", line 230, in _graph_mode_decorator
    result, grad_fn = f(*args)
  File "/app/samples/tensorflow/../../nvdiffrast/tensorflow/ops.py", line 97, in func
    out, out_db = _get_plugin().rasterize_fwd(pos, tri, resolution, ranges, 0, tri_const)
  File "<string>", line 92, in rasterize_fwd
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
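
For anyone reproducing this, it may help to recall that CUDA_VISIBLE_DEVICES renumbers devices: the CUDA runtime inside the process always counts from 0, whichever physical GPUs were selected. Below is a minimal Python sketch of that renumbering. The helper `visible_device_map` is hypothetical (it is not part of nvdiffrast or TensorFlow), the GPU count of 4 is an assumed example value, and the model is simplified to integer indices only (no GPU UUIDs).

```python
import os


def visible_device_map(visible, physical_count):
    """Return {logical_index: physical_index} for a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper, simplified model: integer indices only.
    `visible` is the raw environment value, or None if the variable is unset.
    """
    if visible is None:
        # Variable unset: every physical GPU is visible under its own index.
        return {i: i for i in range(physical_count)}
    mapping = {}
    for token in visible.split(","):
        token = token.strip()
        # The runtime stops at the first invalid entry; later ones are ignored.
        if not token.isdigit() or int(token) >= physical_count:
            break
        # Logical indices are assigned in the order the entries are listed.
        mapping[len(mapping)] = int(token)
    return mapping


if __name__ == "__main__":
    # Assumed machine with 4 physical GPUs (example value only).
    value = os.environ.get("CUDA_VISIBLE_DEVICES")
    print(visible_device_map(value, physical_count=4))
```

With CUDA_VISIBLE_DEVICES=1, the sketch maps logical device 0 to physical GPU 1. The failing call, cudaGraphicsGLRegisterBuffer, registers an OpenGL buffer with CUDA, so one plausible reading (an assumption, not confirmed in this thread) is that the OpenGL context and the CUDA device ended up on different physical GPUs.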

hiyyg commented 3 years ago

Same as #11

s-laine commented 3 years ago

I just finished implementing multi-GPU support last week, and I'll push the changes to the repository later today or tomorrow.

s-laine commented 3 years ago

Multi-GPU support is now in the repo. Closing this. Please open a new issue if you experience problems with the new version!

yataoz commented 3 years ago

Thanks @hiyyg and @s-laine for your help! The new version with multi-GPU support is now working perfectly.