NVlabs / tiny-cuda-nn

Lightning fast C++/CUDA neural network framework

Illegal memory access when composing HashGrid and another encoding #98

Closed smontode24 closed 2 years ago

smontode24 commented 2 years ago

Hi,

I am getting an illegal memory access when composing a HashGrid encoding with any other encoding (see the example below). The error only occurs when a HashGrid is composed with another encoding: using the HashGrid alone, or composing a TriangleWave with a OneBlob encoding, works fine. I am using Python and installed the package through pip (pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch).

The following code raises an illegal memory access:

import torch
import tinycudann as tcnn

config={
    "loss": {
        "otype": "RelativeL2Luminance"
    },
    "optimizer": {
        "otype": "Ema",
        "decay": 0.99,
        "sensitivity": 0.1,
        "nested": {
            "otype": "ExponentialDecay",
            "decay_start": 4000,
            "decay_interval": 4000,
            "decay_base": 0.33,
            "nested": {
                "otype": "Adam",
                "learning_rate": 1e-2,
                "beta1": 0.9,
                "beta2": 0.99,
                "epsilon": 1e-15,
                "l2_reg": 1e-6
            }
        }
    },
    "encoding": {
        "otype": "Composite",
        "nested": [
            {
                "n_dims_to_encode": 3, # Position
                "otype": "HashGrid",
                "per_level_scale": 2.0,
                "log2_hashmap_size": 15,
                "base_resolution": 16,
                "n_levels": 16
            },
            {
                "n_dims_to_encode": 5, # Interesting conditionals
                "otype": "OneBlob",
                "n_bins": 4
            },
            {
                "n_dims_to_encode": 6, # Linear conditionals that should be identity encoded
                "otype": "Identity"
            }
        ]
    },
    "network": {
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,
        "n_hidden_layers": 2
    }
}

model = tcnn.NetworkWithInputEncoding(14, 4, config["encoding"], config["network"])
model(torch.randn(32, 14).cuda())

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sergio/miniconda3/envs/tcnn/lib/python3.9/site-packages/torch/_tensor.py", line 249, in __repr__
    return torch._tensor_str._str(self)
  File "/home/sergio/miniconda3/envs/tcnn/lib/python3.9/site-packages/torch/_tensor_str.py", line 415, in _str
    return _str_intern(self)
  File "/home/sergio/miniconda3/envs/tcnn/lib/python3.9/site-packages/torch/_tensor_str.py", line 390, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/sergio/miniconda3/envs/tcnn/lib/python3.9/site-packages/torch/_tensor_str.py", line 251, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/sergio/miniconda3/envs/tcnn/lib/python3.9/site-packages/torch/_tensor_str.py", line 90, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
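As the error message suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python traceback points at the call that actually failed rather than a later API call. A minimal sketch (the variable must be set before CUDA is initialized, i.e. before the first import of torch in the process):

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialized,
# i.e. before the first `import torch` in this process; setting it
# afterwards has no effect on the already-created CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Alternatively, set it on the command line: CUDA_LAUNCH_BLOCKING=1 python repro.py.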

Additional information:

There is one closed issue related to this, #57, but since it was closed without a resolution, I am opening this issue to find out what the problem is. Do you have any idea what could be causing this, or can you think of a workaround?

Thanks in advance!
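For reference, the 14 input dimensions in the snippet are the sum of the three n_dims_to_encode fields (3 + 5 + 6), and the encoded width fed to the MLP can be worked out per nested encoding. A rough sketch, assuming tcnn's default of n_features_per_level = 2 for the HashGrid (that field is not set in the config above):

```python
# Per-encoding input widths, taken from the Composite config above
n_dims_to_encode = [3, 5, 6]      # HashGrid, OneBlob, Identity
total_in = sum(n_dims_to_encode)  # input width of the model

# Per-encoding output widths (n_features_per_level = 2 is an assumed
# tcnn default, not stated explicitly in the config above)
out_hashgrid = 16 * 2             # n_levels * n_features_per_level
out_oneblob  = 5 * 4              # n_dims_to_encode * n_bins
out_identity = 6                  # passed through unchanged
total_out = out_hashgrid + out_oneblob + out_identity

print(total_in, total_out)        # 14 58
```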

smontode24 commented 2 years ago

Update: when using separate encoders it seems to work fine, but it would be nice to know how to use the Composite encoding, since the documentation says that using separate encoder and network modules performs worse than tcnn.NetworkWithInputEncoding.

The current workaround would be to do something like:

import torch
import tinycudann as tcnn

# Model definition (hash_encoder_config, oneblob_encoder_config, and
# mlp_config are the corresponding sub-dicts of the config above;
# n_output_dims is the desired network output width)
spatial_encoding = tcnn.Encoding(3, hash_encoder_config)
dir_encoding = tcnn.Encoding(3, oneblob_encoder_config)
mlp_network = tcnn.Network(spatial_encoding.n_output_dims + dir_encoding.n_output_dims,
                           n_output_dims, mlp_config)

# Forward pass: encode each input group separately, then concatenate
out_spatial = spatial_encoding(coords)
out_vdir = dir_encoding(viewing_dir)
encoding = torch.cat([out_spatial, out_vdir], dim=1)
out = mlp_network(encoding)

mikeqzy commented 2 years ago

Same problem here. I will switch to separate encoders for now, but using tcnn.NetworkWithInputEncoding should be faster, as claimed in the README.

Tom94 commented 2 years ago

Fixed on latest master via https://github.com/NVlabs/tiny-cuda-nn/commit/e421b8b2d4a4065e04bf0c724c8ba7652d716239

Many apologies for taking so long to get to this one.